1. Introduction
Modern journalism faces a variety of challenges, one of which is the wide dissemination of information. Rapid news circulation now comes primarily from the internet, and society receives vast amounts of content from multiple media sources. Moreover, individuals themselves can record and share information that may not have been verified. As a result, the journalist’s role in ensuring the accuracy of information published to the public has become increasingly important. Journalists are often referred to as “gatekeepers,” as their primary responsibility is to properly inform society (Lamprou et al., 2021). However, information from the internet frequently contains unverified elements, often referred to as “fake news.” The term itself is not new; according to the Collins Dictionary, it was declared Word of the Year in 2017 (Anderau, 2021).
Journalism has traditionally been understood as a professional practice balancing the roles of gatekeeping and advocacy (Janowitz, 1975). Within this professional framework, verification and accountability constitute core normative principles that differentiate journalism from other forms of public communication. The emergence of fake news and disinformation challenges these foundational norms by introducing systematically misleading content into the information environment. Scholarly efforts to define fake news emphasize its conceptual ambiguity and its overlap with related phenomena such as misinformation and disinformation (Tandoc et al., 2018). At the institutional level, international organizations and journalism educators increasingly frame disinformation as a structural threat to democratic communication and media trust (Ireton & Posetti, 2018). The phenomenon has also been examined within broader post-truth dynamics, in which factual accuracy competes with emotional, ideological, and identity-driven narratives. From this perspective, misinformation is not treated merely as isolated false claims, but as part of a wider epistemic environment shaped by political polarization, declining trust in institutions, fragmented media systems, and the emergence of alternative belief frameworks that challenge conventional standards of evidence (Rodríguez-Ferrándiz, 2023; Lewandowsky et al., 2017). Within such contexts, the impact of misinformation extends beyond individual misperceptions and contributes to broader transformations in public knowledge, trust, and democratic communication. Situating the present study within this theoretical lineage allows the role of AI-based verification tools to be examined not only as technical solutions, but as interventions within a long-standing journalistic struggle over truth and credibility.
Artificial intelligence (AI) also plays a significant role in people’s daily lives. Although a relatively recent innovation, AI already has applications across many domains, one of which is the detection of fake news. The importance of examining this phenomenon lies in the rapid development of AI and the widespread dissemination of false information. For this reason, it is essential to assess the effectiveness of AI chatbots in addressing an issue that directly affects the safeguarding of the public sphere.
AI chatbots may also contribute to reducing the spread of misinformation. Artificial intelligence has been increasingly applied to the detection and mitigation of misinformation across multiple communication environments. Early studies demonstrated the effectiveness of machine learning techniques in identifying misleading textual content at scale (Shrivastava et al., 2022), while more recent research has examined AI systems capable of detecting multimodal, visual, and AI-generated forms of disinformation (Lee & Shin, 2022). In parallel, applied research has explored the integration of AI-powered fact-checking tools within journalistic and platform-based workflows, highlighting both their potential and their limitations (Cantón-Correa et al., 2025). These developments underline the importance of empirically assessing how contemporary AI systems perform under real journalistic verification conditions.
The purpose of this research is to investigate the ability of AI chatbots to detect fake news. Rather than claiming absolute performance superiority of newer AI models, this study adopts a replication-and-diagnostic perspective, examining how and where AI chatbots succeed or fail in journalistic verification tasks. The study contributes to the growing literature on automated fact-checking in five ways. First, it provides the first large-scale empirical evaluation of AI chatbot-based verification within the Greek media ecosystem, extending prior work that has focused primarily on English-language contexts. Second, it offers a systematic temporal replication of previous chatbot evaluations (Caramancion, 2023) using a comparable methodological framework and a newer dataset (2025), enabling longitudinal assessment of progress in LLM-based verification. Third, the study empirically compares general-purpose AI chatbots with a task-specific customized verification system, demonstrating how structured prompting, tool integration, and alignment with professional fact-checking databases influence detection performance. Fourth, beyond overall accuracy, the analysis examines performance variation across misinformation categories and source types, revealing persistent weaknesses that remain obscured in aggregate metrics. Finally, the findings provide empirical support for hybrid fact-checking models, showing that AI systems are most effective when embedded within human-centered verification workflows rather than deployed as autonomous arbiters of truth. Accordingly, the study pursues three interrelated objectives: to evaluate the performance of contemporary AI chatbots in detecting professionally debunked non-true stories in the Greek media environment, to compare general-purpose systems with a customized fact-checking-oriented chatbot, and to examine how content and source characteristics shape automated verification outcomes.
4. Results and Analysis
The total number of debunked incidents is 930, of which 533 derive from Ellinika Hoaxes and 397 from AFP Greece (Table 1).
The dataset comprised a total of 930 claims, which were classified into distinct misinformation-related categories based on the verdicts provided by professional fact-checking organizations.
Table 2 presents the distribution of claims across categories, including absolute frequencies and relative percentages.
The largest category was misinformation (n = 243; 26.1%), followed by false claims (n = 215; 23.1%) and fake news (n = 117; 12.6%). A substantial proportion of cases involved content created with artificial intelligence (n = 81; 8.7%) and instances where thematic content was missing (n = 69; 7.4%).
Additional categories included misleading content (n = 49; 5.3%), incomplete framing (n = 48; 5.2%), and modified images (n = 31; 3.3%). Less frequent classifications consisted of conspiracy theories (n = 24; 2.6%), mixtures of factual and false information (n = 24; 2.6%), and pseudoscience (n = 12; 1.3%).
Rare categories included satire (n = 6; 0.6%), scams (n = 4; 0.4%), fear-mongering content (“dangerology”; n = 4; 0.4%), modified videos (n = 2; 0.2%), and false sayings or quotations (n = 1; 0.1%).
Furthermore, the incidents’ sources were categorized into portal/blog, newspaper, social media, and TV/radio, according to the methodology of Lamprou et al. (2021).
Table 3 presents the distribution of misinformation incidents by source type. Nearly half of the cases originated from portals and blogs (n = 448; 48.2%), followed closely by social media platforms (n = 389; 41.8%). Traditional media accounted for a considerably smaller share of incidents, with newspapers (n = 19; 2.0%) and television and radio (n = 15; 1.6%). In 6.3% of cases (n = 59), information regarding the original source was unavailable.
As displayed in Table 4, the websites were ranked using the Similarweb Top 50 traffic scale, which provides an estimate of overall website traffic based on general web rankings rather than rankings limited to entertainment or informational content. This metric was used to assess the relative visibility of websites associated with mis/disinformation incidents. Based on this ranking, 53 incidents were linked to websites classified as high-traffic sources, defined as those appearing within the Similarweb Top 50 (Similarweb, 2025). These incidents represent 11% of the total number of website-based cases analyzed (N = 482).
The final stage of the study examined the ability of chatbots to correctly assess the validity of news-related claims. The systems evaluated were ChatGPT (version 3.5), Gemini, and the Greek Fact-check Bot. For each system, responses were coded as either correct or incorrect based on their alignment with the verdicts of professional fact-checking organizations.
Table 5 presents the number of correct and incorrect responses produced by each chatbot across the full set of evaluated claims. Overall, the Greek Fact-check Bot produced the highest number of correct assessments, followed by ChatGPT (v3.5), while Gemini exhibited the lowest accuracy among the three systems. These findings highlight meaningful differences in chatbot performance when applied to journalistic fact-checking tasks.
To examine whether differences in accuracy between the three chatbot systems were statistically significant, Cochran’s Q test was applied to the paired binary outcomes. The analysis revealed statistically significant differences in accuracy across the three systems (p < 0.001). Following this result, post hoc pairwise comparisons were conducted using McNemar’s test with Holm correction to control for multiple comparisons. These analyses confirmed that the Greek Fact-check Bot performed significantly better than both Gemini and ChatGPT, and that ChatGPT also significantly outperformed Gemini. This analytical sequence provides statistical support for the performance differences reported in the descriptive results.
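This analytical sequence is straightforward to reproduce. The following minimal Python sketch illustrates it with statsmodels; the outcome matrix is simulated here (using the reported accuracy rates as sampling probabilities) and the column ordering is an assumption, so it does not reproduce the study’s coded data:

```python
import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar
from statsmodels.stats.multitest import multipletests

# Hypothetical paired binary outcomes: one row per claim, one column per
# system (1 = verdict matched the fact-checkers, 0 = it did not).
rng = np.random.default_rng(42)
systems = ["ChatGPT (v3.5)", "Gemini", "Greek Fact-check Bot"]
outcomes = rng.binomial(1, [0.739, 0.641, 0.775], size=(916, 3))

# Omnibus test: Cochran's Q on the paired binary outcomes
q = cochrans_q(outcomes)
print(f"Cochran's Q = {q.statistic:.2f}, p = {q.pvalue:.4g}")

# Post hoc pairwise McNemar tests, Holm-corrected for multiple comparisons
pairs = [(0, 1), (0, 2), (1, 2)]
pvals = []
for i, j in pairs:
    a, b = outcomes[:, i], outcomes[:, j]
    # 2x2 agreement table; McNemar's test uses the discordant cells
    table = [[np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0))],
             [np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0))]]
    pvals.append(mcnemar(table, exact=True).pvalue)

reject, p_holm, _, _ = multipletests(pvals, method="holm")
for (i, j), p, sig in zip(pairs, p_holm, reject):
    print(f"{systems[i]} vs {systems[j]}: Holm-adjusted p = {p:.4g} "
          f"({'significant' if sig else 'n.s.'})")
```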
Chatbot Performance Evaluation Results
Beyond overall accuracy, the analysis focuses on identifying systematic variation in chatbot performance across content categories and source-related characteristics, revealing structural strengths and weaknesses that are not captured by aggregate performance metrics alone. The evaluation of chatbot performance was conducted on a dataset of 916 debunked news claims for which complete responses were available from all three systems. Overall accuracy results indicate that the Greek Fact-check Bot achieved the highest performance, correctly classifying 77.5% of the evaluated claims. ChatGPT (v3.5) followed with an accuracy of 73.9%, while Gemini demonstrated the lowest overall accuracy at 64.1%. These results confirm that all examined chatbots were able to detect non-true stories to a considerable extent, although notable differences in performance were observed across systems.
As displayed in Table 6, when performance was examined across different misinformation categories, substantial variation emerged. In categories such as misinformation, fake news and false claims, all three chatbots achieved moderate to high accuracy levels, with the Greek Fact-check Bot consistently ranking among the highest-performing systems. In the category of misleading content, the Greek Fact-check Bot demonstrated notably higher accuracy compared to ChatGPT and Gemini. Detection accuracy for incomplete framing was relatively high for ChatGPT, moderate for Gemini, and lower for the Greek Fact-check Bot.
Across all systems, the lowest accuracy rates were recorded in the detection of AI-generated content. ChatGPT and Gemini showed particularly limited success in this category, whereas the Greek Fact-check Bot achieved substantially higher accuracy, though still below perfect classification. In categories involving manipulated visual material, such as modified images and modified videos, the Greek Fact-check Bot achieved the highest accuracy, including perfect classification in modified video cases.
Performance in categories with smaller numbers of incidents, such as satire, scams, fear-based misinformation, and false quotes, varied considerably. In these categories, the Greek Fact-check Bot generally achieved higher accuracy than the general-purpose chatbots, while ChatGPT and Gemini displayed lower and more inconsistent results. Despite category-level variability, the relative performance ranking of the three systems remained largely consistent across categories, with the Greek Fact-check Bot outperforming ChatGPT, and ChatGPT outperforming Gemini in most cases.
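As an illustration of how such a category-level breakdown can be computed, the following sketch assumes a hypothetical long-format dataset with one row per (claim, system) evaluation; the column names and records are placeholders, not the study’s data:

```python
import pandas as pd

# Hypothetical records; "correct" is 1 when the chatbot's verdict matched
# the professional fact-checkers and 0 otherwise.
df = pd.DataFrame({
    "category": ["misinformation", "misinformation", "fake news",
                 "AI-generated content", "AI-generated content", "satire"],
    "system":   ["ChatGPT (v3.5)", "Gemini", "Greek Fact-check Bot",
                 "ChatGPT (v3.5)", "Greek Fact-check Bot", "Gemini"],
    "correct":  [1, 0, 1, 0, 1, 0],
})

# Accuracy and case counts per misinformation category and system
per_cat = (df.groupby(["category", "system"])["correct"]
             .agg(accuracy="mean", n="size")
             .reset_index())

# Wide layout (categories as rows, systems as columns), as in Table 6
print(per_cat.pivot(index="category", columns="system", values="accuracy"))
```

Reporting the case count n alongside each accuracy value is what makes the small-category caveat above visible in the table itself.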
Finally, as displayed in Table 7, analysis of cases originating from high-traffic websites, as identified through the Similarweb Top 50 ranking, showed that chatbot performance patterns remained comparable to those observed in the overall dataset. The Greek Fact-check Bot again demonstrated the highest accuracy, followed by ChatGPT and Gemini. These findings indicate that misinformation detection challenges persist regardless of the visibility or popularity of the source.
Table 8 presents the classification performance of the three chatbot systems across different source categories, including portals/blogs, social media, newspapers, and television or radio. Across all source types, the Greek Fact-check Bot achieved the highest accuracy, followed by ChatGPT (v3.5), while Gemini consistently demonstrated lower performance. This ranking was observed for both digital-native sources and traditional media outlets.
Accuracy was highest for content originating from social media and broadcast media, while lower performance was observed for portals/blogs and newspapers. These results indicate that chatbot effectiveness varies by source category and suggest that the media origin of content is associated with differences in automated detection performance.
5. Discussion
Although improvements in AI chatbot performance over time are expected given the rapid development of large language models, the purpose of the present study is not to demonstrate progress in isolation, but to examine how such progress manifests across different verification contexts. Rather than treating increased accuracy as a primary contribution, the study uses performance differences as an analytical lens to identify structural strengths and limitations of chatbot-based verification systems in real-world journalistic environments. The findings show that performance gains are uneven and strongly dependent on content characteristics, source type, and system design. While accuracy improves in well-structured, text-based claims, persistent weaknesses remain in categories involving AI-generated content, manipulated visuals, and context-dependent misinformation. For example, in several cases involving AI-generated or visually manipulated content, chatbots either failed to recognize synthetic elements or relied primarily on surface-level textual cues without detecting underlying distortions. One case involved a viral image of a burned Oscar statuette circulated in connection with the California wildfires, which was later verified as an AI-generated image despite being presented as authentic visual evidence (Ellinika Hoaxes, 2025b). In another instance, an image showing a protester dressed as Pikachu during demonstrations in Turkey was also found to be AI-generated, although it had been widely shared as genuine footage (Ellinika Hoaxes, 2025c). Such cases illustrate persistent difficulties in handling multimodal, context-dependent, or technically complex forms of misinformation, which require deeper verification routines beyond surface-level textual analysis.
This pattern indicates that technological progress does not uniformly translate into verification reliability and highlights the importance of evaluating AI systems beyond aggregate performance metrics. By demonstrating where improvements occur and where limitations persist, the study contributes explanatory insight rather than incremental benchmarking alone. In this sense, the expected nature of overall improvement strengthens, rather than weakens, the contribution of the study, as it allows for systematic analysis of the conditions under which AI-based verification succeeds or fails.
The findings related to RQ1 demonstrate that AI chatbot systems differ meaningfully in their ability to detect non-true stories when evaluated against professionally debunked claims under identical experimental conditions. All three examined systems—ChatGPT (v3.5), Gemini, and the Greek Fact-check Bot—correctly classified a substantial proportion of professionally debunked claims, confirming that contemporary AI systems have reached a level of maturity that allows them to meaningfully support journalistic verification processes. At the same time, the presence of consistent misclassifications across all systems indicates that chatbots are not yet capable of fully autonomous verification and require human oversight.
As depicted in Figure 1, clear differences were observed between the examined systems, with the Greek Fact-check Bot achieving the highest accuracy, followed by ChatGPT (v3.5), while Gemini exhibited the lowest performance. These differences indicate that task-specific customization and workflow design play a decisive role in improving verification outcomes. The superior performance of the Greek Fact-check Bot should not be interpreted as a claim of practical superiority based on marginal accuracy gains, but rather as evidence that structured prompting, system integration, and alignment with professional fact-checking databases can produce measurable improvements even when the underlying language model is comparable. This finding highlights the importance of system design choices in hybrid fact-checking environments rather than model-level optimization alone.
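To make the notion of structured prompting concrete, a minimal illustrative sketch follows; the schema, wording, and retrieval helper are hypothetical assumptions and do not reproduce the Greek Fact-check Bot’s actual configuration:

```python
# Hypothetical sketch of a task-specific verification prompt. The template
# fields and build_prompt() helper are illustrative assumptions only.
VERIFICATION_PROMPT = """\
You are a fact-checking assistant for Greek-language news claims.
Claim: {claim}
Known debunks from professional fact-checkers: {evidence}

Answer in this exact structure:
Verdict: one of [true, false, misleading, unverifiable]
Category: e.g. fake news, AI-generated content, modified image
Justification: 2-3 sentences citing the evidence above.
If the evidence is insufficient, answer "unverifiable" rather than guessing.
"""

def build_prompt(claim: str, evidence: list[str]) -> str:
    """Fill the structured template with a claim and retrieved debunks."""
    joined = "\n".join(f"- {e}" for e in evidence) or "- none found"
    return VERIFICATION_PROMPT.format(claim=claim, evidence=joined)

# Usage: evidence would come from a fact-checking database lookup
# (e.g. Ellinika Hoaxes or AFP Greece archives) before the model is called.
print(build_prompt(
    "A burned Oscar statuette was photographed after the California wildfires.",
    ["Ellinika Hoaxes (2025): the image is AI-generated, not authentic."],
))
```

Under these assumptions, constraining the output schema and grounding the model in retrieved debunks, rather than relying on free-form generation, is one plausible mechanism behind the advantage of task-specific designs.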
With respect to RQ2, the analysis revealed that chatbot performance varies substantially across different categories of non-true stories. ChatGPT showed stronger performance in narrative-based and contextual categories, such as misinformation and incomplete framing, but struggled with technically complex categories, particularly AI-generated content. Gemini demonstrated relatively higher accuracy in narrowly defined categories, including conspiracy theories and pseudoscience, while underperforming in several core journalistic categories, such as misleading content, satire, and scams. Across all systems, categories with limited numbers of cases should be interpreted cautiously; nevertheless, the findings clearly indicate that the nature of the misinformation strongly influences chatbot effectiveness. These patterns are consistent with prior research showing that automated fact-checking performance is highly contingent on content type, contextual complexity, and system design (Nakov et al., 2021; Makhortykh et al., 2024), reinforcing the need for diagnostic evaluation approaches alongside aggregate accuracy metrics.
Regarding RQ3, comparison with previous empirical research, particularly Caramancion (2023), as presented in Figure 2, indicates a general improvement in chatbot performance over time. Accuracy levels observed in the present study are higher than those reported in earlier evaluations, suggesting ongoing technological progress in AI-based fact-checking. At the same time, the comparison confirms that even improved systems continue to exhibit systematic weaknesses, reinforcing the need for cautious and critical deployment in journalistic contexts. Cross-study comparisons should be interpreted cautiously, as differences in language, dataset composition, temporal context, and prompt design can substantially influence chatbot performance independently of underlying model architecture. Accordingly, the higher accuracy observed for ChatGPT (v3.5) in the present study should not be interpreted as evidence of superiority over earlier GPT-4 evaluations, but rather as a context-specific outcome of methodological and data-related factors.
The findings related to RQ4 indicate that chatbot detection performance is shaped more strongly by media source category than by source visibility. Non-true stories originating from high-visibility sources are detected with comparable effectiveness to those from less prominent outlets, suggesting that automated verification systems do not inherently privilege content based on its public exposure. In contrast, higher detection accuracy is observed for content originating from traditional media sources, such as newspapers and broadcast outlets, compared to content disseminated through portals/blogs and social media platforms. This pattern likely reflects differences in content structure, linguistic formalization, and contextual framing across media environments, which may facilitate or hinder automated assessment. Overall, these findings underscore the importance of incorporating source characteristics into the design and evaluation of AI-assisted fact-checking systems.
These findings suggest that, despite measurable improvements in automated detection, AI chatbots remain limited in their ability to replicate human capacities such as contextual judgment, ethical reasoning, and critical interpretation. These limitations should be understood as interpretative implications of the observed error patterns rather than as directly measured deficits. Consequently, the results reinforce the importance of hybrid fact-checking models in which AI systems operate as assistive tools within human-centered verification workflows. Frameworks such as Veri|Fusion (Lamprou & Antonopoulos, 2023) exemplify this approach by integrating automated detection, crowdsourced input, and professional human oversight, with humans retaining the final decision-making role.
6. Conclusions
Across all evaluated dimensions, AI chatbots exhibited meaningful levels of accuracy, indicating that contemporary large language models are capable of supporting journalistic verification tasks. Compared to earlier empirical studies, such as Caramancion (2023), the performance of ChatGPT and Gemini suggests incremental progress in automated detection capabilities. The enhanced performance of the customized Greek Fact-check Bot further highlights the benefits of task-specific design, structured prompting, and alignment with professional fact-checking databases. These findings suggest that specialization and contextual adaptation significantly enhance the effectiveness of AI-based verification tools.
The findings further show that while source visibility does not substantially affect detection accuracy, the category of the media source plays a more significant role, with lower performance observed for content disseminated through portals, blogs, and social media compared to traditional media formats. More broadly, the observed limitations indicate that AI chatbots lack essential journalistic competencies, such as editorial judgment, ethical reasoning, and contextual interpretation. Consequently, AI-based systems should be understood as supportive tools within human-centered verification processes rather than as autonomous substitutes for professional fact-checking. These findings underscore the importance of adopting hybrid fact-checking models that combine automated systems with human expertise. Rather than replacing journalists or professional fact-checkers, AI chatbots should be positioned as assistive technologies that enhance efficiency, scalability, and preliminary filtering. In such models, AI systems can support the identification of potentially misleading content, facilitate evidence retrieval, and assist in categorizing claims, while humans retain the primary and final authority over verification decisions.
From a broader perspective, the findings carry important implications for journalism and media organizations. As misinformation continues to evolve, particularly through the use of generative AI, newsrooms and fact-checking organizations must invest not only in advanced technological tools but also in institutional frameworks that safeguard editorial responsibility. Hybrid systems offer a pragmatic pathway forward, enabling media professionals to leverage AI innovations while preserving the normative role of journalism in protecting the public sphere.
In conclusion, while AI chatbots have made, as expected, notable progress in detecting non-true stories, they cannot yet function as independent fact-checkers. The future of effective verification lies in human-centered, hybrid models, where artificial intelligence supports, but does not supplant, the critical role of human judgment. Furthermore, researchers insist that there are legal reasons why humans need to be kept in the loop for content moderation. According to a significant study funded by the European Science-Media Hub, limiting the automated execution of decisions on AI-discovered problems is essential to safeguarding human agency and natural justice: the right to appeal. This does not prevent the suspension of bot accounts at scale, but it ensures the correct auditing of the system processes deployed (Marsden & Meyer, 2019; Kertysova, 2018). Such an approach ensures both technological efficiency and democratic accountability in the ongoing effort to combat non-true stories.
Overall, the findings indicate that the effectiveness of AI-assisted verification depends less on raw model capability and more on system design, workflow integration, and contextual adaptation. The comparative results suggest that task-specific systems aligned with professional fact-checking practices can outperform general-purpose conversational models in real-world verification scenarios. These insights support the development of hybrid verification environments in which automated tools function as support systems within human-centered editorial processes, rather than as autonomous substitutes for professional fact-checking.