1. Introduction
Modern journalism faces a variety of challenges, one of which is the wide dissemination of information. Rapid news circulation now comes primarily from the internet, and society receives vast amounts of content from multiple media sources. Moreover, individuals themselves can record and share information that may not have been verified. As a result, the journalist’s role in ensuring the accuracy of information published to the public has become increasingly important. Journalists are often referred to as “gatekeepers,” as their primary responsibility is to properly inform society (Lamprou et al., 2021). However, information from the internet frequently contains unverified elements, often referred to as “fake news.” The term itself is not new; according to the Collins Dictionary, it was declared Word of the Year in 2017 (Anderau, 2021).
Journalism has traditionally been understood as a professional practice balancing the roles of gatekeeping and advocacy (Janowitz, 1975). Within this professional framework, verification and accountability constitute core normative principles that differentiate journalism from other forms of public communication. The emergence of fake news and disinformation challenges these foundational norms by introducing systematically misleading content into the information environment. Scholarly efforts to define fake news emphasize its conceptual ambiguity and its overlap with related phenomena such as misinformation and disinformation (Tandoc et al., 2018). At the institutional level, international organizations and journalism educators increasingly frame disinformation as a structural threat to democratic communication and media trust (Ireton & Posetti, 2018). The phenomenon has also been examined within broader post-truth dynamics, in which factual accuracy competes with emotional, ideological, and identity-driven narratives. From this perspective, misinformation is not treated merely as isolated false claims, but as part of a wider epistemic environment shaped by political polarization, declining trust in institutions, fragmented media systems, and the emergence of alternative belief frameworks that challenge conventional standards of evidence (Rodríguez-Ferrándiz, 2023; Lewandowsky et al., 2017). Within such contexts, the impact of misinformation extends beyond individual misperceptions and contributes to broader transformations in public knowledge, trust, and democratic communication. Situating the present study within this theoretical lineage allows the role of AI-based verification tools to be examined not only as technical solutions, but as interventions within a long-standing journalistic struggle over truth and credibility.
Artificial intelligence (AI) also plays a significant role in people’s daily lives. Although a relatively recent innovation, AI already has applications across many domains, one of which is the detection of fake news. The importance of examining this phenomenon lies in the rapid development of AI and the widespread dissemination of false information. For this reason, it is essential to assess the effectiveness of AI chatbots in addressing an issue that directly affects the safeguarding of the public sphere.
AI chatbots may also contribute to reducing the spread of misinformation. Artificial intelligence has been increasingly applied to the detection and mitigation of misinformation across multiple communication environments. Early studies demonstrated the effectiveness of machine learning techniques in identifying misleading textual content at scale (Shrivastava et al., 2022), while more recent research has examined AI systems capable of detecting multimodal, visual, and AI-generated forms of disinformation (Lee & Shin, 2022). In parallel, applied research has explored the integration of AI-powered fact-checking tools within journalistic and platform-based workflows, highlighting both their potential and their limitations (Cantón-Correa et al., 2025). These developments underline the importance of empirically assessing how contemporary AI systems perform under real journalistic verification conditions.
The purpose of this research is to investigate the ability of AI chatbots to detect fake news. Rather than claiming absolute performance superiority of newer AI models, this study adopts a replication-and-diagnostic perspective, examining how and where AI chatbots succeed or fail in journalistic verification tasks. The study contributes to the growing literature on automated fact-checking in five ways. First, it provides the first large-scale empirical evaluation of AI chatbot-based verification within the Greek media ecosystem, extending prior work that has focused primarily on English-language contexts. Second, it offers a systematic temporal replication of previous chatbot evaluations (Caramancion, 2023) using a comparable methodological framework and a newer dataset (2025), enabling longitudinal assessment of progress in LLM-based verification. Third, the study empirically compares general-purpose AI chatbots with a task-specific customized verification system, demonstrating how structured prompting, tool integration, and alignment with professional fact-checking databases influence detection performance. Fourth, beyond overall accuracy, the analysis examines performance variation across misinformation categories and source types, revealing persistent weaknesses that remain obscured in aggregate metrics. Finally, the findings provide empirical support for hybrid fact-checking models, showing that AI systems are most effective when embedded within human-centered verification workflows rather than deployed as autonomous arbiters of truth. Accordingly, the study pursues three interrelated objectives: to evaluate the performance of contemporary AI chatbots in detecting professionally debunked non-true stories in the Greek media environment, to compare general-purpose systems with a customized fact-checking-oriented chatbot, and to examine how content and source characteristics shape automated verification outcomes.
4. Results and Analysis
The total number of debunked incidents is 930, of which 533 derive from Ellinika Hoaxes and 397 from AFP Greece (Table 1).
The dataset comprised a total of 930 claims, which were classified into distinct misinformation-related categories based on the verdicts provided by professional fact-checking organizations.
Table 2 presents the distribution of claims across categories, including absolute frequencies and relative percentages.
The largest category was misinformation (n = 243; 26.1%), followed by false claims (n = 215; 23.1%) and fake news (n = 117; 12.6%). A substantial proportion of cases involved content created with artificial intelligence (n = 81; 8.7%) and instances where thematic content was missing (n = 69; 7.4%).
Additional categories included misleading content (n = 49; 5.3%), incomplete framing (n = 48; 5.2%), and modified images (n = 31; 3.3%). Less frequent classifications consisted of conspiracy theories (n = 24; 2.6%), mixtures of factual and false information (n = 24; 2.6%), and pseudoscience (n = 12; 1.3%).
Rare categories included satire (n = 6; 0.6%), scams (n = 4; 0.4%), fear-mongering content (“dangerology”; n = 4; 0.4%), modified videos (n = 2; 0.2%), and false sayings or quotations (n = 1; 0.1%).
Furthermore, the incidents’ sources were categorized into portal/blog, newspaper, social media, and TV/radio, according to the methodology of Lamprou et al. (2021).
Table 3 presents the distribution of misinformation incidents by source type. Nearly half of the cases originated from portals and blogs (n = 448; 48.2%), followed closely by social media platforms (n = 389; 41.8%). Traditional media accounted for a considerably smaller share of incidents, with newspapers (n = 19; 2.0%) and television and radio (n = 15; 1.6%). In 6.3% of cases (n = 59), information regarding the original source was unavailable.
As displayed in Table 4, the websites were ranked using the Similarweb Top 50 traffic scale, which provides an estimate of overall website traffic based on general web rankings rather than rankings limited to entertainment or informational content. This metric was used to assess the relative visibility of websites associated with mis/disinformation incidents. Based on this ranking, 53 incidents were linked to websites classified as high-traffic sources, defined as those appearing within the Similarweb Top 50 (Similarweb, 2025). These incidents represent 11% of the total number of website-based cases analyzed (N = 482).
The final stage of the study examined the ability of chatbots to correctly assess the validity of news-related claims. The systems evaluated were ChatGPT (version 3.5), Gemini, and the Greek Fact-check Bot. For each system, responses were coded as either correct or incorrect based on their alignment with the verdicts of professional fact-checking organizations.
Table 5 presents the number of correct and incorrect responses produced by each chatbot across the full set of evaluated claims. Overall, the Greek Fact-check Bot produced the highest number of correct assessments, followed by ChatGPT (v3.5), while Gemini exhibited the lowest accuracy among the three systems. These findings highlight meaningful differences in chatbot performance when applied to journalistic fact-checking tasks.
To examine whether differences in accuracy between the three chatbot systems were statistically significant, Cochran’s Q test was applied to the paired binary outcomes. The analysis revealed statistically significant differences in accuracy across the three systems (p < 0.001). Following this result, post hoc pairwise comparisons were conducted using McNemar’s test with Holm correction to control for multiple comparisons. These analyses confirmed that the Greek Fact-check Bot performed significantly better than both Gemini and ChatGPT, and that ChatGPT also significantly outperformed Gemini. This analytical sequence provides statistical support for the performance differences reported in the descriptive results.
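This analytical sequence is straightforward to reproduce. The following minimal Python sketch illustrates it with statsmodels; the outcome matrix is simulated here (using the reported accuracy rates as sampling probabilities) and the column ordering is an assumption, so it does not reproduce the study’s coded data:

```python
import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar
from statsmodels.stats.multitest import multipletests

# Hypothetical paired binary outcomes: one row per claim, one column per
# system (1 = verdict matched the fact-checkers, 0 = it did not).
rng = np.random.default_rng(42)
systems = ["ChatGPT (v3.5)", "Gemini", "Greek Fact-check Bot"]
outcomes = rng.binomial(1, [0.739, 0.641, 0.775], size=(916, 3))

# Omnibus test: Cochran's Q on the paired binary outcomes
q = cochrans_q(outcomes)
print(f"Cochran's Q = {q.statistic:.2f}, p = {q.pvalue:.4g}")

# Post hoc pairwise McNemar tests, Holm-corrected for multiple comparisons
pairs = [(0, 1), (0, 2), (1, 2)]
pvals = []
for i, j in pairs:
    a, b = outcomes[:, i], outcomes[:, j]
    # 2x2 agreement table; McNemar's test uses the discordant cells
    table = [[np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0))],
             [np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0))]]
    pvals.append(mcnemar(table, exact=True).pvalue)

reject, p_holm, _, _ = multipletests(pvals, method="holm")
for (i, j), p, sig in zip(pairs, p_holm, reject):
    print(f"{systems[i]} vs {systems[j]}: Holm-adjusted p = {p:.4g} "
          f"({'significant' if sig else 'n.s.'})")
```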
Chatbot Performance Evaluation Results
Beyond overall accuracy, the analysis focuses on identifying systematic variation in chatbot performance across content categories and source-related characteristics, revealing structural strengths and weaknesses that are not captured by aggregate performance metrics alone. The evaluation of chatbot performance was conducted on a dataset of 916 debunked news claims for which complete responses were available from all three systems. Overall accuracy results indicate that the Greek Fact-check Bot achieved the highest performance, correctly classifying 77.5% of the evaluated claims. ChatGPT (v3.5) followed with an accuracy of 73.9%, while Gemini demonstrated the lowest overall accuracy at 64.1%. These results confirm that all examined chatbots were able to detect non-true stories to a considerable extent, although notable differences in performance were observed across systems.
As displayed in Table 6, when performance was examined across different misinformation categories, substantial variation emerged. In categories such as misinformation, fake news and false claims, all three chatbots achieved moderate to high accuracy levels, with the Greek Fact-check Bot consistently ranking among the highest-performing systems. In the category of misleading content, the Greek Fact-check Bot demonstrated notably higher accuracy compared to ChatGPT and Gemini. Detection accuracy for incomplete framing was relatively high for ChatGPT, moderate for Gemini, and lower for the Greek Fact-check Bot.
Across all systems, the lowest accuracy rates were recorded in the detection of AI-generated content. ChatGPT and Gemini showed particularly limited success in this category, whereas the Greek Fact-check Bot achieved substantially higher accuracy, though still below perfect classification. In categories involving manipulated visual material, such as modified images and modified videos, the Greek Fact-check Bot achieved the highest accuracy, including perfect classification in modified video cases.
Performance in categories with smaller numbers of incidents, such as satire, scams, fear-based misinformation, and false quotes, varied considerably. In these categories, the Greek Fact-check Bot generally achieved higher accuracy than the general-purpose chatbots, while ChatGPT and Gemini displayed lower and more inconsistent results. Despite category-level variability, the relative performance ranking of the three systems remained largely consistent across categories, with the Greek Fact-check Bot outperforming ChatGPT, and ChatGPT outperforming Gemini in most cases.
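As an illustration of how such a category-level breakdown can be computed, the following sketch assumes a hypothetical long-format dataset with one row per (claim, system) evaluation; the column names and records are placeholders, not the study’s data:

```python
import pandas as pd

# Hypothetical records; "correct" is 1 when the chatbot's verdict matched
# the professional fact-checkers and 0 otherwise.
df = pd.DataFrame({
    "category": ["misinformation", "misinformation", "fake news",
                 "AI-generated content", "AI-generated content", "satire"],
    "system":   ["ChatGPT (v3.5)", "Gemini", "Greek Fact-check Bot",
                 "ChatGPT (v3.5)", "Greek Fact-check Bot", "Gemini"],
    "correct":  [1, 0, 1, 0, 1, 0],
})

# Accuracy and case counts per misinformation category and system
per_cat = (df.groupby(["category", "system"])["correct"]
             .agg(accuracy="mean", n="size")
             .reset_index())

# Wide layout (categories as rows, systems as columns), as in Table 6
print(per_cat.pivot(index="category", columns="system", values="accuracy"))
```

Reporting the case count n alongside each accuracy value is what makes the small-category caveat above visible in the table itself.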
Finally, as displayed in Table 7, analysis of cases originating from high-traffic websites, as identified through the Similarweb Top 50 ranking, showed that chatbot performance patterns remained comparable to those observed in the overall dataset. The Greek Fact-check Bot again demonstrated the highest accuracy, followed by ChatGPT and Gemini. These findings indicate that misinformation detection challenges persist regardless of the visibility or popularity of the source.
Table 8 presents the classification performance of the three chatbot systems across different source categories, including portals/blogs, social media, newspapers, and television or radio. Across all source types, the Greek Fact-check Bot achieved the highest accuracy, followed by ChatGPT (v3.5), while Gemini consistently demonstrated lower performance. This ranking was observed for both digital-native sources and traditional media outlets.
Accuracy was highest for content originating from social media and broadcast media, while lower performance was observed for portals/blogs and newspapers. These results indicate that chatbot effectiveness varies by source category and suggest that the media origin of content is associated with differences in automated detection performance.
5. Discussion
Although improvements in AI chatbot performance over time are expected given the rapid development of large language models, the purpose of the present study is not to demonstrate progress in isolation, but to examine how such progress manifests across different verification contexts. Rather than treating increased accuracy as a primary contribution, the study uses performance differences as an analytical lens to identify structural strengths and limitations of chatbot-based verification systems in real-world journalistic environments. The findings show that performance gains are uneven and strongly dependent on content characteristics, source type, and system design. While accuracy improves in well-structured, text-based claims, persistent weaknesses remain in categories involving AI-generated content, manipulated visuals, and context-dependent misinformation. For example, in several cases involving AI-generated or visually manipulated content, chatbots either failed to recognize synthetic elements or relied primarily on surface-level textual cues without detecting underlying distortions. One case involved a viral image of a burned Oscar statuette circulated in connection with the California wildfires, which was later verified as an AI-generated image despite being presented as authentic visual evidence (Ellinika Hoaxes, 2025b). In another instance, an image showing a protester dressed as Pikachu during demonstrations in Turkey was also found to be AI-generated, although it had been widely shared as genuine footage (Ellinika Hoaxes, 2025c). Such cases illustrate persistent difficulties in handling multimodal, context-dependent, or technically complex forms of misinformation, which require deeper verification routines beyond surface-level textual analysis.
This pattern indicates that technological progress does not uniformly translate into verification reliability and highlights the importance of evaluating AI systems beyond aggregate performance metrics. By demonstrating where improvements occur and where limitations persist, the study contributes explanatory insight rather than incremental benchmarking alone. In this sense, the expected nature of overall improvement strengthens, rather than weakens, the contribution of the study, as it allows for systematic analysis of the conditions under which AI-based verification succeeds or fails.
The findings related to RQ1 demonstrate that AI chatbot systems differ meaningfully in their ability to detect non-true stories when evaluated against professionally debunked claims under identical experimental conditions. All three examined systems—ChatGPT (v3.5), Gemini, and the Greek Fact-check Bot—correctly classified a substantial proportion of professionally debunked claims, confirming that contemporary AI systems have reached a level of maturity that allows them to meaningfully support journalistic verification processes. At the same time, the presence of consistent misclassifications across all systems indicates that chatbots are not yet capable of fully autonomous verification and require human oversight.
As depicted in Figure 1, clear differences were observed between the examined systems, with the Greek Fact-check Bot achieving the highest accuracy, followed by ChatGPT (v3.5), while Gemini exhibited the lowest performance. These differences indicate that task-specific customization and workflow design play a decisive role in improving verification outcomes. The superior performance of the Greek Fact-check Bot should not be interpreted as a claim of practical superiority based on marginal accuracy gains, but rather as evidence that structured prompting, system integration, and alignment with professional fact-checking databases can produce measurable improvements even when the underlying language model is comparable. This finding highlights the importance of system design choices in hybrid fact-checking environments rather than model-level optimization alone.
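To make the notion of structured prompting concrete, a minimal illustrative sketch follows; the schema, wording, and retrieval helper are hypothetical assumptions and do not reproduce the Greek Fact-check Bot’s actual configuration:

```python
# Hypothetical sketch of a task-specific verification prompt. The template
# fields and build_prompt() helper are illustrative assumptions only.
VERIFICATION_PROMPT = """\
You are a fact-checking assistant for Greek-language news claims.
Claim: {claim}
Known debunks from professional fact-checkers: {evidence}

Answer in this exact structure:
Verdict: one of [true, false, misleading, unverifiable]
Category: e.g. fake news, AI-generated content, modified image
Justification: 2-3 sentences citing the evidence above.
If the evidence is insufficient, answer "unverifiable" rather than guessing.
"""

def build_prompt(claim: str, evidence: list[str]) -> str:
    """Fill the structured template with a claim and retrieved debunks."""
    joined = "\n".join(f"- {e}" for e in evidence) or "- none found"
    return VERIFICATION_PROMPT.format(claim=claim, evidence=joined)

# Usage: evidence would come from a fact-checking database lookup
# (e.g. Ellinika Hoaxes or AFP Greece archives) before the model is called.
print(build_prompt(
    "A burned Oscar statuette was photographed after the California wildfires.",
    ["Ellinika Hoaxes (2025): the image is AI-generated, not authentic."],
))
```

Under these assumptions, constraining the output schema and grounding the model in retrieved debunks, rather than relying on free-form generation, is one plausible mechanism behind the advantage of task-specific designs.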
With respect to RQ2, the analysis revealed that chatbot performance varies substantially across different categories of non-true stories. ChatGPT showed stronger performance in narrative-based and contextual categories, such as misinformation and incomplete framing, but struggled with technically complex categories, particularly AI-generated content. Gemini demonstrated relatively higher accuracy in narrowly defined categories, including conspiracy theories and pseudoscience, while underperforming in several core journalistic categories, such as misleading content, satire, and scams. Across all systems, categories with limited numbers of cases should be interpreted cautiously; nevertheless, the findings clearly indicate that the nature of the misinformation strongly influences chatbot effectiveness. These patterns are consistent with prior research showing that automated fact-checking performance is highly contingent on content type, contextual complexity, and system design (Nakov et al., 2021; Makhortykh et al., 2024), reinforcing the need for diagnostic evaluation approaches alongside aggregate accuracy metrics.
Regarding RQ3, comparison with previous empirical research, particularly Caramancion (2023), as presented in Figure 2, indicates a general improvement in chatbot performance over time. Accuracy levels observed in the present study are higher than those reported in earlier evaluations, suggesting ongoing technological progress in AI-based fact-checking. At the same time, the comparison confirms that even improved systems continue to exhibit systematic weaknesses, reinforcing the need for cautious and critical deployment in journalistic contexts. Cross-study comparisons should be interpreted cautiously, as differences in language, dataset composition, temporal context, and prompt design can substantially influence chatbot performance independently of underlying model architecture. Accordingly, the higher accuracy observed for ChatGPT (v3.5) in the present study should not be interpreted as evidence of superiority over earlier GPT-4 evaluations, but rather as a context-specific outcome of methodological and data-related factors.
The findings related to RQ4 indicate that chatbot detection performance is shaped more strongly by media source category than by source visibility. Non-true stories originating from high-visibility sources are detected with comparable effectiveness to those from less prominent outlets, suggesting that automated verification systems do not inherently privilege content based on its public exposure. In contrast, higher detection accuracy is observed for content originating from traditional media sources, such as newspapers and broadcast outlets, compared to content disseminated through portals/blogs and social media platforms. This pattern likely reflects differences in content structure, linguistic formalization, and contextual framing across media environments, which may facilitate or hinder automated assessment. Overall, these findings underscore the importance of incorporating source characteristics into the design and evaluation of AI-assisted fact-checking systems.
These findings suggest that, despite measurable improvements in automated detection, AI chatbots remain limited in their ability to replicate human capacities such as contextual judgment, ethical reasoning, and critical interpretation. These limitations should be understood as interpretative implications of the observed error patterns rather than as directly measured deficits. Consequently, the results reinforce the importance of hybrid fact-checking models in which AI systems operate as assistive tools within human-centered verification workflows. Frameworks such as Veri|Fusion (Lamprou & Antonopoulos, 2023) exemplify this approach by integrating automated detection, crowdsourced input, and professional human oversight, with humans retaining the final decision-making role.
6. Conclusions
Across all evaluated dimensions, AI chatbots exhibited meaningful levels of accuracy, indicating that contemporary large language models are capable of supporting journalistic verification tasks. Compared to earlier empirical studies, such as Caramancion (2023), the performance of ChatGPT and Gemini suggests incremental progress in automated detection capabilities. The enhanced performance of the customized Greek Fact-check Bot further highlights the benefits of task-specific design, structured prompting, and alignment with professional fact-checking databases. These findings suggest that specialization and contextual adaptation significantly enhance the effectiveness of AI-based verification tools.
The findings further show that while source visibility does not substantially affect detection accuracy, the category of the media source plays a more significant role, with lower performance observed for content disseminated through portals, blogs, and social media compared to traditional media formats. More broadly, the observed limitations indicate that AI chatbots lack essential journalistic competencies, such as editorial judgment, ethical reasoning, and contextual interpretation. Consequently, AI-based systems should be understood as supportive tools within human-centered verification processes rather than as autonomous substitutes for professional fact-checking. These findings underscore the importance of adopting hybrid fact-checking models that combine automated systems with human expertise. Rather than replacing journalists or professional fact-checkers, AI chatbots should be positioned as assistive technologies that enhance efficiency, scalability, and preliminary filtering. In such models, AI systems can support the identification of potentially misleading content, facilitate evidence retrieval, and assist in categorizing claims, while humans retain the primary and final authority over verification decisions.
From a broader perspective, the findings carry important implications for journalism and media organizations. As misinformation continues to evolve, particularly through the use of generative AI, newsrooms and fact-checking organizations must invest not only in advanced technological tools but also in institutional frameworks that safeguard editorial responsibility. Hybrid systems offer a pragmatic pathway forward, enabling media professionals to leverage AI innovations while preserving the normative role of journalism in protecting the public sphere.
In conclusion, while AI chatbots have made, as expected, notable progress in detecting non-true stories, they cannot yet function as independent fact-checkers. The future of effective verification lies in human-centered, hybrid models, where artificial intelligence supports, but does not supplant, the critical role of human judgment. Furthermore, researchers insist that there are legal reasons why humans need to be kept in the loop for content moderation. According to a significant study funded by the European Science-Media Hub, limiting the automated execution of decisions on AI-discovered problems is essential to safeguarding human agency and natural justice: the right to appeal. This does not prevent the suspension of bot accounts at scale, but it ensures the correct auditing of the system processes deployed (Marsden & Meyer, 2019; Kertysova, 2018). Such an approach ensures both technological efficiency and democratic accountability in the ongoing effort to combat non-true stories.
Overall, the findings indicate that the effectiveness of AI-assisted verification depends less on raw model capability and more on system design, workflow integration, and contextual adaptation. The comparative results suggest that task-specific systems aligned with professional fact-checking practices can outperform general-purpose conversational models in real-world verification scenarios. These insights support the development of hybrid verification environments in which automated tools function as support systems within human-centered editorial processes, rather than as autonomous substitutes for professional fact-checking.