Cross-Domain Fake News Detection Using a Prompt-Based Approach

Abstract: The proliferation of fake news poses a significant challenge in today’s information landscape, spanning diverse domains and topics and undermining traditional detection methods confined to specific domains. In response, there is a growing interest in strategies for detecting cross-domain misinformation. However, traditional machine learning (ML) approaches often struggle with the nuanced contextual understanding required for accurate news classification. To address these challenges, we propose a novel contextualized cross-domain prompt-based zero-shot approach utilizing a pre-trained Generative Pre-trained Transformer (GPT) model for fake news detection (FND). In contrast to conventional fine-tuning methods reliant on extensive labeled datasets, our approach places particular emphasis on refining prompt integration and classification logic within the model’s framework. This refinement enhances the model’s ability to accurately classify fake news across diverse domains. Additionally, the adaptability of our approach allows for customization across diverse tasks by modifying prompt placeholders. Our research significantly advances zero-shot learning by demonstrating the efficacy of prompt-based methodologies in text classification, particularly in scenarios with limited training data. Through extensive experimentation, we illustrate that our method effectively captures domain-specific features and generalizes well to other domains, surpassing existing models in terms of performance. These findings contribute significantly to the ongoing efforts to combat fake news dissemination, particularly in environments with severely limited training data, such as online platforms.


Introduction
In recent years, the surge of fake news on social media platforms has become a pressing concern. Given the vast volume of online text, manual filtering has become impractical, highlighting the necessity for automated FND methods. This kind of automated detection has received significant attention from researchers. However, distinguishing false information poses challenges, mainly due to the limitations of current techniques, which are often less effective for broader topics like politics, entertainment, and others. Fact-checking websites like PolitiFact (http://www.politifact.com/, accessed on 1 January 2024) have undoubtedly played a role in combating fake news; however, these platforms typically concentrate on specific domains, like politics, and rely on human specialists to verify news content. As a result, generalizing their methods to other domains becomes difficult [1]. Additionally, despite efforts towards automatic detection, obstacles persist, particularly in scenarios with minimal or no available text, where the need for extensive text hinders precise predictions.
The emergence of pre-trained language models (PLMs) has revolutionized Natural Language Processing (NLP) by achieving state-of-the-art (SOTA) performance in various tasks. This success is attributed to their ability to capture contextualized information and encode rich knowledge about text through fine-tuning approaches. However, a major challenge hindering the full potential of PLMs is their reliance on vast amounts of labeled training data. The scarcity of annotated data presents a significant bottleneck in the NLP community, necessitating innovative approaches to leverage the power of PLMs even with limited training data. Recent advancements in few-shot learning techniques offer promising solutions to this challenge. Models such as GPT-3 [2] have demonstrated remarkable performance on few-shot tasks, highlighting the potential for effective application with limited data. Nevertheless, these models often come at the cost of high computational requirements due to their massive parameter size (e.g., GPT-3 with its 175B parameters), making them less accessible for widespread deployment.
Inspired by the remarkable ability of PLMs to capture deep contextual information, researchers are actively developing complementary strategies to enhance their capabilities. These strategies aim to reduce the complexity of PLMs while simultaneously augmenting their ability to understand context. This combined approach empowers PLMs to achieve outstanding results even in scenarios with limited or no labeled data, thus expanding their applicability across diverse NLP tasks. For instance, recent advancements in NLP have illuminated the transformative potential of prompt-based learning strategies in harnessing the capabilities of PLMs across diverse applications. Prompts are usually used as prefixes that serve as additional information or context that guides the model's generation process. Extensive research, as demonstrated in [2][3][4][5][6][7], has underscored the profound impact of prompts on enhancing PLM performance. The authors of [2] demonstrated exceptional performance in few-shot learning scenarios, relying solely on a natural language prompt and a limited number of task demonstrations as the input context. Gu et al. [3] pioneered the Pre-Trained Prompt Tuning (PPT) framework, leveraging prompts to optimize PLMs for downstream tasks. Similarly, Loem et al. [4] showcased the efficacy of prompt-based methods in leveraging GPT-3 for Grammatical Error Correction (GEC) tasks. Further advancements by Hu et al. [5] in clinical named entity recognition (NER) tasks using GPT-3.5 and GPT-4 underscore the versatility of task-specific prompts in augmenting PLMs' capabilities. Additionally, Lester et al. [6] highlighted the efficacy of prompt tuning in enhancing the performance of frozen language models across specific downstream tasks. In [7], the authors introduced Pattern-Exploiting Training (PET), a novel semi-supervised training approach designed to enhance the performance of language models in low-resource settings.
In the context of text classification, a crucial technique known as prompt engineering is commonly employed. This approach involves strategically including additional information (i.e., a prompt), denoted as P, before the actual input text X. By conditioning a model on this combined input [P; X], the aim is to maximize the likelihood of the model predicting the correct class label Y. This is achieved while keeping the model's internal parameters θ unchanged. In essence, prompt-based approaches reinterpret the task as a blank-filling problem, where a token (representing the label) is predicted using a verbalizer applied to a given prompt with a task-specific template.
Inspired by the study findings presented in [2], we present a prompt-based GPT2 model for FND. Our approach involves designing a template that focuses on incorporating contextual information from input sentences to guide the fine-tuning process of the language model. This template, instantiated as a prompt, serves as a cue for the model to discern the credibility of the input text and predict the binary classification probability of whether the text corresponds to the label of interest (e.g., "fake" or "real"). For instance, with a given sentence x, e.g., "This news article discusses the latest technology trends", we construct a prompt template by appending a blank token to the sentence, as follows: x + prompt. Subsequently, this template is passed through our model, and the words generated in the blank token are mapped into categories via a process we refer to as verbalization.
In this work, we demonstrate that guiding PLMs to identify fake news in a zero-shot capacity can be facilitated by harnessing the knowledge acquired during pre-training. Here, zero-shot transfer (or zero-shot learning) refers to training a model on one set of data and then evaluating it on a different set of data. This approach diminishes the necessity for gathering data specifically for task-oriented supervision, thereby alleviating the challenges posed by data scarcity in certain scenarios. We conducted various experiments on a publicly available dataset that encompasses different domains, and the results show that our method achieves performance better than or on par with the current SOTA methods. In addition, we evaluated our model's performance in zero-shot settings and found that it greatly outperformed SOTA baselines. Our framework can detect, with promising accuracy, fake news across different domains without any prior knowledge or training on fake content derived from certain domains. The task we employed to test our method can be seen as a zero-shot cross-domain FND task.

Contributions of This Work
By transferring knowledge from one domain to another, transfer learning can shorten training times, improve the generalization of deep learning (DL) and ML models, and address issues related to a lack of data [8]. Our aim is to investigate the effectiveness of contextualized PLMs such as GPT2 with prompt-based learning for FND. Specifically, we seek to address the following main research question: Can we train a model in one domain and transfer its knowledge to another domain for detecting fake news? In our effort to achieve this, we offer the following contributions:

• In contrast to the straightforward prompt-based learning approach, we propose a GPT-based model with contextualized prompt-based learning augmented by auxiliary information, such as domain information, for FND in English. The results of our experiments demonstrate that this model, when trained on a multi-domain dataset, outperforms a model without auxiliary information. Furthermore, we establish benchmark results for English-language FND using the TALLIP dataset.

• We conducted ablation studies on various prompt design choices to assess their impact on performance.

• We conducted experiments to compare the effectiveness of a contextualized prompt-tuning approach against a context-independent model in cross-domain FND.

• We found that the context-aware GPT model is effective in zero-shot cross-domain settings, significantly outperforming existing strong baselines. Our experimental evaluation indicates that feature representations are highly transferable across different domains.

Related Work
Large language models (LLMs) have been observed to effectively grasp a vast amount of implicit knowledge [9,10], and prompting these models holds significant promise for tackling a diverse range of NLP tasks [11]. PLMs utilize two primary training methodologies: pre-training, where models are trained from the ground up using extensive datasets, and fine-tuning, where these models are adapted to specific downstream tasks without requiring complete re-training. In supervised learning scenarios, a classification layer is typically added atop these models to facilitate predictions. However, with the advent of models boasting millions or even billions of parameters, task-specific fine-tuning has presented significant challenges.
According to [12], the transformative capabilities of LLMs have led to the emergence of a novel paradigm known as "pre-training then prompting", which harnesses PLMs for downstream NLP tasks. Unlike the traditional "pre-training then fine-tuning" paradigm, in which model parameters are updated using labeled datasets, "pre-training then prompting" enables the direct inference of test data labels. This is achieved by providing LLMs with a few input-output examples, followed by test inputs, without requiring parameter updates. This paradigm streamlines the adaptation of LLMs to specific tasks, offering a more efficient and flexible approach to NLP problem-solving.
The concept of prompt-based few-shot fine-tuning has garnered considerable attention. This approach aims to enhance the few-shot learning capabilities of PLMs by integrating label-specific verbalizers and prompts tailored to align with the language models' architecture [13]. Prompt-based few-shot fine-tuning, which adapts models using limited examples within a structured prompting format, has been extensively researched for moderately sized PLMs like BERT [14]. For instance, in [7], the authors introduced Pattern-Exploiting Training (PET), a novel semi-supervised training approach designed to enhance the performance of language models in low-resource settings. PET operates by transforming input examples into cloze-style phrases, which serve as prompts to guide the language model's understanding of the task at hand. These reformulated phrases are then leveraged to assign soft labels to a substantial pool of unlabeled data.
In the study described in [13], the authors adapted PET for tasks involving the prediction of multiple tokens. They demonstrated that when combined with ALBERT, both PET and its iterative variant (iPET) surpassed the performance of GPT-3 on SuperGLUE, even when trained with only 32 training examples. PET and iPET achieved this superior performance while utilizing only 0.1% of the parameters of GPT-3. Additionally, the training process with PET was completed within several hours on a single GPU without the need for expensive hyperparameter optimization. These authors also highlighted that similar levels of performance could be achieved without unlabeled data.
The authors of [15] introduced LM-BFF ("Better Few-Shot Fine-Tuning of Language Models"), a collection of straightforward and complementary techniques designed for fine-tuning language models using only a small set of annotated examples. Key components of their approach include (1) prompt-based fine-tuning coupled with an innovative pipeline for prompt generation automation, and (2) a refined method for dynamically and selectively integrating task demonstrations into each context. These authors conducted a systematic evaluation across various NLP tasks, encompassing classification and regression, to assess the few-shot performance of their methods. Their experiments reveal significant improvements over standard fine-tuning approaches in low-resource settings, with absolute performance enhancements of up to 30% and an average improvement of 11% across all tasks.
The authors of [16] proposed a neural model based on multilingual BERT for domain-agnostic multilingual FND. They conducted cross-domain transfer experiments to assess language-independent feature transfer. They also introduced a multilingual multi-domain FND dataset comprising five languages and seven different domains to aid research and development in resource-scarce scenarios. For our investigation of domain-agnostic FND, we utilize the English dataset from [16].
There have also been studies employing prompt-based learning to alleviate the impact of fake news. A study by the authors of [17] proposed a novel prompt-based curriculum learning framework to combat the COVID-19 infodemic. This framework leverages prompts and curriculum learning to address challenges like class imbalance and few-shot learning in text classification tasks. They demonstrated this framework's effectiveness and potential for broader applications in low-resource settings. More closely related to our work is the study reported in [18], where the authors tackled the challenge of detecting fake news across multiple domains. They introduced a novel approach called Prompt Learning for Low-Resource Multi-Domain Fake News Detection (PLDFEND). This method incorporates domain-aware and relational learning into the prompt-learning framework to enhance its effectiveness. Additionally, they optimized a verbalizer module to improve label word coverage and accuracy, particularly in zero-shot scenarios. Through comprehensive experiments in both resource-constrained and resource-rich settings, these authors demonstrated the effectiveness of PLDFEND in detecting fake news across multiple domains.
Unlike the methodology outlined in [17], our cross-domain approach is based on the GPT2 model as the foundational model, with examples drawn from diverse domains for both the training and testing phases. This strategy prevents the model from acquiring domain-specific characteristics, such as writing styles unique to particular domains like politics. Furthermore, our approach differs from that of [18] in several key aspects. Firstly, we utilize GPT2 as our foundational model, known for its autoregressive left-to-right approach, where the model predicts the next word in a sequence based on the preceding ones. Secondly, we assess the effectiveness of various contextualized prompt designs through ablation studies to measure their impact on performance. The proposed framework demonstrates high accuracy in both domain-specific and domain-agnostic scenarios. Our model outperforms current SOTA models and shows promising results in zero-shot settings for domain-agnostic feature transfer.

Methodology
An overview of our proposed techniques is shown in Figure 1, with each component of each framework being explained in the following text. We consider the framework in Figure 1a to be a strong baseline model. We measured the accuracy, weighted precision, recall, and F1-score (Section 4.3) as evaluation metrics in all of our experiments.

Task Formulation
Our work tackles FND as a zero-shot binary text classification problem, leveraging the capabilities of PLMs like GPT2 [19]. We achieve this using prompt-based fine-tuning. This framework is agnostic to the specific PLM chosen, allowing for flexibility in applying it to various pre-trained models. Our method employs a prompt-based approach to adapt the model for the specific task of fake news classification. In text classification, the aim is to categorize an input text x into a class label y ∈ Y. The prompt-based fine-tuning approach utilizes a template P(h) to guide the model toward the desired task. This template acts as a guiding instruction, for example, in classifying news articles. Additionally, a verbalizer is used in our model. The verbalizer consists of relevant words from the vocabulary V associated with each class. It essentially maps the model's predicted word probabilities to task-specific labels. This mapping function, denoted as f : v ∈ V → y ∈ Y, translates predicted word probabilities into class probabilities, effectively transforming the classification task into a cloze-style language modeling problem.
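To make the role of the verbalizer concrete, the following is a minimal sketch of how scores for label words could be turned into class probabilities. It is an illustration under stated assumptions rather than the exact implementation used here: the dictionary `VERBALIZER`, the functions `class_probabilities` and `predict`, and the use of a softmax restricted to verbalizer words are all hypothetical choices.

```python
import math

# Hypothetical verbalizer f: v -> y, using the default label words "fake"/"real".
VERBALIZER = {"fake": "fake", "real": "real"}


def class_probabilities(word_logits, verbalizer=VERBALIZER):
    """Map LM scores for candidate words to normalized class probabilities."""
    # Keep only the words that the verbalizer maps to a class.
    scores = {w: s for w, s in word_logits.items() if w in verbalizer}
    if not scores:
        raise ValueError("no verbalizer word found among the candidate words")
    # Softmax over verbalizer words only (shifted by the max for stability).
    top = max(scores.values())
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    probs = {}
    for word, value in exps.items():
        label = verbalizer[word]
        # Several words may map to the same class, so their mass is summed.
        probs[label] = probs.get(label, 0.0) + value / total
    return probs


def predict(word_logits, verbalizer=VERBALIZER):
    """Return the class with the highest verbalized probability."""
    probs = class_probabilities(word_logits, verbalizer)
    return max(probs, key=probs.get)
```

Words outside the verbalizer (for example, an unrelated word with a high score) are ignored, so the prediction always falls into one of the task labels.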
We can illustrate this process with the following example. Given a sequence of tokens x_i, the process entails constructing a prompt input by concatenating it with a standardized prompt sentence. Formally, x_i + prompt is defined as x_i appended with the prompt sentence: "The label of this news is ". Within this formulation, a masked token is utilized to signify the position where the model predicts the class label y_i, belonging to the set Y, for the input x_i. Subsequently, x_i + prompt is fed into the language model (LM) to generate a probability distribution for each word v in the vocabulary V, representing potential candidates to fill the masked position. Before being fed into the LM, the input text is encoded alongside the prompt, with truncation and padding applied as necessary to ensure that the entirety of the prompt is integrated within the model's input. Then, the probability of label y for x_i + prompt is calculated from h, the hidden representation. The training process involves sampling a subset of labeled data while optimizing the model's parameters using binary cross-entropy loss (BCEL) between the predicted label (i.e., the label j with the highest probability) and the true label distribution. The model is iteratively trained for three epochs.
In the second scenario, we chose to utilize the conditional context-aware prompt approach (Figure 1b) to explore the impact of incorporating domain information as contextual input data for the model. Like the baseline approach (Figure 1a), this approach constructs the input text by concatenating the provided news text with a prompt. However, this prompt now includes a placeholder denoted by [DOMAIN]. For example, consider the news text "Celebrity announces shocking news" and the prompt "This news is about [DOMAIN], the label of this news is". During training, the [DOMAIN] placeholder is dynamically replaced with the actual domain label (e.g., "Entertainment" in this case) provided in the dataset. This results in the following final input template: "Celebrity announces shocking news. This news is about Entertainment, and the label of this news is". This approach provides the model with both the news content itself and the additional context of the news domain, enabling it to potentially leverage both factors for more informed predictions.
In the following section, we outline the experimental setup, dataset, evaluation metrics, and baseline models utilized in this study.
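The prompt construction and the binary cross-entropy objective described in this subsection can be sketched as follows. This is a minimal illustration, not the authors' code: the names `build_input` and `binary_cross_entropy` are hypothetical, the sentence-final period is added as an assumption, and the domain-aware template is the one quoted with the [DOMAIN] placeholder.

```python
import math

# Prompt sentences quoted in the text; [DOMAIN] is the placeholder to fill.
BASELINE_PROMPT = "The label of this news is"
DOMAIN_PROMPT = "This news is about [DOMAIN], the label of this news is"


def build_input(news_text, domain=None):
    """Concatenate the news text with the baseline or domain-aware prompt."""
    if domain is None:
        prompt = BASELINE_PROMPT
    else:
        prompt = DOMAIN_PROMPT.replace("[DOMAIN]", domain)
    text = news_text.strip()
    # Assumption: close the news text with a period before appending the prompt.
    if not text.endswith((".", "!", "?")):
        text += "."
    return f"{text} {prompt}"


def binary_cross_entropy(p_fake, is_fake, eps=1e-12):
    """Binary cross-entropy for one example.

    p_fake is the predicted probability of the "fake" class and
    is_fake is the gold label encoded as 1 (fake) or 0 (real).
    """
    p = min(max(p_fake, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(is_fake * math.log(p) + (1 - is_fake) * math.log(1.0 - p))
```

For example, `build_input("Celebrity announces shocking news", "Entertainment")` yields the news text followed by the domain-aware prompt, while omitting `domain` yields the baseline input of Figure 1a.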

Experiments
In this section, we present the experiments that were conducted to evaluate the effectiveness of the proposed framework.

Experimental Setting
We conducted all experiments on NVIDIA A100 GPUs using PyTorch v2.4.0, a popular DL framework. Our dataset (Section 4.2) included real and fake news articles from various domains with varying lengths. To ensure a consistent input size and facilitate training, we truncated or padded these news articles to a maximum of 256 tokens, focusing only on the first 256 words. This choice was made after an analysis revealed that only 7 articles exceeded this limit by a small margin, while 1156 articles were shorter. This truncation/padding approach allowed us to maintain uniformity across the dataset. Additionally, using the first tokens of the English-language dataset yielded the best performance in our previous experiments [20]. We employed the Adam optimizer, a widely used adaptive learning-rate optimization algorithm that has been proven to be effective for various DL tasks. The learning rate was set to 3 × 10^-5 (an empirically defined value). Such hyperparameters in our model were primarily chosen through trial and error. We utilized the small variant of the GPT2 model, with 117 million parameters. Unlike traditional methods that predict class labels directly, we framed the text classification task as a conditional generation task. This approach leverages prompt engineering, incorporating a specific (context-aware) prompt with the input text. This prompt influences how the model interprets the data. Subsequently, the model generates a sequence of tokens representing the predicted class label (e.g., "fake" or "real").
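A minimal sketch of the truncation and padding step is given below. It makes several assumptions that the text does not state: the prompt counts toward the 256-token limit, the article is truncated from the end, and padding is placed on the left so that the prompt stays at the end of the sequence. The function name `fit_to_length` and the `<pad>` token are hypothetical.

```python
def fit_to_length(news_tokens, prompt_tokens, max_len=256, pad_token="<pad>"):
    """Truncate the news, keep the whole prompt, and pad to max_len.

    Returns the token list and an attention mask (1 = real token, 0 = padding).
    """
    if len(prompt_tokens) > max_len:
        raise ValueError("the prompt alone exceeds max_len")
    # Reserve room for the full prompt, then keep the first news tokens.
    budget = max_len - len(prompt_tokens)
    tokens = list(news_tokens[:budget]) + list(prompt_tokens)
    padding = max_len - len(tokens)
    # Assumption: left padding, so the prompt remains the final context
    # from which an autoregressive model continues.
    padded = [pad_token] * padding + tokens
    mask = [0] * padding + [1] * len(tokens)
    return padded, mask
```

With this layout, a long article loses only its tail, and the prompt is never cut off, which matches the requirement that the entirety of the prompt be integrated within the model's input.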

Dataset
A key challenge in FND research lies in the domain specificity of existing datasets. Many resources concentrate on a particular domain, such as politics, neglecting the need for materials that cater to a wider range of topics. To address this limitation and promote the development of domain-agnostic capabilities in FND models, De et al. [16] introduced the multi-domain fake news TALLIP dataset. This comprehensive dataset encompasses 4900 news articles, spanning diverse domains (http://www.iitp.ac.in/~ainlp-ml/resources/data/TALLIP-FakeNews-Dataset.zip, accessed on 12 January 2024). Our present study primarily investigates monolingual domain-dependent transfer learning, focusing on leveraging an English subset of this dataset. Data statistics are tabulated in Table 1. To measure the distinctions across these domains, we examined the distribution of topics among them. Figure 2 illustrates notable variances in the most commonly utilized words. Figure 2 shows the dominant themes within each domain. Business is characterized by financial terms like "stock", "market", and "Brexit". Education is associated with academic concepts such as "school", "student", and "teacher". Technology is linked to digital elements including "YouTube", "Google", and "digital". Entertainment emphasizes celebrity terms like "teen", "travel", and "fan". Politics is dominated by names of prominent figures such as "Trump", "Clinton", and "Obama", alongside terms like "president". Sports focuses on athletes, teams, and competitions, represented by words like "sport", "team", and "player". Finally, celebrity culture is highlighted in the last cloud with names like "Kim Kardashian" and "Taylor Swift", alongside words related to fame and lifestyle.

Evaluation Metrics
We employed four key metrics to assess the performance of our model: accuracy, precision, recall, and F1 score. We report in this paper the average precision, recall, and F1 score values, weighted to account for potential class imbalance.
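As an illustration of what "weighted" means here, the sketch below computes accuracy together with support-weighted precision, recall, and F1, where each class contributes in proportion to its number of true examples. The function name `weighted_scores` is hypothetical, and the convention of scoring a class with no predictions as zero precision is an assumption.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    n = len(y_true)
    support = Counter(y_true)  # number of true examples per class
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        # Assumption: a class that is never predicted gets precision 0.
        prec = tp / predicted if predicted else 0.0
        rec = tp / count
        f = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        weight = count / n  # class weight = share of true examples
        precision += weight * prec
        recall += weight * rec
        f1 += weight * f
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Note that support-weighted recall always equals accuracy, which is one reason the two are often reported side by side on nearly balanced data.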

Baseline Models
We compare our work with a number of different baselines, both traditional and DL-based approaches, which we detail as follows.
• mBERT [16]: mBERT is a multilingual BERT model with an initial learning rate of 2 × 10^-5 and a maximum sequence length of 128 tokens. The results reported here are adopted from [16].

• Semantic graph-based topic modeling [21]: This technique involves a pre-trained XLM-RoBERTa language model and neural topic modeling to extract contextual information and global linguistic features. The results reported here are adopted from [21].

• GPT2: In this model, initially, the input text x_i and the prompt are combined and processed through a natural language template. Subsequently, the model generates a prediction by completing the blank space, and the output is categorized into the default class (fake/real) using the verbalizer (as shown in Figure 1a).

Results
In this section, we describe the experiments that were conducted and discuss the results that were obtained. We carefully designed a broad set of experiments to answer the research question initially defined in Section Contributions of This Work. In doing so, we performed the following experiments: (1) cross-domain zero-shot transfer learning analysis; (2) prompt-engineering sensitivity analysis, carried out by examining various prompt design choices; and (3) few-shot transfer learning analysis to partially mitigate the challenge of domain shift. To evaluate the effectiveness of the proposed model, we utilized the TALLIP corpus described in Section 4.2. After reporting the outputs of these experiments, as outlined in Section 4, we will discuss the results.

Overall Performance
To assess the models' generalizability, we conducted a cross-domain evaluation experiment, which we discuss in this subsection. In this setup, models were trained on a specific domain (politics in this case), and their performance was evaluated in a different domain (celebrity news). Each model was trained in a zero-shot learning setting: once the model was trained, we evaluated the same model on a completely different domain. Table 2 presents the performance comparison between the proposed context-aware prompt-based GPT2 approach and the baseline methods. We focus on accuracy for this comparison due to the following two key factors: (1) the baseline methods primarily relied on accuracy as the main evaluation metric, a choice which facilitates a direct comparison between the approaches, and (2) as shown in Table 1, the dataset exhibits a relatively balanced distribution between real and fake news articles. In such scenarios, accuracy remains a reliable performance indicator.
As observed in the results, the models' accuracy on the celebrity domain fell somewhat below the chance level. This phenomenon, known as negative transfer, occurs when knowledge learned from the source domain (e.g., politics) hinders performance on the target domain (celebrity). While negative transfer is a potential risk, it would not be a primary concern if some degree of transfer learning is observed. The poor performance in the celebrity domain may also stem from the difference in the number of samples between the training set (e.g., business, with 105 training samples) and the test set (e.g., celebrity, with 340 testing samples), especially because the test set contains a larger number of samples from a different domain than the training set. This disparity in sample size, combined with the presence of a different domain (celebrity) in the test set, can cause a domain shift. A domain shift occurs when the test set's data distribution differs significantly from the training set's. In this case, the model may struggle to generalize from the smaller training set to the larger test set with a different domain, leading to poor performance.
Encouragingly, the context-aware GPT2-D model achieved significantly better performance than other models, including the strong baseline GPT2 model. This substantial improvement suggests that augmenting the natural language prompts with contextual information has the potential to enhance the model's cross-domain transferability. Our cross-domain transfer learning experiments yielded significant performance gains when transferring knowledge between seemingly disparate domains. This was observed for transfers between politics, education, and technology and between entertainment, business, and sports. These findings support two key hypotheses: (i) Despite domain-specific vocabulary and styles, these domains share a foundation in core language structures and grammatical rules. This underlying similarity allows the model to extract informative domain-independent features during training. These features, encompassing general language patterns and structures, are crucial for effectively classifying news articles from different domains. This observation aligns with the work in [16], which demonstrated the utility of domain-independent features for accurate fake news classification across domains. (ii) Our context-aware GPT2-D prompt method appears to play a crucial role in these successful transfers. These prompts provide domain-specific cues that guide the model towards interpreting the target domain data effectively, further enhancing its ability to leverage the transferable domain-independent features.
Our context-aware approach outperforms the context-free one and the other baselines, which confirms that enriching the prompt with context knowledge (domain information) can be used effectively for cross-domain zero-shot transfer learning in FND and other related areas.

Cross-Domain Zero-Shot Transfer Learning Analysis
This section evaluates the models' abilities to perform zero-shot transfer learning across different domains. In this setting, each model was trained on a comprehensive dataset encompassing all domains except the target domain, which was held out and then used for evaluation. For example, the training set may contain domains like politics, education, technology, and sports, and the test set may contain news from the business domain. This approach assesses the model's capability to generalize and apply knowledge to unseen domains. Table 3 shows the results for the baseline GPT2 model and the context-aware GPT2-D model. The GPT2-D results are promising for several domains. This model achieves very high accuracy (↑ 98%) on business, education, technology, entertainment, and politics, demonstrating its ability to leverage transferable knowledge across these domains. This aligns with our previous findings regarding the effectiveness of domain-independent features and the underlying similarities in core language structures. However, its performance in sports and celebrity domains is lower (at around 91% and 88%, respectively). This suggests that while this model can capture some generalizable features, specific terminology and stylistic nuances unique to these domains might require additional training data or domain adaptation techniques for optimal performance.
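The leave-one-domain-out protocol described above can be sketched in a few lines. This is an illustrative split routine only; the record layout (dictionaries with a `"domain"` key) and the function names `leave_one_domain_out` and `all_splits` are assumptions, not part of the TALLIP release.

```python
def leave_one_domain_out(examples, target_domain):
    """Train on every domain except target_domain; test on target_domain only."""
    train = [ex for ex in examples if ex["domain"] != target_domain]
    test = [ex for ex in examples if ex["domain"] == target_domain]
    if not test:
        raise ValueError(f"no examples found for domain {target_domain!r}")
    return train, test


def all_splits(examples):
    """Build one (train, test) split per domain, keyed by the held-out domain."""
    domains = sorted({ex["domain"] for ex in examples})
    return {d: leave_one_domain_out(examples, d) for d in domains}
```

Because the held-out domain never appears in the training portion, any accuracy measured on it reflects transfer from the remaining domains rather than memorized domain-specific cues.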

Prompt-Engineering Sensitivity
Zero-shot learning is particularly sensitive to prompt design. Even minor variations in how a language model is instructed can significantly impact its performance [13,22,23]. This subsection investigates the susceptibility of the best-performing GPT2-D model to variations in prompt design, specifically focusing on the verbalizer and template formatting.

• Verbalizer wording: We examine the impact of alternative verbalizer labels by comparing our default choice of "fake/real" with "false/true". Our findings demonstrate that the performance of the GPT2-D model is influenced by the specific wording used for the target concept (i.e., veracity). The choice between the two labelling schemes (i.e., "fake/real", Table 2, vs. "false/true", Table 4) has a moderate to potentially major impact on the model's accuracy. This sensitivity to the phrasing of target words highlights the importance of careful label selection when applying the GPT2-D model to an FND task.

• Template formatting: We explore two modifications to the prompt template:

- Quote addition: We introduce quotation marks around the prompt text, in contrast to the approach of Schick et al. [24], where quotes around the input text were omitted. Table 5 presents the model's performance after surrounding the prompt with quotation marks. Compared to the original results (shown at the bottom of Table 2), a noticeable decline in accuracy is observed across most categories. Adding quotation marks therefore appears to harm the model's performance, at least at this level of analysis.

- Prompt wording variations: This analysis examines the influence of varying the prompt wording on the performance of the GPT2-D model for FND. As illustrated in Tables 6 and 7, the prompt wording significantly influences performance. While all variations lead to a decline in accuracy compared to the original setup, the extent of the impact varies. Simpler prompts with less specific terminology seem to perform better.
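The prompt variations examined in this subsection can be sketched as follows. The template strings are the ones given in the captions of Tables 5-7 and the label words are the two verbalizer schemes compared above. Everything else is an assumption made for illustration: the helper names `build_prompt` and `classify`, the way the input text and prompt are joined, and the `label_word_scores` dictionary, which stands in for the scores a PLM would assign to each label word.

```python
TEMPLATES = {
    "default": "This news is about [DOMAIN], the label of this news is",
    "variant_1": "This article discusses [DOMAIN], the classification of this article is",
    "variant_2": "The topic of this news piece is [DOMAIN], with the assigned label being",
}

# Each verbalizer maps a class to the label word the model is expected to produce.
VERBALIZERS = {
    "fake/real": {"fake": "fake", "real": "real"},
    "false/true": {"fake": "false", "real": "true"},
}


def build_prompt(text, domain, template=TEMPLATES["default"], quote=False):
    """Fill the [DOMAIN] placeholder and append the prompt to the input text."""
    prompt = template.replace("[DOMAIN]", domain)
    if quote:
        # Quote-addition variant: wrap the prompt in quotation marks.
        prompt = f'"{prompt}"'
    # Assumption: text and prompt are joined with a single space.
    return f"{text} {prompt}"


def classify(label_word_scores, verbalizer):
    """Pick the class whose label word received the highest score."""
    return max(verbalizer, key=lambda cls: label_word_scores[verbalizer[cls]])


print(build_prompt("Stocks rally after earnings.", "business"))
print(build_prompt("Stocks rally after earnings.", "business", quote=True))
# Hypothetical scores for the two label words of the "false/true" scheme.
print(classify({"false": 0.7, "true": 0.3}, VERBALIZERS["false/true"]))
```

Keeping templates and verbalizers as plain data makes each ablation a one-argument change, which is convenient when comparing many prompt variants.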

The Challenge of Domain Shift
When applying a model trained on one domain (the source domain) to a different domain (the target domain), a phenomenon known as domain shift [25,26] can occur. This arises because the source and target domains often have distinct characteristics. For example, political data may emphasize persuasion, emotions, and public figures, while business data may focus on factual information, financial analysis, and market trends. This shift can lead to the model performing well on some aspects of the target domain, such as identifying key entities, but struggling with others, such as understanding specialized terminology. Moreover, domain shift can be caused by differences in the vocabulary and the frequency of words between different domains (Figure 2). When the language used in one domain differs significantly from that used in another, it can lead to inconsistencies in the data distribution, affecting the performance of DL models trained on one domain and tested on another. This phenomenon can result in domain-specific biases and challenges in generalization.
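One way to quantify the vocabulary and word-frequency differences mentioned above is to compare the unigram distributions of two domains. The sketch below uses the Jensen-Shannon divergence for this purpose. It is an illustrative choice and not an analysis performed in this study, and the function names and whitespace tokenization are assumptions.

```python
import math
from collections import Counter


def unigram_dist(texts):
    """Relative word frequencies over lowercased, whitespace-split texts."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions.

    The value is 0 for identical distributions and 1 for disjoint vocabularies.
    """
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl(a):
        # KL(a || m); words with zero probability in a contribute nothing.
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Toy sentences for illustration only.
politics = unigram_dist(["the senator urged voters to act", "voters back the senator"])
business = unigram_dist(["the market closed higher", "shares rose as the market rallied"])
print(round(js_divergence(politics, business), 3))
```

A larger value between a source and a target domain would indicate a larger lexical gap, which is one plausible contributor to the domain shift discussed here.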
The results presented in Table 8 underscore the sensitivity of the GPT2-D model to domain alignment between the training and testing phases. While the model exhibits perfect accuracy (100%) for certain source-target domain pairs, the misclassifications observed in other cross-domain scenarios reveal significant challenges in domain adaptation. For example, the model's flawless performance in the technology-to-education domain shift suggests that the underlying features and patterns learned from the technology domain are effectively transferable to the education domain. This success might be due to similarities in the content or structure of the news articles across these domains, allowing the model to generalize well despite the domain shift. Conversely, the model's misclassification of articles when trained on sports and tested on business, or when trained on education and tested on technology, indicates that domain-specific characteristics heavily influence the model's accuracy. For instance:

• Sports-to-Business Shift: The misclassification of the "Justice League" trailer article as "legit" when it was actually "fake" highlights the model's difficulty in handling domain-specific cues that were not present in the sports training data. The mismatch between sports and business content likely led to confusion in classifying the legitimacy of business-related news.

• Education-to-Technology Shift: The error in classifying the Fossil Group smartwatch announcement as "fake" suggests that the model may struggle with content that deviates significantly from the educational domain. This discrepancy points to the model's potential inadequacy in capturing the subtleties of technology-related news when trained on educational data.

• Business-to-Entertainment Shift: The misclassification of Alec Baldwin's article as "legit" instead of "fake" reflects the challenges of transferring learned features from business to entertainment contexts. This result implies that the model's training on business news did not effectively prepare it for the nuances of entertainment content.

These findings emphasize the importance of aligning the training and testing domains to ensure accurate classification. They also suggest that while the model can perform exceptionally well under certain conditions, domain-specific training remains crucial for achieving reliable performance across diverse contexts. To mitigate the effects of domain shift, domain adaptation techniques can be employed to bridge the gap between the source and target domains. A common approach involves incorporating a small amount of labeled data from the target domain into the training data of the source domain (few-shot learning). This allows the model to partially adapt to the target domain by learning from a limited number of target-domain examples. We applied this approach to the GPT2-D model, the best-performing model in this study; the results are reported in Table 9. For this experiment, we randomly sampled 30 data points from the training set of the target domain and added them to the training set of the source domain. For example, if the source domain was "business" and the target domain was "celebrity", we would sample 30 examples from the "celebrity" training set and include them in the "business" training data. Subsequently, the model was evaluated on the unseen test set from the target domain (the celebrity domain in this case).
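The few-shot augmentation step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `augment_with_target_samples`, the fixed random seed, and the toy records are hypothetical, and only the sample size of 30 comes from the text.

```python
import random


def augment_with_target_samples(source_train, target_train, k=30, seed=0):
    """Add k randomly sampled target-domain training examples to the source set."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    k = min(k, len(target_train))  # guard against very small target sets
    return list(source_train) + rng.sample(list(target_train), k)


# Toy data for illustration only: 100 "business" and 50 "celebrity" examples.
business_train = [{"text": f"business item {i}", "domain": "business"} for i in range(100)]
celebrity_train = [{"text": f"celebrity item {i}", "domain": "celebrity"} for i in range(50)]

few_shot_train = augment_with_target_samples(business_train, celebrity_train, k=30)
n_target = sum(ex["domain"] == "celebrity" for ex in few_shot_train)
print(len(few_shot_train), n_target)
```

The samples are drawn from the target domain's training split only, so the target test set used for evaluation stays unseen.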

Discussion
By analyzing Table 2, we find that the context-aware GPT2-D model achieves significantly better overall performance. It is interesting to note that the GPT2-D results are particularly outstanding for most cross-domain transfers. However, for some specific transfers, like sports and education and technology, politics and entertainment, and technology and celebrity, the best results are obtained by the fully fine-tuned mBERT model [16]. Table 3 shows that the baseline prompt-based GPT2 model scores lower than GPT2-D across all metrics. This can be attributed to the influence of domain knowledge augmentation on the decision-making process, which reveals useful patterns to the model. This reinforces the earlier hypothesis that, in the presence of contextualized domain information, the model captures useful information from the text and learns how to use it to perform the task. However, as mentioned above, this information does not help the model better understand the domain-specific features of some domains.
The ablation study conducted in this work (Tables 4-7) yielded very interesting results, particularly regarding prompt-engineering sensitivity. The results show that using different prompts did not yield significant improvements, and no consistent patterns were found. In some cases, the model performed better than with the default prompt. When combined with different label words (a strategy called "verbalizer wording"), performance improved for some domain transfers. Therefore, although there is a slight performance difference between different prompts, it is not consistent, and it is necessary to search for and learn better prompt representations for the task. Furthermore, an analysis of the various models' outcomes suggests that prompt-based models (and potentially other baseline models performing cross-domain transfer) "suffer" from the domain-shift problem, which occurs when the training data and test data come from different distributions. In certain domains, only a small amount of labeled data is available. Moreover, the ways in which information spreads differ significantly across domains [27], making cross-domain zero-shot learning quite challenging.
As a straightforward way to address domain shift, we added samples from the target domain to the training set and trained the model using few-shot learning. As shown in Table 9, the model performance improved significantly, and we suspect that increasing the number of few-shot samples would lead to even better results. The scores of the considered domains (i.e., domains that showed poor performance in the cross-domain experiments shown in Table 2) can be presented as a grayscale heat map.
Future research could explore techniques to enhance domain adaptation for improved performance on more nuanced domains like sports and celebrity news. Additionally, investigating the specific transferable features learned by the model could provide valuable insights into the underlying mechanisms facilitating cross-domain generalization. Our future work will aim to expand on verbalizers, building on the finding that altering verbalizer wording can impact performance to varying degrees. We intend to explore dynamic approaches to optimize the use of verbalizers. Additionally, we plan to address the challenge of transferring knowledge to new data through enhanced adaptation techniques and the integration of additional information. Our initial results indicate that incorporating domain-specific information into prompts significantly enhances contextual understanding, facilitating knowledge transfer across domains.
Moreover, stability and reliability are critical for the practical deployment of machine learning models and warrant thorough investigation. This study did not extensively evaluate these factors due to scope- and resource-related constraints. Future work will address this by conducting multiple trials with various data partitions and incorporating external datasets. Such an approach will provide a more robust assessment of the model's performance, ensuring consistency across different scenarios and data conditions. This will significantly contribute to the model's validation, offering a comprehensive understanding of its stability and reliability in real-world applications.

Leveraging the Proposed Solution for FND in Social Networking Services
The proposed framework can be seamlessly integrated into online social media platforms to provide an automated solution for detecting and flagging fake content. By embedding our model within these platforms, it becomes possible to continuously monitor and assess the authenticity of shared information in real time. This integration serves as a useful tool for identifying and mitigating the spread of misinformation while respecting the principles of freedom of speech. The framework can operate in the background to highlight potentially misleading or false content, enabling users to make more informed decisions. This proactive approach would not only enhance the credibility of information on social networks but also support the broader goal of maintaining a trustworthy digital information environment.

Conclusions
This work investigated the effectiveness of contextualized PLMs for cross-domain FND. We addressed whether a model trained in one domain could transfer its knowledge to another. The proposed model, a GPT-based approach with contextualized prompt-based learning augmented by domain information, achieved promising results. Fine-tuning the model on a multi-domain dataset led to superior performance compared to a model lacking auxiliary information. We established benchmark results for English-language FND using the TALLIP dataset. Furthermore, ablation studies explored the impact of prompt design choices, and a comparison demonstrated the effectiveness of the contextualized prompt-tuning approach against context-independent models in zero-shot cross-domain settings. The context-aware GPT model exhibited significant improvements over strong baselines in zero-shot cross-domain tasks. These findings suggest that feature representations learned by contextualized PLMs are highly transferable across different domains, offering a valuable approach for tackling the challenge of FND in diverse settings. Moreover, while our experiments utilizing data from various domains have corroborated the presence of intra- and inter-domain relationships, sharing knowledge across these domains can also introduce noise and domain shift. This is because the efficacy of transferring knowledge varies depending on the specific domains involved. Future studies may investigate strategies for refining domain adaptation to enhance performance in more nuanced domains, such as sports and celebrity news. Additionally, exploring the identification of specific transferable features acquired by the model could yield valuable insights into the underlying mechanisms that facilitate cross-domain generalization.

Figure 1. Panel (a) illustrates the prompt-based GPT2 baseline, while panel (b) presents the proposed context-aware prompt-based GPT2-D architecture. First, the input text x_i plus the prompt is fed into a natural language template. The pre-trained language model (PLM) then makes a prediction by filling in the blank and mapping the result to the default class (fake/real) using the verbalizer. The key distinction is that the architecture in (b) employs a contextualized prompt by injecting domain-based information ([DOMAIN]) to enhance performance.

Figure 2. Word clouds for different domains.

Table 1. Class distributions of the TALLIP (English) dataset.

Table 3. Comparison of cross-domain zero-shot transfer learning using the baseline GPT2 model and the GPT2-D model. Both models were trained on a comprehensive cross-domain dataset excluding the training set of the target domain in order to assess their cross-domain generalization capability.

Table 4. Performance of the GPT2-D model using the default prompt with "true" and "false" label tokens.

Table 5. Performance of the GPT2-D model using the default prompt "This news is about [DOMAIN], the label of this news is" with quotation marks.

Table 6. Performance of the GPT2-D model using the prompt "This article discusses [DOMAIN], the classification of this article is".

Table 7. Performance of the GPT2-D model using the prompt "The topic of this news piece is [DOMAIN], with the assigned label being".

Table 8. A sample of misclassified classes obtained using the GPT2-D model.

Table 9. Few-shot experiment using the GPT2-D model (with the default prompt). The values highlighted with a gray background indicate the improved scores of the domains we experimented with.