Leverage Boosting and Transformer on Text-Image Matching for Cheap Fakes Detection

Abstract: The explosive growth of social media has amplified many kinds of misinformation and is attracting tremendous attention from the research community. One of the most prevalent forms of misleading news is cheapfakes. Cheapfakes rely on non-AI techniques, such as pairing unaltered images with false context, which makes them easy and “cheap” to create and therefore abundant on social media. Meanwhile, the development of deep learning has opened many research domains relevant to news, such as fake news detection, rumour detection, fact-checking, and verification of claims about images. Nevertheless, despite the harm that cheapfakes cause to online communities and the real world, there is little research on detecting them in the computer science domain. Detecting misused, false, or out-of-context pairs of images and captions is challenging, even for humans, because of the complex correlation between the attached image and the veracity of the caption content. Existing research focuses mostly on training and evaluating on a given dataset, which limits the resulting methods to the categories, semantics, and situations that the dataset covers. In this paper, to address these issues, we leverage textual semantic understanding learned from large corpora and integrate it with different combinations of text-image matching and image captioning methods via an ANN/Transformer boosting schema to classify a triple of (image, caption_1, caption_2) as OOC (out-of-context) or NOOC (not out-of-context). We customized these combinations according to various exceptional cases that we observed during data analysis. We evaluate our approach using the dataset and evaluation metrics provided by the COSMOS baseline. Compared to other methods, including the baseline, our method achieves the highest Accuracy, Recall, and F1 scores.


Introduction
In recent years, the amount of information and news has increased dramatically thanks to the convenience and development of social media. Alongside its benefits, however, this growth has significantly increased the quantity and impact of misinformation on individuals and society, which threatens democracy, journalism, and freedom of expression. Fake news disturbs communities on multimedia platforms and has severe consequences for many aspects of reality and for the ordinary lives of many people. For example, fake news affected the 2016 and 2020 U.S. elections.
Besides the growing volume of false information, the ways in which misleading information spreads have also evolved into many types and forms, making it more effective and convenient for deceiving people. For example, the growth and popularity of microblogging platforms such as Twitter, Facebook, and Instagram have increased the speed at which rumours and fake news spread, since social media platforms are becoming an increasingly common and necessary part of ordinary life for many people. Furthermore, controlling the content and veracity of posts on these platforms is difficult because of their large number of users.
The blossoming of deep learning has opened new domains and technologies, one of which is deepfakes [1,2]. Deepfakes have received attention from the computer vision community; the technique can manipulate images and videos with high quality, producing results that are hard to distinguish from unaltered ones. However, despite the effectiveness of deepfakes in swaying people's beliefs, one of the most prevalent ways of spreading disinformation is out-of-context photos, which use unaltered images in news or posts with false context.
Cheapfakes are a type of fake news that utilizes both images and new context. The danger of cheapfakes is that they are easy and cheap to make. While deepfakes rely on deep learning, which requires sophisticated technology and effort, cheapfakes make use of simple, non-AI techniques such as Photoshop, manipulated video speed, or unaltered images/videos from different events with false context, which makes them simple to create and more common.
According to MIT Technology Review (https://www.technologyreview.com/2020/12/22/1015442/cheapfakes-more-political-damage-2020-election-than-deepfakes/, accessed on 7 October 2022), deepfakes did not disrupt the 2020 U.S. presidential election, but cheapfakes did. Fazio [3] also warned of the dangers and explained why out-of-context photos are compelling. First, photos are usually attached to news, and people are already used to them. Second, photos help people retrieve an image-related event faster, which makes the claim feel more truthful. Lastly, posts with photos receive more attention on social media platforms, which helps spread false information.
To meet the emerging need for a good cheapfakes detection tool and to overcome the limitations of existing works, we propose several approaches that utilize multimodal representation learning techniques. By combining several techniques, including textual entailment, image-text matching, and boosting algorithms, our methods improve performance, and we assess the performance of several methods on cheapfakes detection.

Related Work
This section briefly surveys fake news detection methods, including cheapfakes detection and other subdomain methods.

Fake News Detection
Fake news has existed for a long time, even before the internet appeared. Recently, fake news has been one of the most prevalent ways to spread disinformation in society. There are many studies and public datasets on this issue. Usually, the research and the public datasets focus on textual fake news. LIAR [4] and FEVER [5] are two well-known public datasets whose data are collected from news websites. Each consists of one statement and a given claim, with multiple grades to determine the relation and veracity. Classification based on linguistic and semantic features [6,7] and data mining [8,9] are two traditional methods for determining the veracity of news from the semantics of the given text. This approach relies on training with the given data and cannot utilize external knowledge to verify the news. Building on developments in knowledge graph data and methods, Refs. [10-12] make use of the knowledge graph as external knowledge. This approach is ideal in theory, but in practice knowledge graphs suffer from missing relations between entities and still have a long way to go. Although the task usually focuses on textual fake news, it has many implications for detecting disinformation that involves both images and text.

Rumour Detection
Alongside fake news detection, rumour detection also has a long history. Rumours refer to information not confirmed by official sources that spreads on social media platforms. Unlike fake news, which consists primarily of textual information, rumours include many types of information such as reactions, comments, attached images, user profiles, and platforms. In rumour spreading, followers play an essential role because they directly or indirectly contribute to the exponential growth of rumours by forwarding news, with or without their own comments, whose content could distort the original. Hence, understanding the following (i.e., the series of comments derived from the original news), especially in social networks, can help filter out fake news. Because data collected from social networking services can contain more attributes than data collected from news websites, such as user profiles, attached relevant posts, reactions, and comments, the data are rich and have complex attributes. Research in this area therefore takes more varied approaches than fake news detection. Tree structures, sequence networks [13,14], and graph neural networks [15,16] are common approaches for combining and extracting correlation features from sequence and time-series data on microblogging platforms.

Fact Checking
Fact-checking is the task of classifying the veracity of a given claim. Verifying a claim is time-consuming: people need to search for and check the reputation and impact of the source website, and some claims even need several professionals and several hours or days. To address this issue, many techniques have been researched and developed to reduce manual fact-checking. There are two popular dataset types for fact-checking. The first is used to verify a given pair of claim and evidence [17]. Prior research has utilized text entailment [18] to compare semantic relations between claims and evidence. Liangming et al. [19] also utilized question answering by generating questions from the given claim. The second type uses large-scale data and is processed with knowledge graph techniques [20].

Verify Claim about Images
Besides fake news detection, rumour detection, and fact-checking, verifying claims about an image has also received attention in recent years. While the above tasks mainly verify textual claims or posts, verifying a claim about an image focuses on the post/claim/caption together with the attached image. This is a challenging task: verifying the veracity of the claim itself is hard, and deciding whether the attached image is related to the claim or sufficient to establish its truth is even harder. Refs. [21-23] extract features from textual captions and attached images through corresponding pre-trained models, then concatenate them and pass them through a linear layer for classification. La et al. [24] utilized an image-text matching method to measure correlations between captions and images. Dimitrina et al. [25] also took advantage of Google image search to enrich information (website, categories of news, and images) and then made use of TF-IDF to predict veracity.

Multi/Cross-Modal Representation Learning
In the field of multimodal reasoning and matching, many techniques have been developed to resolve various challenging tasks such as Visual Question Answering (VQA) [26], Image Captioning [27], Text-to-Image [28], and Image-Text Matching [29], and cross-modal learning between images and text remains an active research area. For the task of verifying claims about images, many methods use the simple technique of extracting image features through a Convolutional Neural Network and concatenating them with textual features to classify the truthfulness of news. This technique is simple, yet it depends on the training dataset and does not generalize to real-world settings or to other aspects and types of news.

Dataset
This section briefly introduces the Out-of-Context Detection Dataset in COSMOS [30], which we used to assess and evaluate our proposal's performance. The dataset was collected from news websites (New York Times, CNN, Reuters, ABC, PBS, NBCLA, AP News, Sky News, Telegraph, Time, DenverPost, Washington Post, CBC News, Guardian, Herald Sun, Independent, CS Gazette, BBC) and fact-checking websites. The dataset is in English, covers 19 categories, and does not contain digitally altered or fake images. The statistics are shown in Figure 1 and Table 1. We refer readers to [31] for more details.
Train/Validate Set: In the training set, captioned images were collected from news websites. Each captioned image consisted of one image, one or multiple attached captions, the source URL, the entity list of each caption, a modified caption in which each entity is replaced by its corresponding ontology, and the locations of 10 bounding boxes extracted by a Mask-RCNN pre-trained on MS COCO. The training data did not contain out-of-context captioned images: every captioned image was not-out-of-context and did not have a context label. The training data consisted of around 200,000 images with 450,000 matching textual captions, 20% of which was split off as the validation set. An example of a captioned image from the training set is illustrated in Figure 2.
Test Set: In the test set, captioned images were collected from both news websites and fact-checking websites. As in the training set, each captioned image of the test set consisted of an image, captions, the source URL, the entity list, the modified caption, and bounding boxes. However, each captioned image in the test set contained two corresponding captions. One of these captions was always not-out-of-context; the other could be out-of-context or not-out-of-context. Each captioned image also had a context annotation indicating whether it contained an out-of-context caption.
In summary, the test set contained 1000 captioned images, which included 1000 images and 2000 textual captions. An example of a captioned image from the test set is illustrated in Figure 3.

Proposed Method
In this section, we introduce the COSMOS baseline [30], explain our motivation, and describe our methods.

COSMOS Baseline
In prior research on image and news veracity classification, methods usually exploit multiple modalities by extracting features of the text/captions and the attached images through pre-trained layers, such as a convolutional neural network, LSTM [32], or BERT [33], and combining these features by concatenation or summation with an appropriate objective function. This approach can take advantage of multiple datasets, such as ImageNet, MSCOCO, STS, and MNLI, as the basis for understanding and representing the semantic information of the data, and then fine-tune on other news datasets to improve performance.
Despite its advantages, prior research is also limited by the attributes of the dataset. Most prior work fine-tunes on a new dataset, which limits it in many respects, such as the categories and characteristics of news, and it cannot cover subjects or situations not included in the dataset.
In COSMOS, the authors aim to match the caption with the most correlated object in the image by utilizing self-supervised learning. To do this, they first use Mask-RCNN [34] pre-trained on MSCOCO [35] and select the top 10 ROIs (Regions of Interest) with the highest detection scores, together with additional features of the entire image. For text processing, they first make use of NER (Named Entity Recognition) to generalize captions and then pass the captions through USE (Universal Sentence Encoder) [36] to extract caption embeddings. Next, the bounding box features and caption embeddings are passed through a linear layer that maps them to the same dimension. The paper uses the max margin ranking loss [30] as the objective function:

$$L = \frac{1}{N} \sum_{i=1}^{N} \max\left(0, \left(S^{r}_{IC} - S^{m}_{IC}\right) + \alpha\right) \quad (1)$$

where $S^{r}_{IC}$ and $S^{m}_{IC}$ are the similarity measures of a random caption-image pair and a matching caption-image pair, respectively, $\alpha$ is the margin parameter, and $N$ is the number of training samples. The similarity measure is calculated as the maximum dot product between the 11 ROIs (the 10 selected regions plus the entire image) and the matching/random caption, as given in Equation (2):

$$S_{IC} = \max_{i = 1, \dots, 11} \left(b_i^{\top} c\right) \quad (2)$$

where $b_i$ is the feature vector of the $i$-th proposal bounding box and $c$ is the feature vector of the caption. At testing time, for each captioned image (caption_1, caption_2, image), the COSMOS method uses a simple if-else rule to determine out-of-context captioned images:

$$\text{prediction} = \begin{cases} \text{OOC}, & \text{if } IoU(B_{IC_1}, B_{IC_2}) > t_i \text{ and } S_{sim}(C_1, C_2) < t_c \\ \text{NOOC}, & \text{otherwise} \end{cases} \quad (3)$$

where $IoU(B_{IC_1}, B_{IC_2})$ is the intersection-over-union of the two bounding boxes that have the largest similarity measure with the two corresponding captions; $S_{sim}(C_1, C_2)$ is the similarity measure between the two captions, defined in cosine space; and $t_i$ and $t_c$ are the fixed thresholds for $IoU(B_{IC_1}, B_{IC_2})$ and $S_{sim}(C_1, C_2)$. By matching and comparing the two captions with their corresponding objects, the authors can assess whether the two captions mention a related subject/object (determined by $IoU(B_{IC_1}, B_{IC_2})$).
If the two captions mention a related subject/object but are semantically dissimilar (determined by $S_{sim}(C_1, C_2)$), then the given captioned image is out-of-context. Otherwise, it is not-out-of-context.
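The COSMOS scoring and decision rule described above can be sketched in a few lines. This is a minimal illustration, not the COSMOS implementation: the function names, the `(x1, y1, x2, y2)` box format, and the averaging of the loss over a list of pairs are assumptions made for the example, and the thresholds `t_i` and `t_c` are left as parameters because their values are not given here.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def best_region(region_feats: List[Sequence[float]],
                caption_feat: Sequence[float]) -> Tuple[int, float]:
    """Return (index, score) of the region with the largest dot product.

    The score is the image-caption similarity S_IC; the index identifies
    the bounding box that the caption is matched to.
    """
    scores = [dot(b, caption_feat) for b in region_feats]
    idx = max(range(len(scores)), key=scores.__getitem__)
    return idx, scores[idx]


def margin_ranking_loss(s_random: Sequence[float],
                        s_match: Sequence[float],
                        alpha: float) -> float:
    """Hinge loss max(0, (S_r - S_m) + alpha), averaged over the pairs."""
    losses = [max(0.0, (r - m) + alpha) for r, m in zip(s_random, s_match)]
    return sum(losses) / len(losses)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def is_out_of_context(box_c1: Box, box_c2: Box, caption_sim: float,
                      t_i: float, t_c: float) -> bool:
    """OOC if both captions ground to overlapping regions (IoU > t_i)
    while the captions themselves are semantically dissimilar (< t_c)."""
    return iou(box_c1, box_c2) > t_i and caption_sim < t_c
```

In use, `best_region` would be called once per caption to select `box_c1` and `box_c2`, and `caption_sim` would come from a sentence-similarity model.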

Motivation
By training the model to match a caption with the correlated object in the image and by utilizing models pre-trained on large-scale textual datasets, the COSMOS method can exploit the semantic features and understanding learned from other large-scale datasets, which makes it less prone to overfitting on other news or fact verification tasks and datasets.
Despite these advantages, the COSMOS baseline has a weakness: because it takes the features of the entire image from Mask-RCNN trained on MSCOCO, it cannot fully express the context of the entire image, since Mask-RCNN is trained for object detection, not for description. Moreover, a caption usually mentions multiple objects and is highly correlated with the context of the image.
Given this limitation of the COSMOS method when comparing the image with the caption, in this paper we propose and evaluate a method that expresses the content features of the image more effectively and better extracts the semantic relation between two captions. Furthermore, instead of defining a rule for determining out-of-context captioned images, we combine results from multiple methods by means of boosting techniques to improve performance.

Methodology
This paper proposes two approaches to measuring the correlation between image and caption: image captioning and image-caption matching.
Image Captioning: For the image captioning approach, we utilize the self-critical captioning method [37] to generate a content description of an image. By converting the image's content to textual form, we can use models pre-trained on large-scale datasets for the STS (Semantic Textual Similarity) task [38] to measure the correlation between caption and image.
Image-Caption Matching: For the image-caption matching approach, we utilized a model trained for image-text matching on the MSCOCO dataset [35] to measure the correlation between caption and image. In this paper, we used the Visual Semantic Reasoning [39] method to measure the similarity between image and caption. See Figure 4 for illustration.
Figure 4. Illustration of boosting with image captioning method. First, the image is passed through the self-critical captioning model [37] to obtain a description of the image in textual form. Next, RoBERTa(MNLI) is utilized to extract the correlation between caption_1, caption_2, and the image (NLI(caption_1, caption_2), NLI(caption_1, caption_image), NLI(caption_2, caption_image)). To overcome the difference between training data and testing data and to improve performance, we apply the boosting algorithm on part of the testing data to combine results from our proposal and the COSMOS baseline.
The VSRN (Visual Semantic Reasoning) [39] method utilizes margin ranking loss as the objective function. The margin ranking loss only requires the correlation measurement of a matching caption-image pair to be higher than that of a non-matching pair; it does not push the matching score of a matching pair above a fixed threshold. As shown in Figure 5, the matching scores of matching caption-image pairs therefore span a wide range of values, and a matching pair can score lower than a non-matching pair that involves a different image and caption. However, for the same image, the matching caption scores higher than a non-matching caption. Because of this property of VSRN and the margin ranking loss, we normalized the matching score using Equation (4). See Figure 6 for illustration.
where Ŝ denotes the normalized matching score, and r denotes a random index that satisfies C_r ≠ C and I_r ≠ I. By subtracting the mean of the matching scores of the N samples, the result expresses the degree of correlation of the given matching image-caption pair relative to other non-matching image-caption pairs.
Figure 5. Example of the matching score between image and caption. Green expresses matching caption and red expresses non-matching caption. Because of the margin ranking loss, for a given image, matching captions have a higher score than non-matching captions. However, across different images, a matching caption does not always have a higher matching score than a non-matching caption.
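As a rough illustration of this normalization, the sketch below subtracts the mean score of N randomly sampled non-matching pairs from the raw matching score. It is one possible reading of the description, not the exact form of Equation (4): how the random pairs are drawn and scored is an assumption, and the function name is hypothetical.

```python
from statistics import fmean
from typing import Sequence


def normalize_matching_score(score: float,
                             random_pair_scores: Sequence[float]) -> float:
    """Subtract the mean score of N random non-matching pairs.

    `score` is the raw matching score S(I, C); `random_pair_scores` holds
    the scores of N randomly sampled pairs (assumed to be non-matching).
    """
    if not random_pair_scores:
        raise ValueError("need at least one random pair score")
    return score - fmean(random_pair_scores)
```

A positive result indicates that the pair scores above the average random pair, which makes scores comparable across images whose raw score ranges differ.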
To better estimate the correlation between two captions, instead of using only the cosine similarity from methods trained on the STS task [38], we also used methods trained on the NLI (Natural Language Inference) task [40] to express the semantic relation between two captions. We chose SBERT-WK [41] and RoBERTa [42] to extract semantic relations between two captions.
One of the difficulties of the COSMOS dataset is that the training/validation data are constructed differently from the testing data. In the training data, each captioned image consists only of not-out-of-context pairs, and captions are always trustworthy news that matches the image's context. The testing data, in contrast, contain both out-of-context and not-out-of-context captioned images. A caption can be fake news or a description of the image, and it can match or mismatch the image and the other caption. In our experience, fine-tuning on the training data and evaluating directly on the testing data gave poor results. We therefore applied boosting algorithms on part of the testing dataset. Boosting can combine the results of textual entailment (NLI, STS) and image-caption matching (image-text matching, image captioning), which merges the semantic understanding of multiple methods to improve accuracy and overcome the domain shift issue. We leveraged an ANN and a Transformer Encoder as boosting architectures. Six hundred captioned images were extracted as training data and 400 captioned images as evaluation data.
Figure 6. Illustration of boosting with image-caption matching method. First, the image, caption_1, and caption_2 are passed through VSRN [39] and normalized by Equation (4) to obtain matching scores (Ŝ(I, C_1), Ŝ(I, C_2)). In addition, to enrich the semantic correlation information between caption_1 and caption_2, we make use of RoBERTa(MNLI) to extract the relation between the two captions. Similar to the image captioning method, we apply the boosting algorithm on part of the testing data to combine results from our proposal and the COSMOS baseline.
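The boosting input for the image-caption matching variant can be pictured as a small feature vector per test triplet, followed by the 600/400 split. The sketch below is illustrative only: the `TripletScores` class, its field names, the three-way NLI probability vector, and the seeded random shuffle are all assumptions, since the exact feature layout and the way the 600 training triplets were selected are not specified here.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TripletScores:
    """Per-triplet scores produced by the upstream models (hypothetical)."""
    iou: float                             # IoU(B_IC1, B_IC2) from COSMOS
    caption_sim: float                     # S_sim(C_1, C_2) from COSMOS
    nli_c1_c2: Tuple[float, float, float]  # assumed NLI class probabilities
    s_hat_c1: float                        # normalized VSRN score for C_1
    s_hat_c2: float                        # normalized VSRN score for C_2
    label: int                             # assumed encoding: 1 = OOC, 0 = NOOC


def to_features(t: TripletScores) -> List[float]:
    """Flatten the scores into one input vector for the booster."""
    return [t.iou, t.caption_sim, *t.nli_c1_c2, t.s_hat_c1, t.s_hat_c2]


def split_for_boosting(samples: List[TripletScores], n_train: int = 600,
                       seed: int = 0):
    """Shuffle (an assumption) and split into booster train / eval parts."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]
```

With the 1000 test triplets, `split_for_boosting` yields 600 triplets for training the booster and 400 for evaluation.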

Experimental & Results
This section introduces the dataset and metrics used to evaluate our proposed method. We compare our method to others on the same dataset and metrics, and we discuss the advantages and disadvantages of our method.

Working Environment
All our experiments were run on three NVIDIA Tesla A100 40 GB GPUs, an Intel Xeon Gold 5220R CPU, and 256 GB of RAM. We extracted 600 captioned images of the testing data for boosting and 400 captioned images for evaluating performance.
We used the same settings for every method to make their performance easy to compare. We used an Adam optimizer with a learning rate of $1 \times 10^{-3}$, a weight decay of $4 \times 10^{-5}$, and cross-entropy loss to update the model. We used a simple ANN and a Transformer Encoder to boost the results.
For the ANN, we projected the input to a 64-dimensional hidden representation, applied a PReLU activation, and passed the result through a linear layer to classify the captioned image.
For the Transformer Encoder, we set the input feature dimension to 16 and used two attention heads and two layers to extract features. The output was then passed through a linear layer to classify the captioned image.
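To make the ANN booster configuration concrete (a 64-dimensional hidden layer, PReLU, then a linear classifier), the following dependency-free sketch runs a forward pass with randomly initialized weights. It is not the trained model: the class and helper names are hypothetical, the PReLU slope is fixed at 0.25 instead of being learned, the softmax output is added for readability, and training with Adam and cross-entropy is omitted.

```python
import math
import random
from typing import List, Tuple


def linear(x: List[float], weight: List[List[float]],
           bias: List[float]) -> List[float]:
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def prelu(x: List[float], slope: float = 0.25) -> List[float]:
    """PReLU with a fixed negative slope (learnable in a real model)."""
    return [v if v >= 0 else slope * v for v in x]


def softmax(z: List[float]) -> List[float]:
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def init_linear(n_in: int, n_out: int,
                rng: random.Random) -> Tuple[List[List[float]], List[float]]:
    """Uniform initialization in [-1/sqrt(n_in), 1/sqrt(n_in)]."""
    bound = 1.0 / math.sqrt(n_in)
    weight = [[rng.uniform(-bound, bound) for _ in range(n_in)]
              for _ in range(n_out)]
    bias = [rng.uniform(-bound, bound) for _ in range(n_out)]
    return weight, bias


class TinyBooster:
    """Untrained ANN booster: features -> 64 -> PReLU -> 2 classes."""

    def __init__(self, n_features: int, hidden: int = 64,
                 n_classes: int = 2, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w1, self.b1 = init_linear(n_features, hidden, rng)
        self.w2, self.b2 = init_linear(hidden, n_classes, rng)

    def forward(self, x: List[float]) -> List[float]:
        hidden = prelu(linear(x, self.w1, self.b1))
        return softmax(linear(hidden, self.w2, self.b2))
```

For example, `TinyBooster(n_features=7).forward(features)` returns two class probabilities for one triplet's feature vector; in practice an equivalent model would be built and trained in a deep learning framework.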

Evaluation Metrics
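To evaluate the effectiveness of our proposal, we used four metrics: accuracy, precision, recall, and F1-score, defined by the following equations:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

where:
• TP (true positives) is the number of positive samples correctly predicted as positive;
• TN (true negatives) is the number of negative samples correctly predicted as negative;
• FP (false positives) is the number of negative samples incorrectly predicted as positive;
• FN (false negatives) is the number of positive samples incorrectly predicted as negative.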
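The four metrics can be computed directly from confusion-matrix counts, as in the short sketch below. The function name and the zero-division handling are choices made for this example, and the counts in the usage note are made up for illustration; they are not results from this paper.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For instance, with made-up counts `tp=150, tn=180, fp=20, fn=50` over 400 samples, the accuracy is 330/400 = 0.825 and the recall is 150/200 = 0.75.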

Datasets and Compared Methods
We evaluated our proposals and other methods on the testing dataset of 400 captioned images. Table 2 and Figure 7 summarize the results of our proposal compared with other methods.
[Table 2: surviving row label "Boosting with IoU(B_IC1, B_IC2), S_sim(C_1, C_2), with Transformer Encoder". Bold values indicate the best evaluation score.]

Discussions
First, we made use of Spotfake [23] as a training baseline approach because of its simplicity: it fine-tunes and concatenates visual and textual embeddings to classify the veracity of the news. We applied the Spotfake architecture to the given training and testing data of COSMOS. In particular, when training, we created out-of-context content by pairing captions and images from different source captioned images, and not-out-of-context content by pairing them from the same source captioned image. When evaluating, we classified both (caption_1, image) and (caption_2, image). If both captions were not-out-of-context, the triplet (caption_1, caption_2, image) was not-out-of-context; otherwise, it was out-of-context. The method gave poor results because of the different attributes of the training and testing data, and it could not generalize across this gap.
Next, as an approach that transfers from another dataset, we chose EANN. We used the same procedure as for Spotfake to evaluate performance, classifying both (caption_1, image) and (caption_2, image). On the MediaEval2015 dataset [45], EANN achieves 71.5% accuracy. However, when transferred to COSMOS, the method produced unsatisfactory results, even though MediaEval2015 consists of a large corpus of textual news and various cases of misused images, similar to the COSMOS dataset. The current approach of training on a given news dataset and transferring to another is limited in categories, domains, and types of news and may not perform well in practice.
Compared to the baseline, our methods improved accuracy by 6.5%. Furthermore, compared with the work of Tankut et al. [44], our method has equal accuracy and a higher recall and F1-score. Tankut et al. [44] took advantage of handcrafted features by matching the most relevant fake news keywords (fake, hoax, fabrication, supposedly, falsification, propaganda, deflection, deception, contradiction, defamation, lie, misleading, deceive, fraud, concocted, bluffing, made up, double meaning, alternative facts, tricks, half-truths, untruth, falsehoods, inaccurate, disinformation, misconception) and altered captions in the testing dataset with fake words ("was true" and "was not true") to compare semantic features. Our methods used various forms of semantic understanding from computer vision and natural language processing, learned on large-scale datasets, to assess the correlation between the original image and caption. The impact of each image-text matching method is also presented in our paper.
In Figures 8 and 9 we show a few examples of our false negative (FN) and false positive (FP) predictions. As the false negative cases show, the content of the news and its abstract relation with the corresponding image are hard to distinguish, even for humans, and much news needs an expert or time with search tools to verify. In the false positive cases, our method failed to distinguish between an image description (written by humans) and false news.

Conclusions
We have presented and evaluated multiple approaches to the cheapfakes detection problem and conducted experiments on the COSMOS dataset. Our work evaluates the effectiveness of different image-text matching methods, which can leverage semantic features from large-scale datasets instead of fine-tuning and concatenating features from text and images, an approach that leaves methods limited by the attributes of a given dataset. In contrast to the existing method for cheapfakes detection, we have proposed a method that takes advantage of attributes of the testing dataset instead of directly altering captions and defining handcrafted patterns based on human effort. Moreover, we have extended the experiments on the theoretical results previously described in [43]. Compared to other approaches, our methods achieve competitive results, with equal accuracy and higher recall and F1-score. Overall, we believe that our method makes a valuable contribution towards addressing misinformation in news and social media.
In the future, we will consider abstract images that popular image understanding methods cannot explain or understand without specific knowledge, such as a photo of an art painter, a personal event, a snapshot from a film, or a photo of a book cover. We will also consider mapping images and captions into a third coordinate space, where additional knowledge can bridge the semantic/knowledge gap between them. Last but not least, extending captions using domain knowledge (e.g., Hugging Face) to enrich their semantic content and utilizing content graphs extracted from images can be another promising research direction.
Data Availability Statement: The authors will make the data used in this research available on request.