Multiclass AI-Generated Deepfake Face Detection Using Patch-Wise Deep Learning Model

Abstract: In response to the rapid advancements in facial manipulation technologies, particularly those facilitated by Generative Adversarial Networks (GANs) and Stable Diffusion-based methods, this paper explores the critical issue of deepfake content creation. The increasing accessibility of these tools necessitates robust detection methods to curb potential misuse. In this context, this paper investigates the potential of Vision Transformers (ViTs) for effective deepfake image detection, leveraging their capacity to extract global features. Objective: The primary goal of this study is to assess the viability of ViTs in detecting multiclass deepfake images compared to traditional Convolutional Neural Network (CNN)-based models. By framing the deepfake problem as a multiclass task, this research introduces a novel approach that considers the challenges posed by Stable Diffusion and StyleGAN2. The objective is to improve the understanding and efficacy of detecting manipulated content within a multiclass context. Novelty: This research distinguishes itself by approaching deepfake detection as a multiclass task, which introduces new challenges associated with Stable Diffusion and StyleGAN2. The study pioneers the exploration of ViTs in this domain, emphasizing their potential to extract global features for enhanced detection accuracy. The novelty lies in addressing the evolving landscape of deepfake creation and manipulation. Results and Conclusion: In extensive experiments, the proposed method proves highly effective, achieving high detection accuracy, precision, and recall, and an F1 score of 99.90% on a multiclass-prepared dataset. The results underscore the significant potential of ViTs in contributing to a more secure digital landscape by robustly addressing the challenges posed by deepfake content, particularly in the presence of Stable Diffusion and StyleGAN2. The proposed model outperformed the state-of-the-art CNN-based models it was compared with, i.e., ResNet-50 and VGG-16.


Introduction
Over the past decade, social media content, including photos and videos, has seen a remarkable surge driven by the widespread availability of affordable devices like smartphones, cameras, and computers. The proliferation of social media platforms has facilitated the swift sharing of such content, resulting in exponential growth of online material and easy accessibility for users [1].
Simultaneously, there have been significant advancements in machine learning (ML) and deep learning (DL) algorithms, which are highly efficient in manipulating audiovisual content [1]. Unfortunately, this technological progress has also enabled the creation and dissemination of deepfakes, i.e., synthetic audio and video content generated using AI algorithms [2,3]. The rapid development of deepfake technology poses a serious threat [4], as it can be used to spread disinformation globally and potentially sway public opinion. The ease of spreading false information can be exploited in cases such as election manipulation or character defamation.
As deepfake creation becomes more sophisticated, the authentication and verification of video evidence in legal disputes and criminal court cases could become increasingly challenging [5]. Ensuring the integrity and reliability of video submitted as evidence will demand significant scrutiny, particularly in the face of advanced deepfake techniques [6]. Moreover, the exponential growth of social media content and the evolution of deepfake technology raise concerns about the potential misuse and manipulation of information, demanding further attention from researchers, policymakers, and the technology community [7]. The production of high-resolution deepfake images relies on intricate algorithms commonly based on DL models like GANs. These complex DL techniques are crucial in creating realistic and convincing synthetic images [8].
The proliferation of deepfake technology gives rise to numerous concerns and potential dangers across various industries [9]. One significant area impacted is cybersecurity [10], where the ability to manipulate facial photos convincingly raises alarms about identity theft, deception, and unauthorized access to sensitive information. Moreover, the widespread use of deepfakes poses a substantial risk to public trust, as malicious individuals can exploit this technology to create deceitful visual cues, propagate misinformation, or tarnish the reputations of others [11]. Due to these issues, researchers and academics have been focusing on devising methods to detect and mitigate the adverse effects of deepfakes. By developing advanced approaches, they aim to safeguard individuals and organizations from the potential harms posed by this evolving technology [12]. This involves harnessing the progress made in computer vision, machine learning, and forensic analysis to detect crucial indicators of image manipulation and effectively differentiate between authentic and manipulated facial images [13].
Various approaches have been put forward to detect deepfakes, and a significant portion relies on deep learning techniques [14]. The United States Defense Advanced Research Projects Agency (DARPA) has initiated a media forensic research project to develop effective methods for detecting fake media [15]. This endeavor reflects the growing importance of addressing the challenges posed by deepfake technology in safeguarding the authenticity and credibility of digital media content [15]. Additionally, Facebook, in collaboration with Microsoft, has introduced an AI-based deepfake identification challenge. This joint effort signifies the industry's commitment to combating the risks associated with deepfake technology by fostering the development of advanced AI solutions for detecting and countering deceptive media content [16].
Recently, numerous prominent techniques have been put forward for identifying fake images. However, these models often exhibit limited generalization capability, leading to a drop in performance when faced with the latest deepfake or manipulation methods. Akhtar et al. [17] considered the Convolutional Neural Network (CNN)-based SqueezeNet [18], VGG16 [19], ResNet [20], DenseNet [21], and GoogleNet [22] in their study on the identification of face manipulation. The models demonstrated impressive accuracy when tested on the same manipulation type they were trained on. However, their performance declined when confronted with novel manipulations that were not part of their training dataset. To address these issues, this study adopts the Vision Transformer (ViT) model. During training, the input image is divided into patches, and each patch is treated as a separate entity. The ViT employs self-attention modules to model the relationships between these embedded patches. The ViT has demonstrated exceptional performance in standard classification tasks by emphasizing important features while reducing the impact of noisy ones through its self-attention mechanism. Inspired by this perspective, this study proposes a deepfake image identification network based on the ViT. The experimental results indicate that the proposed network achieves satisfactory outcomes in deepfake image detection. This research contributes to the field in the following ways:

•
Our primary contribution lies in addressing deepfake detection as a multiclass classification task. To our knowledge, no prior work has tackled this specific aspect, and our study represents a pioneering effort in this area. By approaching deepfake detection through the lens of multiclass classification, we aim to enhance the accuracy and efficacy of identifying and categorizing deepfake content, thereby advancing the field's understanding and capabilities in combating this evolving challenge.

•
We have compiled and curated our dataset specifically for multiclass deepfake identification. This dataset is carefully designed to facilitate the training and evaluation of our deepfake detection model, allowing us to explore the complexities of multiclass classification and improve the accuracy of deepfake identification.

•
The proposed fine-tuned ViT model exhibits superior performance to state-of-the-art deepfake identification models.

•
Following an extensive analysis, our research establishes the robustness and generalizability of the proposed method, which surpasses numerous state-of-the-art techniques. The findings validate the effectiveness and reliability of our approach in the field of deepfake detection.
The remainder of this paper is organized as follows. Section 2 surveys existing methods, emphasizing the role of the ViT. Section 3 outlines the methodology of the ViT's application, while the experimental results showcase its effectiveness. Section 4 interprets the findings and outlines future implications for multimedia forensics, and Section 5 concludes this study.

Related Works
The proliferation of deepfake technology has ushered in a new era of challenges in the realm of multimedia forensics and information veracity. Prior research has underscored the need for innovative methods to detect and combat the manipulation of digital content [23]. Early efforts in deepfake detection centered around traditional signal processing and image analysis techniques. Researchers leveraged facial landmarks, inconsistencies in lighting, and unnatural facial movements as indicators of potential manipulation. However, the rapid advancement of GANs led to the creation of more convincing and challenging-to-detect deepfakes, necessitating a shift towards more sophisticated detection methods. Akhtar and Dasgupta [24] investigated the feasibility of utilizing local feature descriptors to recognize manipulated faces. Their study presented a comparative experimental analysis of ten local feature descriptors, employing the 'DeepfakeTIMIT' database as a testing ground.
Bekci et al. [25] presented a deepfake detection system that leverages metric learning and steganalysis-rich models to enhance performance against unseen data and manipulations. To evaluate the effectiveness of their approach, an empirical analysis was conducted using openly accessible datasets, including FaceForensics++, DeepFakeTIMIT, and CelebDF. The suggested framework demonstrated accuracy improvements ranging from 5% to 15% when faced with unseen manipulations. Li et al. [26] investigated the differences in eye-blinking patterns between deepfake videos and genuine human subjects. Based on their observations, they developed a novel eye-blinking detection technique tailored specifically to identifying deepfake videos.
In their study, Nguyen et al. [27] used the eyebrow region as a set of features to identify deepfake videos. They applied four deep learning methods (LightCNN, ResNet, DenseNet, and SqueezeNet) for this purpose. On the UADFV and Celeb-DF datasets, the highest AUC (Area Under Curve) values obtained were 0.984 and 0.712, respectively.
Patel et al. [28] introduced Trans-DF, a deepfake detection method relying on random forests. The Trans-DF model demonstrated impressive detection accuracy, achieving a high score of 0.902, highlighting its effectiveness in identifying deepfake videos. Another approach was presented by Yang and colleagues, utilizing SVM classifiers to differentiate between deepfake images and videos. Their method capitalized on variations in head poses as essential features for discrimination. With this technique, they created a system with a noteworthy AUROC score of 0.890, effectively detecting and distinguishing deepfake content.
Ciftci et al. [29] presented a technique to trace the origins of deepfake content by scrutinizing biological cues within residuals. This study marked the first application of biological indicators to the detection of deepfake sources. The researchers performed experimental assessments on the FaceForensics++ dataset, incorporating numerous ablation tests to affirm the validity of their method. Notably, they attained an accuracy of 93.39% in source identification across four distinct deepfake generators. These results emphasize the efficacy of their proposed approach and its promising ability to accurately trace the roots of deepfake content.
In 2022, Yang et al. [30] introduced a deepfake detection model named MSTA_Net, leveraging machine learning techniques. This model specifically examined the texture properties of an image to discern abnormalities indicative of deepfake alterations. Unlike other approaches that focused solely on facial regions, the MSTA_Net model considered the entire image. By establishing connections between manipulated and unmanipulated areas within the image, the model identified irregularities in texture and signaling variations as potentially fake. Conversely, when no irregularities were detected, the image received a non-fake label, suggesting a higher likelihood of authenticity. Their proposed model facilitated the identification of genuine and manipulated images based on their overall texture characteristics. In recent studies, the prominence of multi-attentional and transformer models has grown significantly in the area of deepfake detection [31]. Overall, the multi-modal, multi-scale transformer model presented by Wang et al. [32] offers a promising approach to deepfake detection. By enabling the analysis of image patches at different spatial levels and utilizing multiple modalities, the model aims to improve accuracy and robustness in identifying deepfake content.
CNNs have demonstrated remarkable efficacy in detecting deepfake content, underscoring their importance in this field. Despite their proficiency in extracting features from small objects, CNNs may encounter challenges in precisely identifying key regions within an image. Leveraging a ViT model for deepfake identification presents an intriguing and promising alternative. ViTs were originally introduced for image classification tasks and have demonstrated strong performance on various computer vision benchmarks [33]. There are many reasons to choose ViTs for this study, of which the main ones are listed below.

•
Attention Mechanism: ViT models utilize self-attention mechanisms, which allow them to capture long-range dependencies within an image. This is crucial for detecting subtle inconsistencies and artifacts that might be present in deepfake images. Deepfake generation often involves stitching or blending different parts of images, and attention mechanisms can help identify these anomalies.

•
Global Context: Classic CNNs excel at extracting details from local regions, whereas ViT models take in the complete image as a sequence of patches, allowing them to grasp the global context. This difference can be beneficial for deepfake detection, as it lets the model scrutinize the overall structure and consistency of an image.

•
Robustness to Manipulations: ViT models might exhibit increased robustness to common manipulation techniques used in deepfake generation. Their attention mechanisms can potentially make them more resistant to simple modifications like noise addition or small alterations in pixel values.

•
Interpretable Attention Maps: ViT models generate attention maps that indicate which parts of an image are considered the most important for making predictions. These maps could provide insights into how the model distinguishes between real and deepfake images, aiding in understanding and improving the model's decision-making process.

Proposed Methodology
This section presents the methods used and proposed to identify fake images accurately. These methods are designed to enhance the precision and effectiveness of detecting fake content and distinguishing it from genuine content.

Dataset
For our experiment, we utilized a dataset sourced from Kaggle [34], an online source [35], Stable Diffusion [36], and the StyleGAN2 encoding of Stable Diffusion [37]. We used the free version of the TPU (Tensor Processing Unit) provided by Google Colab both to prepare the dataset and to run the research experiments.

1.
Real Images: We considered Kaggle [34] for real images; due to the limitation of computation power, we considered 10K images from this source.

2.
Online Source: We obtained GAN-based fake images from an online source [35]. This source provides new fake images with each visit, giving us access to a diverse and up-to-date dataset for our analysis and experimentation.

3.
Stable Diffusion: Diffusion-based fake images were obtained with Stable Diffusion (see Equation (1)) [36]. In Equation (1), the models can be understood as a series of equally weighted denoising autoencoders, denoted as $\epsilon_\theta(x_t, t)$ for $t = 1 \ldots T$. These autoencoders are trained to predict a denoised version of their input, where $x_t$ represents a noisy version of the input $x$.

4.
StyleGAN2 encoding of Stable Diffusion: This dataset is available on Kaggle [37] under the name Synthetic Faces High Quality (SFHQ). It comprises high-quality, curated 1024 × 1024 face images and was created through a multi-step process. First, a large number of "text to image" generations were produced, primarily using Stable Diffusion v2.1, along with some from Stable Diffusion v1.4 models. Subsequently, a set of photo-realistic candidate images was generated by encoding these images into the latent space of StyleGAN2 and applying a small manipulation to enhance each image into a high-quality, photo-realistic candidate. This process ensured that the dataset contained diverse and visually appealing face images, enabling us to conduct comprehensive and accurate analyses in our research. StyleGAN2 is mathematically based on a generator network (G), a mapping network (F), a noise vector (z), a conditional vector (y), and a style vector (s); Equation (2) is used to synthesize the image x.
Computers 2024, 13, x FOR PEER REVIEW 6 of 19

$$x = G(z, y, s) \tag{2}$$
The style vector (s) is computed by the mapping network (F) using Equation (3).
In the context of StyleGAN2, the generator G and the mapping network F are trained to generate high-quality images by considering the style information (s) along with the noise (z) and conditioning (y) inputs.
In our research, we focus on four distinct classes and address the deepfake detection problem using a multiclass approach. By considering multiple classes (Real: 10,000, GAN_Fake: 10,000, Diffusion_Fake: 10,000, and Stable&Gan_Fake: 10,000), we aim to enhance the precision and reliability of our deepfake detection model, accommodating a broader range of deepfake variations and increasing its potential for real-world applications.
To overcome the challenge of class imbalance and potential model bias, we prepared the dataset in a balanced format. By ensuring each class has a similar representation, we aim to create a more equitable training environment for our deepfake detection model. This approach helps mitigate the impact of overrepresented or underrepresented classes, leading to a fairer and more robust model capable of accurately identifying deepfake content across all classes. Sample images from the prepared dataset can be found in Table 1.
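The balanced preparation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the file lists, the directory layout, and the helper name `balanced_subset` are hypothetical and are not taken from the authors' pipeline; only the four class names and the 10,000-per-class target come from the text.

```python
import random
from collections import Counter

# Class names and per-class size as stated in the text.
CLASSES = ["Real", "GAN_Fake", "Diffusion_Fake", "Stable&Gan_Fake"]
PER_CLASS = 10_000


def balanced_subset(paths_by_class, per_class=PER_CLASS, seed=0):
    """Draw the same number of samples from every class (hypothetical helper)."""
    rng = random.Random(seed)
    samples = []
    for label, paths in paths_by_class.items():
        if len(paths) < per_class:
            raise ValueError(f"class {label!r} has only {len(paths)} images")
        # Sample without replacement so no image is repeated within a class.
        samples.extend((path, label) for path in rng.sample(paths, per_class))
    rng.shuffle(samples)
    return samples


# Hypothetical source pools of unequal size, standing in for the raw downloads.
pools = {
    name: [f"{name}/{i:06d}.jpg" for i in range(PER_CLASS + 500 * k)]
    for k, name in enumerate(CLASSES)
}
dataset = balanced_subset(pools)
counts = Counter(label for _, label in dataset)
print(counts)
```

Sampling a fixed number per class, instead of reweighting the loss, keeps every class equally represented in each training epoch, which matches the balanced format described in the text.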

Table 1. Sample images from the prepared dataset (columns: Real, GAN_Fake, Diffusion_Fake, Stable&Gan_Fake).

ViT Architecture
In this section, we introduce the ViT framework, delving into its core principles, structure, self-attention mechanism, multi-headed self-attention, and the mathematical foundations that shape its design. The ViT emerged in 2020 [38] as a groundbreaking paradigm in computer vision, revealing its potential to redefine our approach to image analysis and comprehension. Rooted in the Transformer architecture crafted for natural language processing, the ViT introduces a novel concept by treating images as sequences of tokens, commonly represented by image patches. With the transformer design, the ViT processes these token sequences, enabling effective image analysis and understanding in a sequence-based manner.
A key strength of the ViT lies in its adaptability and versatility. The foundational transformer architecture has demonstrated remarkable success across diverse tasks, including image restoration and object detection. This underscores the broad applicability and effectiveness of the ViT framework, positioning it as a potent tool in the field of computer vision with the potential to revolutionize our approach to image-related tasks [39].
Tokenization and embedding are crucial steps within the ViT architecture. The input image is first divided into a grid of non-overlapping patches. Subsequently, these patches are flattened and transformed into a higher-dimensional space through a linear operation, followed by normalization. This method endows the ViT model with the capability to capture both global and local information from the image, promoting comprehensive learning. It enables the model to effectively grasp the intricate features and context of the image. The synergy between tokenization and embedding plays a pivotal role in empowering the ViT to excel in a variety of computer vision tasks.
The ViT architecture can be represented mathematically by letting X be the set of image patches extracted from the input image. Each patch is a vector representing a portion of the image. The set of patches X is given in Equation (4), where N is the number of patches.
$X = \{x_1, x_2, x_3, \ldots, x_N\}$ (4)
The ViT model consists of several components, which are listed below (also see Figure 2).
• Patch Embedding: The image patches $(x_1, x_2, \ldots, x_N)$ are linearly projected to an embedding space by a linear transformation $W_{patch}$ (see Equation (5)).
• Transformer Encoder: The transformer encoder processes the positional embeddings $E_{pos}$. This encoder comprises several layers, each incorporating self-attention mechanisms and feedforward neural networks. The result of this encoding is a collection of contextualized embeddings, as depicted in Equation (6). In Equation (7), $(z_1, z_2, \ldots, z_N)$ represents the output representations or embeddings produced by the Transformer encoder for each position in the input sequence.
$\mathrm{TransformerEncoder}(E_{pos}) = \{z_1, z_2, \ldots, z_N\}$
• Classification Head: The final contextualized embeddings Z are used for downstream tasks. In classification tasks, a classification head takes the average or a specific token's embedding (e.g., the classification token) from Z and passes it through one or more fully connected layers to make predictions.
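As a rough illustration of the patch extraction and patch embedding steps listed above, the following sketch splits a small image into non-overlapping patches and applies a linear projection. It is a minimal pure-Python sketch and not the authors' implementation: the function names `split_into_patches` and `project_patches`, the single-channel input, and the toy projection matrix are assumptions made for illustration, and the positional embeddings and the encoder itself are omitted.

```python
def split_into_patches(image, patch_size):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order (left to right, top to bottom).
    Assumes a single-channel image whose sides are multiples of patch_size.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


def project_patches(patches, w_patch):
    """Linearly project each flattened patch with a (patch_dim x embed_dim) matrix."""
    columns = list(zip(*w_patch))  # one tuple of weights per embedding dimension
    return [
        [sum(x * w for x, w in zip(patch, column)) for column in columns]
        for patch in patches
    ]


# Toy example: a 4 x 4 image, 2 x 2 patches, 2-dimensional embeddings.
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
toy_patches = split_into_patches(toy_image, 2)
toy_embeddings = project_patches(toy_patches, [[1, 0], [0, 1], [1, 0], [0, 1]])
```

In a real ViT the projection matrix is learned and the patches carry all colour channels; the toy matrix here only makes the arithmetic easy to follow.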

The ViT design centers around the Multi-head Self-Attention (MSA) mechanism, which plays a pivotal role in the model's capabilities. MSA allows the ViT to attend to multiple parts of the image simultaneously. It consists of distinct "heads", with each head independently computing attention. By focusing on different regions of the image, these attention heads produce various representations, which are then concatenated to generate the final image representation. This approach enables the ViT to capture intricate interactions between input elements. However, it comes at the cost of increased complexity: using multiple attention heads and aggregating their outputs requires more computational resources. The mathematical representation of MSA can be seen in Equation (8).
In Equation (8), Q, K, and V stand for the query, key, and value matrices, respectively, and $H_1, H_2, \ldots, H_n$ represent the outputs of the attention heads, where each $H_i$ is the output of the i-th head. In transformers, a multi-head attention mechanism uses multiple sets of attention weights (attention heads) to capture different aspects of the relationships in the input data.
The self-attention mechanism is the foundational component of transformers for explicitly modeling interactions and relationships across all elements of a sequence. Unlike CNNs, which depend on local receptive fields, the self-attention layer gathers features from the entire input sequence, allowing it to capture both local and global information. This characteristic distinguishes self-attention from CNNs and promotes a more comprehensive representation of information, leading to improved performance in various sequence-based tasks.
The attention mechanism computes the dot product between the query and key vectors, followed by normalization using SoftMax. It then modulates the value vectors to generate an enhanced output representation, a task carried out in the CLS block. Figure 2 is the base abstract architectural diagram of the ViT model [38].
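The attention computation described above can be sketched in a few lines. This is a minimal pure-Python sketch under stated assumptions rather than the model's actual code: it follows the standard Transformer formulation, so the scaling by the square root of the key dimension is an assumption not spelled out in the text, the learned projection matrices that produce Q, K, and V (and the output projection after concatenation) are omitted, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable SoftMax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return one output vector per query: SoftMax(q . k / sqrt(d_k)) weighted sum of values."""
    d_k = len(keys[0])
    outputs = []
    for query in queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys
        ]
        weights = softmax(scores)
        outputs.append(
            [
                sum(weight * value[dim] for weight, value in zip(weights, values))
                for dim in range(len(values[0]))
            ]
        )
    return outputs


def multi_head_attention(heads):
    """Concatenate the per-token outputs H_1, ..., H_n of several attention heads.

    `heads` is a list of (Q, K, V) triples, one triple per head.
    """
    head_outputs = [scaled_dot_product_attention(q, k, v) for q, k, v in heads]
    num_tokens = len(head_outputs[0])
    concatenated = []
    for token in range(num_tokens):
        merged = []
        for head_output in head_outputs:
            merged.extend(head_output[token])
        concatenated.append(merged)
    return concatenated
```

Each head attends over the whole token sequence, which is what gives the model its global view of the image; the cost is that the score computation grows with the square of the number of tokens.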

ViT Hyper-Parameters
In this study, the initial images are preprocessed by scaling them to 224 × 224 pixels and dividing them into patches of 16 × 16 pixels. This technique breaks the image down into smaller fixed-size patches, each 16 pixels wide and 16 pixels high.
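The arithmetic behind these sizes can be checked with a short helper. This is only an illustrative sketch: the function name `patch_grid` is hypothetical, and the default of three colour channels is an assumption (RGB input) that the text does not state.

```python
def patch_grid(image_size, patch_size, channels=3):
    """Return patch counts and the flattened patch length for a square image.

    channels=3 assumes RGB input, which is an assumption of this sketch.
    """
    if image_size % patch_size:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return {
        "patches_per_side": per_side,
        "num_patches": per_side * per_side,
        "patch_dim": patch_size * patch_size * channels,
    }


grid = patch_grid(224, 16)
```

For a 224 × 224 image and 16 × 16 patches this gives 14 patches per side and 196 patches in total; with three channels, each flattened patch has 768 values.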
The model employed in this study was pretrained on a substantial dataset known as ImageNet-21k. This dataset encompasses around 14 million images categorized into 21,841 distinct classes, making it well suited to large-scale image classification tasks. The model's architecture comprises 12 transformer layers, each with 768 hidden units. Its overall capacity is reflected in its 85.8 million trainable parameters. The values and configurations of the parameters used in the ViT model are detailed in Table 2.
Figure 3 shows the abstract-level diagram of the proposed methodology. This diagram provides an overview of the key components and steps involved in our approach (dataset preparation, preprocessing, splitting, model tuning, training, and evaluation), offering a visual representation of how our method operates and achieves its objectives.

CNN Architecture-Based Pretrained Models
The primary objective of this study was to identify recently manipulated deepfake images, specifically those generated using Stable Diffusion and StyleGAN2. This research is an early effort not only in recognizing these manipulated images but also in addressing the challenge in a multiclass context.
To demonstrate the effectiveness of the patch-based approach over traditional CNN-based pretrained models such as VGG16 and ResNet50, this study employed a fine-tuning approach. The models were preloaded with weights from the ImageNet dataset using a weight transfer technique. In this process, the network layers were frozen, and the last fully connected layers were omitted from the architectures.
To adapt these models for our purposes, a flatten layer was introduced in place of the removed fully connected layers, and a dense layer with four neurons was added. The activation function was set to SoftMax to handle the multiclass nature of the problem. This approach aims to show that, for manipulated deepfake image detection, the patch-based approach can outperform the more conventional CNN-based pretrained models. The CNN-based models were selected as baselines mainly because they rely on local feature extraction.
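The head added on top of the frozen backbone can be illustrated without any deep learning library. The sketch below is a pure-Python stand-in for a flatten layer followed by a four-neuron dense layer with SoftMax; it is not the authors' code. The class names are taken from the labels used in this paper, but their order, the function names, and the toy weights are assumptions for illustration only.

```python
import math

# Class labels as they appear in this paper; the index order is an assumption.
CLASS_NAMES = ["Real", "GAN_Fake", "Diffusion_Fake", "Stable&GAN_Fake"]


def flatten(feature_map):
    """Flatten an H x W x C feature map (nested lists) into one flat list."""
    flat = []
    for row in feature_map:
        for pixel in row:
            flat.extend(pixel)
    return flat


def dense_softmax(features, weights, biases):
    """Dense layer (num_features x num_classes weights) followed by SoftMax."""
    logits = [
        sum(f * w for f, w in zip(features, column)) + bias
        for column, bias in zip(zip(*weights), biases)
    ]
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [value / total for value in exps]


def predict_class(feature_map, weights, biases):
    """Return the most probable class name and the full probability vector."""
    probabilities = dense_softmax(flatten(feature_map), weights, biases)
    best = probabilities.index(max(probabilities))
    return CLASS_NAMES[best], probabilities
```

In the fine-tuning setup described above, only the weights of this head would be trained, while the frozen backbone supplies the feature map.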

Experiment Results and Discussion
In this section, we present a comprehensive discussion of the evaluation measures, experimental details, and the results obtained through the proposed methodology. We describe the assessment criteria used to gauge the performance of our approach, provide insights into the experimental setup and configurations, and present the outcomes achieved during our evaluation process.

Evaluation Metrics
In machine learning and deep learning, evaluation metrics play a vital role in gauging model performance. These measures are fundamental in statistical research and are essential in assessing the effectiveness of our proposed model. In this study, we emphasized the following key assessment measures [40] to evaluate the efficacy of our approach. In Equations (9)-(12), TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative, respectively.
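For reference, the four measures can be computed from these counts as in the sketch below. It uses the standard textbook definitions of accuracy, precision, recall, and F1, which are assumed here to match Equations (9)-(12); the function name `classification_metrics` and the example counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts.

    Ratios with a zero denominator are reported as 0.0.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for one class treated as "positive" against the rest.
example = classification_metrics(tp=90, tn=95, fp=5, fn=10)
```

In a multiclass setting such as this one, the counts are typically taken per class (one class against all others), which is how class-wise precision, recall, and F1 are obtained.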


• Accuracy: Accuracy assesses the overall correctness of the model's predictions. It is the proportion of correctly classified samples out of the total samples. While accuracy is a crucial evaluation measure, it may not be sufficient in certain scenarios, such as imbalanced datasets or cases where different types of errors have varying consequences. In such situations, additional evaluation metrics may be necessary to provide a more comprehensive understanding of the model's performance and capabilities.
In our evaluation of the proposed model, we also employed class-wise precision, recall, and F1 score to assess its performance, as presented in Table 4. The support column indicates the number of samples available for each class in the testing dataset. For example, the Real class consists of 997 samples, and the Diffusion Fake class also comprises 997 samples for testing purposes. The support column sums to 4000, the total number of samples tested in our evaluation. These class-wise metrics show how effectively the model classifies each class and how it performs across the entire dataset.
Table 5 showcases the actual and predicted labels with the ViT model. The table contains three columns: "Images", "Predicted", and "Actual". Each row corresponds to a different image. The "Predicted" column displays the labels that the ViT model assigned to the images, while the "Actual" column shows the true labels. Table 5 demonstrates that the ViT model predicted the labels correctly for all the tested images except the last one. The excessive use of filters and a side pose could be the reasons for this misclassification. This suggests that the ViT model is effective in classifying different image types based on the provided data. Furthermore, to test any image in the future, please follow the steps outlined in the https://github.com/Muhammad-Asad-Arshed/MultiClass_DeepFake.git (accessed on 25 November 2023) repository.

1. Download our proposed model weights.
2. Install the necessary libraries.
3. Extract the model weights that are in the RAR file and load the model.
4. Upload an image to Google Colab and set the path in the "img" variable.
5. Run the "Prediction" cell to get the class.

Images    Predicted    Actual
(image)    GAN_Fake    GAN_Fake
(image)    Diffusion_Fake    Diffusion_Fake
(image)    Stable&GAN_Fake    Stable&GAN_Fake

Comparison with CNN-Based Pretrained Architectures
To demonstrate its robustness and highlight the effectiveness of global feature extraction over local feature extraction in deepfake identification, our proposed model was evaluated against established CNN-based models. This comparison underscores the model's capability to capture comprehensive patterns across the entire dataset, emphasizing its potential superiority in discerning deepfake content.
We achieved a training accuracy of 0.77, a validation accuracy of 0.78, and a test accuracy of 0.77 with a fine-tuned ResNet-50 model [20]. The graphical representation of the learning graph can be seen in Figure 5.

Model    Train Accuracy    Validation Accuracy    Accuracy    Precision    Recall    F1

Comparison with the Literature Contributions
It is important to acknowledge that a direct comparison with existing studies in the field of deepfake identification may not be feasible due to the nature of our research. Our study is foundational work that focuses specifically on predicting deepfakes as a multiclass problem. In contrast, most existing studies are based on binary classification [41–47], distinguishing between real and fake images (see Table 7). Because our approach tackles multiclass deepfake identification, it introduces challenges and considerations that differentiate it from previous research. Caution should therefore be exercised when drawing direct comparisons with binary-based studies, as the contexts and objectives of these approaches differ significantly. Our research seeks to contribute to the field by exploring the capabilities and limitations of multiclass deepfake detection, paving the way for further advancements in this emerging study area.
Additionally, this study pioneers a multi-classification approach to deepfake detection, a previously unexplored aspect, thereby advancing the field's understanding and effectiveness in countering evolving deepfake challenges. The creation of a dedicated dataset for multiclass deepfake identification facilitates enhanced model training and accuracy. Introducing a fine-tuned ViT model that surpasses state-of-the-art techniques underscores the research's advancements. Moreover, this study establishes the proposed method's robustness and generalizability through extensive analysis, reinforcing its reliability for combating diverse deepfake scenarios and content types.

Implications
This study introduces a novel theoretical perspective by framing deepfake detection as a multiclass task, acknowledging the diversity in manipulation techniques like Stable Diffusion and StyleGAN2. The application of ViT for global feature extraction represents a theoretical advancement, expanding beyond traditional CNNs. Recognizing and addressing challenges posed by advanced techniques contributes to a nuanced understanding of deepfake intricacies. On a practical level, the proposed ViT-based method demonstrates exceptional accuracy (99.90%) on a multiclass-prepared dataset, highlighting its robustness in countering deepfake threats. The comparison with state-of-the-art CNN models provides a practical benchmark, emphasizing the ViT's superiority and contributing significantly to a more secure digital landscape.

Conclusions
Deepfakes have emerged as a prominent technique for disseminating misinformation and manipulating visual content. While not all deepfake creations are inherently malicious, it is essential to identify and address such content, as some instances can pose significant threats to society. In this study, we focused on the critical task of multiclass deepfake identification and evaluated the effectiveness of the ViT in detecting deepfake images. The inherent global feature mapping and self-attention mechanisms of the ViT proved to be highly effective in discerning deepfake content. Through rigorous evaluation across various image manipulation and generation techniques, our approach achieved an exceptional accuracy of 99.90%. These results highlight the ViT's potential to combat deepfake content and promote trust and integrity in digital media.
Our future research will focus on expanding the scope of the current work by incorporating additional datasets specifically curated and released for deepfake research. This expansion is essential to enhance the diversity, accuracy, and overall robustness of our methods and findings and to address the ever-evolving challenges posed by deepfake technology. Our ongoing efforts aim to advance deepfake detection and to help build a more secure and trustworthy digital landscape.

Figure 1. Sample diagram of text-to-image generation with Stable Diffusion.

Computers 2024, 13, x FOR PEER REVIEW

Figure 3. Abstract-level diagram of the proposed methodology.

Figure 4. Training and validation loss of ViT model.
The graphs in Figure 4 provide valuable insights into the model's learning progress and its ability to optimize the training process, ultimately improving performance and accuracy.


Figure 5. Training and validation graph of fine-tuned ResNet-50 model.
The fine-tuned VGG-16 model [19] demonstrated noteworthy performance compared to the ResNet-50 model, achieving a training accuracy of 0.95 and a validation accuracy of 0.93. The model's effectiveness extends to the test dataset, where it maintains a robust accuracy of 0.94. For a comprehensive visual representation of these results, see Table 6 and Figure 6, which illustrate the efficacy and reliability of the VGG-16 model.

Figure 6. Training and validation graph of fine-tuned VGG-16 model.

Table 5. Actual vs. predicted using ViT model.

Table 6. Proposed model comparison with local feature extraction-based pretrained models (based on two decimal place evaluation scores).

Table 7. Analysis of the proposed study in contrast to existing state-of-the-art studies.