A Robust Approach to Multimodal Deepfake Detection

The widespread use of deep learning techniques for creating realistic synthetic media, commonly known as deepfakes, poses a significant threat to individuals, organizations, and society. Since the malicious use of such media can lead to harmful situations, it is becoming crucial to distinguish between authentic and fake content. Although deepfake generation systems can create convincing images and audio, they may struggle to maintain consistency across different data modalities, for example when producing a realistic video sequence in which both the visual frames and the speech are fake and consistent with each other. Moreover, these systems may not accurately reproduce the semantic and temporal aspects of the content. All these elements can be exploited to perform a robust detection of fake content. In this paper, we propose a novel approach for detecting deepfake video sequences by leveraging data multimodality. Our method extracts audio-visual features from the input video over time and analyzes them using time-aware neural networks. We exploit both the video and audio modalities to leverage the inconsistencies between and within them, enhancing the final detection performance. The peculiarity of the proposed method is that we never train on multimodal deepfake data, but on disjoint monomodal datasets which contain visual-only or audio-only deepfakes. This frees us from requiring multimodal datasets during training, which is desirable given their scarcity in the literature. Moreover, at test time, it allows us to evaluate the robustness of the proposed detector on unseen multimodal deepfakes. We test different fusion techniques between data modalities and investigate which one leads to more robust predictions by the developed detectors. Our results indicate that a multimodal approach is more effective than a monomodal one, even if trained on disjoint monomodal datasets.


Introduction
Recent advances in deep learning and new media technologies have made the creation and sharing of multimedia content more accessible than ever. Users can now generate highly realistic synthetic images, videos and speech tracks with minimal effort and without requiring any particular skill. The growth of these technologies can have a twofold effect. On the one hand, such techniques allow consumers to explore new creative and artistic possibilities and introduce applications that make everyday life easier. On the other hand, they can also lead to dangers and threats when misused. Deepfakes are an example of the latter case: synthetic multimedia content generated through deep learning techniques that depicts individuals in actions and behaviors that are not their own.
Deepfakes have already been used for several malicious purposes, including the publication of fabricated results in scientific journals [1] and attacks on the identity checks used by banks through synthetic voices [2] and videos [3], raising concerns about them and their use. In response to this phenomenon, the research community has prioritized the development of algorithms to discriminate real content from deepfakes [4]. Several approaches have been proposed and multiple deepfake databases have been created to push the research in this direction. Since deepfake technologies continue to advance and produce more realistic results, developing detection methods based on diverse strategies and operating principles is crucial to combat this issue.
Focusing on the analysis of video sequences, the scientific community has put forward methods for detecting deepfakes by analyzing both their audio and visual contents, as the deepfake phenomenon has impacted each of these [5]. However, while the developed detectors can demonstrate impressive performance in controlled environments, their effectiveness is somewhat limited in other scenarios. For instance, most of the classifiers are monomodal, meaning that they take into account only one data modality (i.e., either visual or audio) at a time, which makes them ineffective against certain types of deepfake videos.
Visual-only detectors, for example, can be deceived by audio deepfakes, while audio-only detectors are vulnerable to deepfakes that manipulate visual content [6]. Furthermore, some information is lost during these analyses, such as the consistency between modalities, which is sometimes crucial for detecting synthetic content. To overcome these limitations, multimodal approaches have recently been proposed, which combine information from various domains to enhance the accuracy of the detection process [7,8].
Despite their excellent performance, even multimodal methods are not immune to the problem of robustness. This refers to the ability of the detector to maintain high accuracy when processing new unseen data, different from those used in training. This aspect is crucial in multimedia forensics, as it determines how well the developed systems can be applied in real-world scenarios. To address the robustness issue, researchers have explored several directions, such as considering detectors based on different approaches and using a variety of datasets in training.
For instance, there exists a set of detectors known as semantic, which base their predictions on high-level aspects of the media under analysis [9,10]. The rationale behind these methods is that deepfake generators can reproduce low-level features but struggle with more complex aspects, making it possible to differentiate between real and synthetic data. Furthermore, these high-level features are less subject to post-processing operations applied to the data and domain changes, allowing for more robust and reliable predictions.
Using different training datasets helps the developed detector avoid overfitting to a single data type and generalize as much as possible, improving the robustness of the final model. However, in the current literature, it is common practice to train and test the developed detectors on subsets of data extracted from the same dataset [11]. This practice can be deceptive since the high performance achieved may not be reflected when the methods are tested on different datasets. Cross-dataset tests are needed to assess the actual discrimination capabilities of the detectors.
Moreover, all the currently proposed multimodal detectors have been trained on multimodal datasets, thus requiring the presence of data of this type during the training phase. This poses an additional challenge since multimodal deepfake datasets are scarce in the literature, while monomodal ones are widely available. For instance, the literature reports several deepfake audio datasets that do not include any visual content. Deepfake video datasets are available as well, though the audio tracks related to the synthetic video sequences are often taken from original speech.
In this paper, we present a new multimodal video deepfake detection method that combines visual and audio information. To determine the authenticity of the input video sequence, we combine a set of data-driven features extracted from the visual content with a set of speaker-identity features extracted from the audio content.
The peculiarity of the proposed detector is that its training phase does not take place on multimodal deepfake data but on monomodal samples. In other words, we never train our detector over video sequences that contain fully-synthetic data, i.e., where both visual and audio contents are deepfakes. During the training phase, we combine the features derived from synthetic audio and synthetic visual data extracted from disjoint monomodal datasets, meaning that we do not require any additional material with respect to training standard monomodal detectors.
We evaluate the performance of our method on several state-of-the-art multimodal video deepfake datasets by considering various fusion strategies between the two modalities. Our results show that a multimodal approach is both more effective and more robust than a monomodal one, and indicate high generalization capabilities on unseen data.
The rest of the paper is structured as follows. Section 2 provides the reader with some knowledge regarding detection methods for audio and video deepfakes. Section 3 explains the details of the tackled problem and the proposed methods to fuse the audio and visual modalities. Section 4 describes the experimental setup used to validate the presented system, including details on the considered datasets. Section 5 collects all the achieved results, providing detailed comments. Finally, Section 6 concludes the paper and outlines possible future works.

Deepfake Detection
In this section we introduce the reader to the deepfake detection task, providing a literature overview for the visual-only, audio-only and audio-visual deepfake detection scenarios.

Visual-Only Deepfake Detection
The rise of deepfake generation methods has posed a growing threat, leading to the development of numerous techniques to detect counterfeit videos and mitigate the damage they can cause. Generally, detection techniques leveraging visual content can be grouped into two categories, based on the approach they consider. The first group relies on manually-crafted features, while the second makes use of deep learning-based features.
Early forgery detection methods primarily depend on handcrafted features such as facial landmarks [12][13][14], optical flow [15] and various digital image processing techniques designed to enhance the visibility of artifacts [16].
With the advancement of video deepfake generation techniques and the higher quality of produced media, detecting deepfake video frames is becoming increasingly challenging using standard methods. Consequently, researchers have begun applying Deep Neural Networks (DNNs) with powerful feature extraction capabilities, aiming for more accurate and reliable detection processes with implicit feature learning.
As an example, the authors of [17,18] are pioneers in using DNNs to extract deep features from video frames. In [19], Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) models are combined to detect deepfake videos generated using face-swapping techniques. The authors of [20] consider an ensemble of CNNs to detect video face manipulations, while those of [21] introduce multi-head attention and fine-grained classification to detect deepfake videos, showing that the approach is robust to low-quality videos. Liu et al. [22] analyze the frequency domain signal of the deepfake videos and utilize the phase spectrum to obtain more information. Finally, the authors of [23] provide a semantic approach to deepfake detection, making use of a biological signal called Photoplethysmography (PPG), an optical technique that can detect subtle changes in skin color caused by blood flowing in the peripheral circulation of the face.

Audio-Only Deepfake Detection
The rapidly improving quality of synthetic speech generation has drawn increasing interest in speech deepfake detection. In response, the scientific community has proposed numerous speech deepfake detectors that employ different detection approaches and strategies [24]. These can be broadly categorized into two groups based on the aspect they use to perform the detection task. The first group focuses on low-level features, looking for artifacts introduced by the generators at the signal level. The second group focuses on higher-level features representing more complex aspects, such as semantic ones.
An example of an artifact-based approach is presented in [25], where channel pattern noise analysis is used to secure Automatic Speaker Verification (ASV) systems against physical attacks. The authors of [26,27] exploit bicoherence features based on the assumption that a genuine recording has more significant non-linearity than a fake one. Alternatively, the authors of [28] propose an end-to-end network training for extracting deep features from speech, while those of [29] use Mel-Frequency Cepstral Coefficient (MFCC) features and a Support Vector Machine (SVM) classifier. Finally, new approaches to improve the practicality of existing detectors in real-world scenarios are proposed in [30,31].
Detection approaches that rely on semantic features operate under the assumption that, while deepfake generators can synthesize low-level aspects of the signals, they are unable to replicate more intricate high-level features. For instance, [32] exploits classic audio features inherited from the Music Information Retrieval (MIR) community to perform speech deepfake detection. Similarly, the authors of [33] leverage the lack of emotional content in synthetic voices generated via Text-to-Speech (TTS) techniques to recognize them, while [34] combines ASV and prosody features.
Other semantic aspects that can be exploited to perform speech deepfake detection are those related to the speaker identification problem, which refers to automatically identifying the identity of the speaker from a set of recognized voices [35]. At present, the most cutting-edge methods proposed to address this task are based on the use of x-vectors [36]. These are fixed-length features extracted by a DNN trained to discriminate between different speakers and can capture subtle, distinctive attributes of a speaker, such as pronunciation, accent, and speaking style.

Audio-Visual Deepfake Detection
In recent years, there has been an increasing interest in the development of multimodal deepfake detection methods that can simultaneously analyze multiple modalities to achieve accurate and robust results. By analyzing multiple modalities at the same time, a detector can leverage inconsistencies or artifacts across different modalities, enhancing its detection capabilities. For instance, a deepfake video sequence may have realistic facial expressions but unnatural background sounds or mismatched lip movements.
For example, Ref. [37] leverages the incongruity between emotional cues portrayed by audio and visual modalities, while Ref. [11] integrates temporal data from image sequences, audio and video frames. Moreover, the results of [38] show that an ensemble of audio and visual baselines outperforms monomodal counterparts. The authors of [39] replace the standard MFCC features with an embedding of a DNN trained for automatic speech recognition, and then incorporate mouth landmarks. In [40], the authors establish a mapping between audio and video frames by analyzing the changes in the lip opening degree. In [7], the authenticity of a speaker is verified by detecting anomalous correspondences between the speaker's facial movements and the spoken content, while Ref. [41] exploits the inconsistency of lip shape between the audio and video signals.
Although multimodal detectors have shown great effectiveness, these systems are usually data-driven and require a large amount of data to be trained effectively. Unfortunately, the literature lacks challenging datasets that contain both fake video and fake audio, which makes it difficult to train and evaluate the performance of multimodal forensic detectors. In recent years, a few multimodal datasets have been proposed, containing both counterfeit video and audio tracks. These are DFDC [42], FakeAVCeleb [43], and DeepfakeTIMIT [44] with TIMIT-TTS [6]. In the following sections, we provide further details on these datasets and test our proposed multimodal detector on them.

Problem Formulation and Proposed Methodology
In this paper, we consider the problem of multimodal video deepfake detection and investigate whether a multimodal analysis can lead to more robust and reliable predictions than monomodal analyses. Given a video sequence depicting a front-facing person speaking, we aim at determining whether the content is authentic or has been synthetically generated or modified.
We tackle the task by considering a multimodal approach, meaning that we analyze both the person's face and speech to perform the final prediction. In particular, we consider a video as fake when at least one of the visual and audio components is modified, and as real when both are authentic. In the following, we formulate the tackled problem in detail and illustrate the proposed methodology.

Problem Formulation
The problem we address can be formally defined as follows. Let us consider a video sequence under analysis $x_{AV}$. We split it into two components: the time-series $x_V$ representing the temporal evolution of video frames showing the person's face, and the time-series $x_A$ representing the temporal evolution of the audio track capturing the person's speech.
Each of the two tracks $x_V$ and $x_A$ belongs to a class $y_V, y_A \in \{0, 1\}$, where 0 means the signal of that modality is authentic while 1 indicates that it has been synthetically generated or edited. The class $y_{AV}$ of the complete signal $x_{AV}$ is defined as $y_{AV} = y_V \lor y_A$, where $\lor$ is the logical "or" operator, meaning that we consider the complete signal as fake when at least one of its two modalities is fake.
Our goal is to develop a deepfake detector $D$ that estimates the class of the original signal $x_{AV}$. Given the video sequence $x_{AV}$, the detector returns a real score $\hat{y}_{AV} \in [0, 1]$ which indicates the likelihood that $x_{AV}$ is fake.
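As a minimal sketch of the labeling rule defined above, the ground-truth class of a video can be derived from its two monomodal labels with a logical "or". The function name `video_label` is purely illustrative and does not come from the original work.

```python
def video_label(y_v: int, y_a: int) -> int:
    """Return the audio-visual label: 1 (fake) if at least one modality is fake.

    y_v, y_a: monomodal labels, 0 = authentic, 1 = synthetically generated or edited.
    The function name and the input validation are illustrative assumptions.
    """
    if y_v not in (0, 1) or y_a not in (0, 1):
        raise ValueError("labels must be 0 (real) or 1 (fake)")
    return y_v | y_a  # logical "or" on binary labels


# A video is real only when both its visual and audio tracks are authentic.
for y_v in (0, 1):
    for y_a in (0, 1):
        print(f"y_V={y_v}, y_A={y_a} -> y_AV={video_label(y_v, y_a)}")
```

Under this rule, three of the four label combinations are fake, which is worth keeping in mind when balancing training and evaluation sets.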

Proposed Methodology
Our proposed method is composed of two stages, as shown in Figure 1. In the first stage, we leverage state-of-the-art models to extract a collection of features from a subject's facial and speech characteristics. In the second stage, we fuse these features to perform multimodal deepfake detection. In particular, we extract a feature set from some time instants of the input video, obtaining a temporal representation of it. Then, we exploit the temporal properties of the features using time-aware models to perform deepfake detection by fusing the two modalities, increasing the final detection accuracy.
[Figure 1: pipeline of the proposed method, with one Feature Extractor per modality followed by the Deepfake Detector.]

In more detail, we feed the signals $x_V$ and $x_A$ to two feature extractors $F_V$ and $F_A$, tailored to the visual and audio modalities, respectively. The outputs of the two extractors are two sets of feature vectors

$$f_V = F_V(x_V), \qquad f_A = F_A(x_A),$$

where each vector is extracted at one of a set of time instants of the input signal. We develop a deepfake detector $D$ that takes as input the two sets of features $f_V$, $f_A$ and estimates a score $\hat{y}_{AV} \in [0, 1]$ related to the signal $x_{AV}$. We define the estimated score as

$$\hat{y}_{AV} = D(f_V, f_A).$$

We consider different versions of the detector $D$, depending on the strategy we choose to perform the fusion between the two modalities.

Feature Extraction
The feature extractors $F_V$ and $F_A$ we consider to compute the feature sets $f_V$ and $f_A$ are based on two well-established architectures proposed in the literature.
Regarding the visual modality, we exploit the EfficientNetB4 [45] network modified following the implementation proposed in [20], which investigates the ensembling of differently trained CNNs making use of attention layers and siamese training. The authors of that paper use the model ensemble to perform video deepfake detection, while we propose to use it as a feature extractor. To extract features from the video frames, we select the pixel area associated with the face of the person, then we pass the cropped face frames to the model ensemble. We apply the exact implementation proposed in the original paper and therefore refer the reader to it for more information. We adopt this model as it has been shown to have excellent deepfake detection capabilities, which we believe can lead to adequate performance for the proposed multimodal classifier.
For the audio modality, we consider a Time-Delay Neural Network (TDNN) model coupled with statistical pooling to extract x-vector features from the input speech track. To do so, we exploit the pre-trained implementation provided by SpeechBrain [46]. The original task for which the model was proposed is speaker recognition. Here we use it as an embedding extractor, computing a feature vector for each time window of the audio signal under analysis.
It is worth noting that, contrary to $F_V$, $F_A$ is trained for a task different from the one at hand, i.e., deepfake detection. We do so because we want to adopt a semantic approach similar to the one used in [33,34], which has proved very effective for the detection of synthetic speech tracks. We address deepfake detection by analyzing a set of high-level features, specifically related to the speaker's identity, which we assume contain sufficient information to tackle the considered task as well. Our rationale is that synthetic speech generators are very good at replicating low-level aspects of speech but fail to reproduce the most complex ones, such as the speaker's identity. For this reason, we believe that high-level information can be exploited to discriminate between real and fake tracks.
The sizes of the feature sets $f_V$ and $f_A$ are equal to $N \times M_V$ and $N \times M_A$, respectively, where $M_V$ and $M_A$ are the lengths of the feature vectors extracted for each time instant while $N$ is the number of time instants considered. In particular, since we want to provide an audio-visual representation of the input video sequence that is time-consistent between the two modalities, we extract the feature vectors at equally spaced time instants so that $f_V$ and $f_A$ are defined over the same number of time frames $N$.

Deepfake Detection
The second part of the proposed pipeline consists of a binary classifier that takes as input the two feature sets $f_V$ and $f_A$ and returns a real score $\hat{y}_{AV}$ associated with the input signal $x_{AV}$. Since the features are defined as a function of the time instants, we implement the classifier using a time-aware model to exploit as much as possible the temporal correlations between and within the two modalities.
Specifically, we propose three different types of deepfake detectors $D$ which differ in how the fusion between the feature sets $f_V$ and $f_A$ is performed. To better investigate the differences between the considered fusion strategies, we build the detectors $D$ making use of the same inner network structure as a classifier to process the input feature sets. Since we work with different data modalities, we call the generic classifier model $C_m$, where $m \in \{V, A, AV\}$ depending on the modality of the content analyzed, i.e., visual-only, audio-only and audio-visual.
The proposed architecture for $C_m$ consists of a Transformer-based model [47] that leverages the temporal aspect of the features. It comprises an input embedding layer that maps the input features to a hidden dimension, a positional encoding layer, a transformer encoder layer that processes the input sequence, and a fully connected layer that performs the final binary classification. The output layer employs a softmax function to return a probability estimate of whether the analyzed input features are extracted from a fake signal.
The dimensionality of the latent space at the output of the transformer is the same as that of its input. This choice enables the model to better preserve and analyze the information contained within the input sequence. Figure 2 shows the generic architecture of the proposed model $C_m$. The size $M_m$ of the input feature vector varies according to the considered modality $m$.
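The description of the classifier above can be sketched in PyTorch as follows. This is only an illustrative reading of the text, not the authors' implementation: the class names, the hidden size of 128, the number of attention heads, the sinusoidal form of the positional encoding, the temporal average pooling before the output layer, and the 512-dimensional input used in the example are all assumptions that the text does not specify.

```python
import math

import torch
from torch import nn


class SinusoidalPositionalEncoding(nn.Module):
    """Adds fixed sine/cosine position information to a (batch, N, d_model) tensor."""

    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)  # d_model is assumed to be even
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.pe[:, : x.size(1)]


class ModalityClassifier(nn.Module):
    """Hypothetical sketch of C_m: embedding, positions, encoder, dense output."""

    def __init__(self, feature_dim: int, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Linear(feature_dim, d_model)  # M_m -> hidden dimension
        self.pos = SinusoidalPositionalEncoding(d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        # A single encoder layer; its output keeps the input dimensionality.
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(d_model, 2)  # two classes: real, fake

    def embedding(self, features: torch.Tensor) -> torch.Tensor:
        """Return the representation fed to the final fully connected layer."""
        hidden = self.encoder(self.pos(self.embed(features)))  # (batch, N, d_model)
        return hidden.mean(dim=1)  # temporal average pooling (assumption)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        logits = self.head(self.embedding(features))
        return torch.softmax(logits, dim=-1)[:, 1]  # probability of "fake"


if __name__ == "__main__":
    model = ModalityClassifier(feature_dim=512).eval()
    with torch.no_grad():
        scores = model(torch.randn(2, 30, 512))  # 2 windows, N = 30 instants
    print(scores.shape)
```

Exposing the pre-output embedding as a separate method is a design choice made here so that the same module could also serve a fusion scheme that merges the two classifiers before their final dense layers.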
In the following, we list the three fusion strategies we propose in this work. These offer practical approaches for performing multimodal deepfake detection, focusing on efficient implementations and usability in real-world scenarios. The proposed setups can be readily implemented on existing monomodal deepfake detectors or serve as the foundation for building new models, depending on requirements and preferences. For clarity's sake, Figure 3 shows the pipelines of the strategies, called Late Fusion, Mid Fusion and Early Fusion.
[Figure 3: pipelines of the three fusion strategies: Late Fusion, Mid Fusion and Early Fusion.]

Late Fusion. In the Late Fusion strategy, the deepfake detector considers a dedicated classifier for each modality, which we call $C_V$ and $C_A$. We separately train the two classifiers only on visual ($C_V$) and audio ($C_A$) data. In the testing phase, each classifier takes as input the feature set associated with the related modality and returns a score such that

$$\hat{y}_V = C_V(f_V), \qquad \hat{y}_A = C_A(f_A).$$

The final multimodal score assigned to the video sequence is computed by averaging the monomodal ones, $\hat{y}_{AV} = (\hat{y}_V + \hat{y}_A)/2$. We define this detector as $D_{LF}$, being

$$\hat{y}_{AV} = D_{LF}(f_V, f_A) = \frac{C_V(f_V) + C_A(f_A)}{2}.$$

Mid Fusion. Regarding the Mid Fusion strategy, we consider two classifiers $C_V$ and $C_A$ that are still separated for the two modalities, though they are merged in their final dense layers.
In more detail, for each classifier, we extract the feature embedding obtained before the final fully connected layer. We concatenate the embeddings associated with each data modality, ending up with a multimodal embedding vector with size $1 \times (M_V + M_A)$. Then, we provide the computed multimodal embedding as input to a fully-connected layer that returns the final score $\hat{y}_{AV}$. Differently from the Late Fusion strategy, we train the Mid Fusion strategy end-to-end. In this way, the two classifiers update their related parameters considering the contributions of both modalities. We define the Mid Fusion detector as $D_{MF}$, being

$$\hat{y}_{AV} = D_{MF}(f_V, f_A).$$

Early Fusion. In the Early Fusion strategy, we consider a unique classifier $C_{AV}$ that takes as input the concatenation of the two feature sets $f_{AV} = [f_V, f_A]$ and directly returns the score $\hat{y}_{AV}$. The feature vectors of the two modalities are concatenated along the feature dimension, so that the final size of $f_{AV}$ is equal to $N \times M_{AV}$, where

$$M_{AV} = M_V + M_A.$$

The idea behind this fusion strategy is that, when we provide the detector with multimodal information at an early stage, it can exploit the audio-visual correlations better, which may benefit the final detection capabilities. We define the Early Fusion detector as $D_{EF}$, being

$$\hat{y}_{AV} = D_{EF}(f_V, f_A) = C_{AV}(f_{AV}).$$
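To make the data flow of the three strategies concrete, the following minimal sketch shows only the fusion operations themselves, using plain Python lists in place of tensors. The function names are hypothetical, and the classifiers that produce the scores and embeddings are deliberately left out; this illustrates the operations described above and is not the authors' code.

```python
from typing import List, Sequence

FeatureSet = List[List[float]]  # N time instants, each with M feature values


def late_fusion_score(score_v: float, score_a: float) -> float:
    """Late Fusion: average the scores of two separately trained classifiers."""
    return (score_v + score_a) / 2


def mid_fusion_embedding(emb_v: Sequence[float], emb_a: Sequence[float]) -> List[float]:
    """Mid Fusion: concatenate the two pre-output embeddings into one vector.

    The result would then be passed to a shared fully connected layer (not shown).
    """
    return list(emb_v) + list(emb_a)


def early_fusion_features(f_v: FeatureSet, f_a: FeatureSet) -> FeatureSet:
    """Early Fusion: concatenate the feature sets along the feature dimension.

    Both sets must cover the same N time instants; each output row has
    M_V + M_A values.
    """
    if len(f_v) != len(f_a):
        raise ValueError("visual and audio features must share the same N")
    return [list(v) + list(a) for v, a in zip(f_v, f_a)]


# Toy example with N = 2 time instants, M_V = 2 and M_A = 1.
f_v = [[1.0, 2.0], [3.0, 4.0]]
f_a = [[5.0], [6.0]]
print(early_fusion_features(f_v, f_a))
print(late_fusion_score(0.75, 0.25))
```

The length check in `early_fusion_features` reflects the requirement that the two modalities are sampled on the same time grid before they can be concatenated.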

Experimental Setup
In this section we provide the reader with some insights regarding the experimental setup used to assess the performance of the proposed detectors. First, we describe the datasets considered for training and testing all the stages of the systems. Then, we give more details on the processing pipeline, providing the parameters for the extraction of audio and visual features and those for the deepfake detector. Finally, we present the procedure used to train the considered models.

Considered Datasets
As mentioned in Section 2, few multimodal deepfake datasets have been released in the multimedia forensics literature, and they are not enough to perform comprehensive studies by training models on specific sets and testing them on unseen data. This is a significant limitation that restricts the development of new multimodal detectors. In this paper, we try to overcome this problem and show how multimodal analyses can be more robust and reliable even when the considered models are trained on monomodal datasets that are unrelated to each other. Following this approach, we train the proposed detectors on visual-only (i.e., FaceForensics++) or audio-only (i.e., ASVspoof 2019) monomodal deepfake datasets and test them on multimodal audio-video corpora. Here we present in detail all the considered datasets.

Training Datasets
FaceForensics++ [18]. This is a visual-only deepfake dataset containing 5000 videos, obtained from a base set of 1000 real YouTube videos and four different deepfake generation methods. It includes two partitions corresponding to different compression pipelines applied to the videos. In particular, the dataset includes two values of Quantization Parameter (QP), QP = 23 and QP = 40, where higher QP means lower quality.
We use this dataset to train the $C_V$ model, considering the train and validation splits released by the authors. Then, we exploit the test split for a preliminary monomodal evaluation. As for the two QP partitions, we merge them to make the training and evaluation processes more robust.

ASVspoof 2019 [48]. This is a speech audio dataset that contains both real and synthetic tracks generated based on the VCTK corpus [49]. In particular, we consider the Logical Access (LA) partition, which relates to the synthetic speech detection problem. This contains more than 120,000 audio tracks, all at a sampling frequency of $f_s = 16$ kHz. The LA partition is split into three sub-partitions, namely train, dev and eval, which contain authentic signals along with synthetic speech samples generated with various methods. The train and dev partitions have been created using a set of six synthesis algorithms, while eval includes samples generated with thirteen techniques, different from those used in train and dev.
We use the train and dev partitions during the training phase of the $C_A$ model, while we exploit the eval split to test the detector in a monomodal scenario.

Evaluation Datasets
We evaluate the proposed audio-video detectors on multiple state-of-the-art multimodal deepfake datasets. We do so since we want to test their robustness against various types of forgeries and anti-forensic attacks, aiming at replicating real-world evaluation scenarios. In the forensic field, it is crucial for a detector to exhibit reliable and robust predictions even when tested on data that differ from those seen during training. Hence, the ability of a model to generalize across different types of data becomes an important aspect to consider, and by testing it on diverse datasets we can effectively evaluate its performance in these terms. Here we introduce the deepfake datasets we considered in the multimodal evaluation setup.

FakeAVCeleb [43]. This is a multimodal deepfake dataset that contains 500 real videos extracted from the VoxCeleb2 corpus [50], used as a base set to generate around 20,000 deepfake videos through various deepfake generation methods. Deepfake video frames have been generated with Faceswap [51] and FSGAN [52], while the deepfake audio tracks have been synthesized using Real-Time Voice Cloning (RTVC) [53]. Then, Wav2Lip [54] has been applied to synchronize the video frames with the audio.

DFDC [42]. This multimodal deepfake dataset contains nearly 120,000 videos, of which 100,000 are labeled as "Fake" and the rest as "Real". The videos are divided into 50 folders, numbered from 0 to 49, where each subset contains a set of real videos, along with all derivative fake videos. While the videos are largely visual-only fakes, some samples included in folders 45 to 49 contain falsified audio in addition to possible falsified video. Since our goal is to perform multimodal experiments, we consider only the videos within these folders as test dataset, for a total of 12,547 samples.

VidTIMIT [55]. This is a multimodal dataset that includes only real video recordings of 43 people reciting short sentences, considering 10 videos per subject, for a total of 430 videos. It has been widely used for research on topics such as automatic lip reading, multi-view face recognition, multi-modal speech recognition and person identification. The recorded sentences are extracted from the test section of the TIMIT corpus [56].

DeepfakeTIMIT [44]. This is a video deepfake dataset including only fake video samples, generated starting from the VidTIMIT corpus presented above. The forgery process regards only the visual content of the video sequences; specifically, the forged video frames were generated with a Generative Adversarial Network (GAN)-based approach developed from Faceswap [51]. The generated deepfakes belong to 32 subjects and are released in two versions, low quality (LQ) and high quality (HQ), with different frame sizes. This set includes a total of 640 videos with swapped faces (320 for each quality version). In our experiments, we merge the LQ and HQ subsets, considering them as a unique corpus.

TIMIT-TTS [6]. This is a speech dataset including only fake audio samples, generated starting from the VidTIMIT corpus. This dataset contains four partitions, corresponding to different post-processing pipelines applied to audio tracks. Here we consider the Dynamic Time Warping (DTW) subset, which includes almost 20,000 synthetic speech tracks synthesized using twelve different TTS algorithms and then passed through a DTW system to sync them to the reference videos, increasing their realism. This corpus can be used as a standalone synthetic audio dataset or combined with the VidTIMIT and DeepfakeTIMIT sets to perform multimodal research.
In the following experiments, we combine the VidTIMIT, DeepfakeTIMIT and TIMIT-TTS datasets and consider them as a unique multimodal deepfake corpus, which we refer to as TIMIT.

Processing Pipeline

Feature Extraction
The two feature extractors $F_V$ and $F_A$ capture the content of the input video sequence over time. In particular, to capture fine-grained temporal changes, we consider an extraction frequency equal to 10 Hz. Concerning visual information, this is done by selecting 10 evenly spaced frames within each second and extracting a feature from each of them. Concerning speech information, we divide the signal into non-overlapping time windows of 100 ms and extract a feature from each of them. At the end of the feature extraction process, visual and audio features are synchronized and describe information evolving in time at 10 samples per second. Regarding the temporal dimension, we analyze the input signals over a time window $T_W = 3.0$ s. We adopt this window length because, in preliminary experiments, it turned out to be a good compromise between a short window, which is desirable in a real-world scenario, and the performance of the detector.
For both feature extractors, we exploit the pre-trained models released by the authors of the respective papers. In particular, $F_V$ was trained on FaceForensics++, while $F_A$ was trained on the VoxCeleb [57] and VoxCeleb2 [50] datasets, considering audio data sampled at 16 kHz. Finally, at each considered time instant, the number of features extracted from the visual content is equal to $M_V = 1072$, while the number extracted from the audio content is $M_A = 512$. Considering 10 samples per second over a time window of 3.0 s, the final temporal dimension of the features is equal to $N = 30$. Therefore, the size of the visual feature $f_V$ is equal to 30 × 1072, while the size of the audio feature $f_A$ is equal to 30 × 512.
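The sampling grid described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the helper names `frame_indices` and `audio_windows`, the video frame rate `fps`, and the choice of rounding to the nearest frame are assumptions that are not stated in the text.

```python
# Sketch of the 10 Hz sampling grid over a 3.0 s analysis window.
# Helper names, `fps` and the rounding rule are illustrative assumptions.

RATE_HZ = 10          # feature extraction frequency
WINDOW_S = 3.0        # analysis window T_W, in seconds
AUDIO_SR = 16_000     # audio sampling rate expected by the audio extractor
M_V, M_A = 1072, 512  # per-instant feature sizes for video and audio

N = int(RATE_HZ * WINDOW_S)  # temporal dimension shared by both feature sets


def frame_indices(fps: float, start_s: float = 0.0) -> list[int]:
    """Indices of the N evenly spaced video frames (10 per second) in the window."""
    return [int(round((start_s + k / RATE_HZ) * fps)) for k in range(N)]


def audio_windows(start_s: float = 0.0) -> list[tuple[int, int]]:
    """(start, end) sample bounds of the N non-overlapping 100 ms audio windows."""
    hop = AUDIO_SR // RATE_HZ  # 1600 samples, i.e., 100 ms at 16 kHz
    first = int(round(start_s * AUDIO_SR))
    return [(first + k * hop, first + (k + 1) * hop) for k in range(N)]


visual_shape = (N, M_V)  # shape of f_V
audio_shape = (N, M_A)   # shape of f_A
```

Both helpers return exactly N entries, so the k-th visual feature and the k-th audio feature refer to the same 100 ms slot of the window.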

Deepfake Detector
As reported in Section 3, all the considered deepfake detectors share the same architecture $C_m$. The input shape of the networks is equal to $N \times M_m$, where $M_m$ depends on the feature set we are considering, i.e., $m \in \{V, A, AV\}$. All the considered models contain a transformer encoder with a single hidden layer, 8 attention heads, a dropout of 0.1, and GELU as activation function.
Each input feature set is normalized to have zero mean and unit variance, both in training and in test. In the Early Fusion strategy, where the features are concatenated before being fed to the model, the normalization is performed independently for each modality, prior to the concatenation.
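A minimal sketch of the Early Fusion input preparation follows, in plain Python. It assumes that the normalization statistics are computed per feature dimension on a reference set (for instance the training data), which the text does not specify; the names `fit_stats`, `standardize` and `early_fusion_input` are hypothetical. With the feature sizes reported above, the concatenated input would have size 30 × 1584.

```python
from statistics import fmean, pstdev

Matrix = list[list[float]]  # N rows (time instants) x M columns (features)


def fit_stats(rows: Matrix, eps: float = 1e-8) -> tuple[list[float], list[float]]:
    """Per-dimension mean and standard deviation (assumed: estimated on training data)."""
    cols = list(zip(*rows))
    return [fmean(c) for c in cols], [pstdev(c) + eps for c in cols]


def standardize(rows: Matrix, mean: list[float], std: list[float]) -> Matrix:
    """Zero-mean, unit-variance scaling applied column by column."""
    return [[(x - m) / s for x, m, s in zip(row, mean, std)] for row in rows]


def early_fusion_input(f_v: Matrix, f_a: Matrix, stats_v, stats_a) -> Matrix:
    """Normalize each modality on its own, then concatenate along the feature axis."""
    v = standardize(f_v, *stats_v)
    a = standardize(f_a, *stats_a)
    return [row_v + row_a for row_v, row_a in zip(v, a)]  # N x (M_V + M_A)
```

Normalizing before concatenation keeps one modality from dominating the other merely because its features have a larger numeric range.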

Training Strategy
All the hyperparameters used to train the considered models have been selected to maximize the classification accuracy. In particular, we train for up to 150 epochs with an early stopping patience of 15 epochs, using weighted cross-entropy as loss function and the Adam optimizer. We adopt a learning rate equal to $10^{-3}$ and a weight decay of $10^{-4}$, and we reduce the learning rate by a factor of 0.1 when the validation loss reaches a plateau.
During training we balance the classes in order to compensate for the imbalance of the training datasets. In particular, we oversample the tracks of the less represented class, ensuring that each training batch contains the same number of samples from the "Real" and "Fake" classes.
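The interplay between early stopping and learning-rate reduction can be illustrated with a simplified, framework-free sketch. The function `run_schedule` is hypothetical, and the plateau patience `lr_patience` is an assumed value, since the text reports only the reduction factor; a deep learning framework's built-in scheduler may differ in its details.

```python
def run_schedule(val_losses, lr=1e-3, max_epochs=150, stop_patience=15,
                 lr_patience=5, lr_factor=0.1):
    """Replay a sequence of validation losses and return (epochs_run, final_lr).

    lr_patience is a hypothetical value: the text does not state it.
    """
    best = float("inf")
    since_best = 0     # epochs since the best validation loss (early stopping)
    since_change = 0   # epochs since the last improvement or learning-rate cut
    epochs_run = 0
    for loss in val_losses[:max_epochs]:
        epochs_run += 1
        if loss < best:
            best, since_best, since_change = loss, 0, 0
            continue
        since_best += 1
        since_change += 1
        if since_best >= stop_patience:
            break                      # early stopping
        if since_change >= lr_patience:
            lr *= lr_factor            # reduce the learning rate on plateau
            since_change = 0
    return epochs_run, lr
```

Under these assumptions, a run whose validation loss stops improving sees its learning rate cut twice before early stopping ends training 15 epochs after the best epoch.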
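The class-balanced batching described above can be sketched as follows. This is one possible realization under stated assumptions: the generator name `balanced_batches` is hypothetical, the minority class is oversampled by drawing with replacement, and an incomplete final batch is dropped.

```python
import random


def balanced_batches(real, fake, batch_size, seed=0):
    """Yield batches holding batch_size // 2 "Real" and batch_size // 2 "Fake" items.

    The smaller class is oversampled (drawn with replacement) so that it matches
    the larger one; an incomplete final batch is dropped.
    """
    if batch_size % 2 != 0:
        raise ValueError("batch_size must be even to split it in half")
    rng = random.Random(seed)
    half = batch_size // 2
    major, minor = (real, fake) if len(real) >= len(fake) else (fake, real)
    major_shuffled = rng.sample(major, len(major))             # shuffled copy
    minor_oversampled = [rng.choice(minor) for _ in major]     # with replacement
    for i in range(0, len(major_shuffled) - half + 1, half):
        batch = major_shuffled[i:i + half] + minor_oversampled[i:i + half]
        rng.shuffle(batch)
        yield batch
```

For example, with 4 real and 20 fake tracks and a batch size of 8, each of the 5 resulting batches contains 4 real and 4 fake tracks, with real tracks repeated across batches.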

Results
In this section we analyze and discuss the results achieved by the proposed techniques for multimodal deepfake detection.

Evaluation Metrics
We evaluate the performance of the considered detectors using Receiver Operating Characteristic (ROC) curves and confusion matrices, considering as evaluation metrics the Area Under the Curve (AUC) and the Balanced Accuracy (BA). In general, we evaluate the BA as a function of the threshold $t$ applied to the likelihood score returned by the detector to estimate the class of the query video sequence (i.e., "Real" or "Fake"). If the likelihood exceeds the threshold, the sequence is classified as "Fake", otherwise it is classified as "Real". We define the BA at threshold $t$ as

$$\mathrm{BA}_t = \frac{\mathrm{TPR}_t + \mathrm{TNR}_t}{2},$$

where $\mathrm{TPR}_t$ and $\mathrm{TNR}_t$ are the True Positive Rate (TPR) and True Negative Rate (TNR) of the binary decision problem at the fixed threshold $t$, respectively. The best performance is achieved when both AUC and BA approach 1. In all the considered investigations, we apply a standard threshold $t = 0.5$ to the output likelihood, obtaining $\mathrm{BA}_{0.5}$ as the evaluation metric. Nonetheless, we show that there are a few scenarios where better results can be achieved by suitably modifying this value.
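As a small worked example, the balanced accuracy at a threshold t, taken as the mean of TPR and TNR, can be computed as in the sketch below. The label convention (1 for "Fake", 0 for "Real") and the helper `best_threshold`, which searches only among the observed scores, are illustrative assumptions.

```python
def balanced_accuracy(scores, labels, t=0.5):
    """BA_t = (TPR_t + TNR_t) / 2, with label 1 = "Fake" and label 0 = "Real"."""
    tp = sum(s > t and y == 1 for s, y in zip(scores, labels))   # fakes above t
    tn = sum(s <= t and y == 0 for s, y in zip(scores, labels))  # reals at or below t
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


def best_threshold(scores, labels):
    """Threshold among the observed scores that maximizes BA (illustrative helper)."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: balanced_accuracy(scores, labels, t))


scores = [0.1, 0.4, 0.45, 0.8]   # toy likelihoods of being "Fake"
labels = [0, 0, 1, 1]
ba_default = balanced_accuracy(scores, labels)                          # 0.75
ba_tuned = balanced_accuracy(scores, labels, best_threshold(scores, labels))  # 1.0
```

In this toy case one fake sample scores 0.45, just below the default threshold, so moving the threshold to 0.4 raises the BA from 0.75 to 1.0. This is the kind of situation in which a threshold other than 0.5 helps.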

Monomodal Results
As a preliminary experiment, we test the effectiveness of the monomodal detectors in their respective domains. The reason behind this choice is that good visual and audio classifiers are essential for building an effective multimodal detector. Our proposal focuses on fusion strategies designed for merging monomodal deepfake detectors. As a result, the performance of the fused model is directly influenced by that of the starting detectors. If the monomodal detectors did not work properly, it would be necessary to improve them before fusing them in the multimodal investigations. Therefore, we use the monomodal scores defined in (3) to evaluate performance on the test partitions of the monomodal datasets (i.e., FaceForensics++ for visual data and ASVspoof 2019 for audio data).
Figure 4 shows the results of this preliminary analysis. The two classifiers show excellent detection performance, with an AUC of 0.91 for $D_V$ and an AUC of 0.96 for $D_A$, along with $\mathrm{BA}_{0.5}$ of 0.83 and 0.90, respectively. These results are consistent with those of many cutting-edge detectors reported in the literature [20,34], indicating that the proposed monomodal classifiers are suitable for the subsequent multimodal experiments.

Multimodal Results
In each of the following multimodal experiments, we evaluate the proposed detectors only on datasets different from the monomodal datasets used to train the classifiers. Performing cross-dataset tests represents a challenging scenario that resembles "in-the-wild" conditions and allows us to evaluate the robustness of the proposed strategies against different forgeries and anti-forensic attacks. We are also aware that training on monomodal datasets could impact the performance achieved on multimodal ones. A notable limitation of this approach is that the detectors are unable to leverage all the inter-modality relationships within the content, since these relationships are not accessible during training. For this reason, the proposed system is unable to detect synthetic content that appears realistic in each individual modality but lacks synchronization between audio and video, even though simpler detectors trained explicitly for this purpose could easily spot such inconsistencies. Still, we want to investigate whether a modality fusion process can improve the detection capabilities even though the data seen in training are "partial".

Best Fusion Strategy Selection
As a first experiment, we examine the fusion strategies introduced in Section 3.2.2 and compare their respective performance, investigating which one leads to more robust predictions. For this test we evaluate the detectors only on multimodal deepfakes that share the same class between video and audio (i.e., both are either real or fake), excluding videos where only one of the modalities is edited (e.g., fake video and real audio or vice versa).
Figure 5 shows the ROC results of this analysis, broken down by test dataset. On average, Early Fusion is the most effective fusion strategy, achieving AUCs greater than or equal to 0.90 for two datasets out of three and being the best-performing strategy on the remaining dataset. In fact, Early Fusion can exceed the other fusion strategies by 7% and 10% on the FakeAVCeleb and DFDC datasets, respectively, while being competitive on the TIMIT dataset. We believe this technique enables the detector to analyze in depth the relationships both between and within the modalities, thereby enhancing the robustness of its predictions. We observe that the AUC values vary significantly depending on the test dataset under analysis, reaching poor values in the case of the DFDC set. This is likely due to differences between the characteristics of the training and test data, which can adversely impact the detector predictions.
One further approach we could consider is the recursive application of a Late Fusion strategy, fusing the scores obtained from the three proposed methods by averaging them. The results achieved using this strategy are AUC = 0.88 for FakeAVCeleb, AUC = 0.96 for TIMIT, and AUC = 0.78 for DFDC. While we acknowledge that on certain datasets this approach improves the results reported before, we believe that it brings limited novelty to the analysis. First, it relies on a fusion strategy that has already been explored. Additionally, from a computational perspective, this strategy may not be practical, as it needs three different models to obtain a score. This can introduce unnecessary computational overhead without significantly enhancing the overall performance. For these reasons, we decided not to consider this approach in the following analyses.
To further deepen our investigation, we compute the confusion matrices to evaluate the performance of the detectors $D_{LF}$, $D_{MF}$ and $D_{EF}$ on the three considered multimodal deepfake datasets. Results are depicted in Figure 6 ($D_{LF}$), Figure 7 ($D_{MF}$) and Figure 8 ($D_{EF}$). In all cases, we apply a standard fixed threshold $t = 0.5$ to the estimated likelihood associated with each video sequence. The $\mathrm{BA}_{0.5}$ values reinforce the results observed with the ROC curves, with Early Fusion again proving to be the best fusion strategy. Since Early Fusion is the best-performing strategy among the three proposed ones, we use it for all the remaining evaluations.
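For concreteness, the recursive Late Fusion mentioned above, which averages the scores of the three fusion detectors, can be sketched as follows. The function name `average_fusion` and the use of equal weights are assumptions; the text only states that the scores are averaged.

```python
def average_fusion(scores_lf, scores_mf, scores_ef):
    """Average, video by video, the likelihoods returned by the three fusion detectors."""
    return [(lf + mf + ef) / 3.0
            for lf, mf, ef in zip(scores_lf, scores_mf, scores_ef)]


# Toy likelihoods for two videos, one list per detector (Late, Mid, Early Fusion).
fused_scores = average_fusion([0.9, 0.2], [0.6, 0.1], [0.3, 0.3])
```

The sketch also makes the computational cost explicit: all three detectors must be run on every video before a single fused score is available.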

Multimodal vs. Monomodal Detection
We now compare the performance of the developed Early Fusion multimodal detector with that of the corresponding monomodal models, which serve as a baseline for our study. The purpose is to assess whether a multimodal analysis is more robust and reliable than a monomodal one, and to quantify the improvement it brings. We recall that our multimodal models are trained solely on monomodal data, so they do not require any additional training datasets. As done in the previous experiment, we only analyze deepfakes in which both the video and audio signals belong to the same class and exclude samples where only one modality is manipulated. This is done because monomodal detectors, by nature, cannot detect these types of forgeries.
Figure 9 shows the ROC results broken down for each test dataset, while Table 2 compares the AUC, BA and $\mathrm{BA}_{\text{best } t}$ values for the three methods. The multimodal approach consistently outperforms the monomodal detectors, supporting the considerations made in our investigations. As a last experiment, we expand the analysis to include also deepfakes with mixed class labels (i.e., real video frames and fake audio or vice versa). In doing so, we evaluate

Conclusions and Future Works
In this paper we presented a novel approach for detecting multimodal deepfake videos by combining visual and audio information. The proposed method determines the authenticity of an input video sequence by combining data-driven features extracted from the visual content with speaker-identity features from the audio stream. We evaluated several training and test methods, and various modality fusion strategies. The results indicate that robust predictions are achieved when an Early Fusion approach is considered.
The peculiarity of the proposed detector is that its training phase does not take place on multimodal deepfake data but on monomodal deepfake samples (i.e., samples that contain either modified video frames only or modified audio only), thus not requiring additional multimodal training data. Despite this "partial" training strategy, the model is able to outperform detectors trained only on monomodal data, underlining the benefit of a multimodal approach.
In future studies we want to experiment with new methods of fusion between modalities, such as "informed" fusion methods, in which the contribution of each modality is weighted according to its relevance to the accuracy of the final prediction.
Figure 2. Architecture of the classifier $C_m$.

Figure 3. Different fusion strategies considered to perform multimodal deepfake detection.

Figure 4. Evaluation of the considered detectors on monomodal datasets.

Figure 5. Evaluation of the considered detectors on multimodal datasets considering different fusion strategies.

Figure 9. Evaluation of the considered detectors on multimodal datasets, comparing monomodal (i.e., visual-only or audio-only) and multimodal approaches.

Figure 10. Evaluation of the $D_{EF}$ detector (Early Fusion) on mixed classes (real audio and fake video, and vice versa). The case where both video and audio are fake is excluded.

Table 1. AUC and BA values obtained testing the proposed detectors considering different fusion strategies and different thresholds $t$.

Table 2. AUC and BA values at different thresholds $t$, obtained testing the proposed detectors on multimodal datasets, comparing monomodal (i.e., visual-only or audio-only) detectors against the $D_{EF}$ (Early Fusion) detector.