1. Introduction
Face-recognition (FR) enables users to simply look at the camera and be authenticated. This seamless functionality, however, is reliant upon the authentication being secure. In principle, this is the case. Advances in deep learning have brought precise identification algorithms, capable of discerning one person from over 50,000 [1]. Protection against imposters is now also an industry standard [2]; most applications have the capability to discern live people from pictures, videos and simple masks [1]. A new problem, however, is the photo-realistic face-swap-attack (FSA). Through generative adversarial networks, attackers can imperceptibly present any face desired [3].
To appreciate the risk of FSAs, consider the FR threat model in Figure 1. Historically, the primary FR imposter-attack is presenting a facsimile (scenario 2). To address this, many FR applications now include liveness-verification technologies. These can include 3D sensing [4], analyzing physiological motions (blink [5], respiration [6] and pulse [7]) and now spatio-temporal fusion [8]. While still an on-going issue, there is general defense against spoofing attacks. Swapping attacks (scenario 3), however, have the potential to bypass these methods. State-of-the-art algorithms imperceptibly align and blend the swapped face, retaining fundamental liveness features. This means an FSA can be an extremely potent means to gain access to phones [9], ATMs [10], buildings [11] and even vehicles [12].
FSAs also provide the benefit of discretion. If an imposter tries to spoof an ATM or building, bystanders could clearly notice someone wearing a mask. They can attempt the spoofing attack when others are not around, but this clearly limits their opportunistic window. Conversely, that window could instead be used to place an interception device, from which they can discreetly authenticate at any future point in time.
Lastly, FSAs are a unique way to deny service (scenario 4). Simplistic service denial would be to imperceptibly alter faces to prevent authentication. A malicious use case would be to avoid surveillance, such as electronic border control. Regardless of how well camera arrays are planned out, using deep learning to present another person (or remove their presence) would go undetected today.
These problems are exacerbated by the ease of success. Many FR systems only use message authentication (and not encryption) to reduce overhead [13]. This inherently makes the data vulnerable to tampering. Furthermore, quality face-swapping tools are ubiquitous. Social media platforms include applications to photo-realistically swap faces and backgrounds [14,15]. This lack of encryption, combined with accessible tools, means any hacker with an interception device can successfully apply FSAs.
Hence, the purpose of this research is to develop an efficient authenticity-verification framework for real-time FSA mitigation. A noise-verification framework is presented to detect changes in sensor noise caused by swapping (in this case photo-response non-uniformity, PRNU) [16]. PRNU is historically known to be useful for tampering analysis via anomaly detection methods [16,17,18]. This framework instead proposes verification against a compressed enrollment for speed. Given most FR systems use a known sensor, it can be securely characterized for authentic PRNU. Each following frame is then compressed and analyzed for authenticity using peak correlation energy deviation. Local sensitivity is provided through the use of sub-zones.
Face-Swap-Attack Contributions
This research presents a novel compressed noise-verification framework to mitigate FSAs. The source camera is characterized for PRNU, where subsequent frames are verified for authenticity. Face-swap-attacks are mitigated by analyzing only the zone containing the face centroid; service-denial-attacks require full image analysis. Benchmarking against three open-source algorithms shows the proposed framework is both robust and significantly more efficient than existing solutions. In summary, this research presents the following contributions:
Extremely efficient face-crop verification (93.5% accuracy, 4.6 ms on CPU)
Robust full-image verification (100% accuracy, 106.9 ms on CPU)
2. Related Works
Detecting face-swap-attacks (FSAs) starts with traditional image tampering detection. In his Image Forgery Detection survey, Dr. Hany Farid presented how images are commonly tampered, along with corresponding detection methods [19]. Images can be tampered for various reasons; people have been known to present fake imagery to forge alibis [20], steal identities [21] or create compromising material [22]. This can be accomplished using a variety of hand-crafted tools, such as cloning (duplicating parts of the image to conceal or embed information), re-sampling (adapting the resolution to modify scene proportions) and splicing (combining multiple images together to create a new scene) [19].
2.1. Forensics-Based Tampering Detection
To detect these image attacks, a forensics approach can be taken using noise artifacts.
Figure 2 shows some of the fundamental components of a camera, where manufacturing tolerances make it nearly impossible to create a perfectly noiseless component. These artifacts can be quantified into a robust “noise fingerprint” or “noiseprint” [19].
One particularly famous “noiseprint” is photo-response non-uniformity (PRNU). This is an estimation of photo receiver noise with respect to a constant uniform light [19], as shown in Figure 3. PRNU is known to be unique across cameras, as validated in the large-scale source identification from Goljan et al. [23]. This uniqueness means that deviations in PRNU can be used to detect tampering. Traditional tampering (e.g., cut-and-splice) has been successfully detected with PRNU anomaly detection techniques [16,17] as well as deep learning networks trained on PRNU [24,25]. The logical question is whether noise analysis can identify deep-learning-based tampering (e.g., deepfakes).
2.2. Deep-Learning Tampering Detection
Deepfake detectors often focus on fake videos. Videos incorporate temporal information, whereby it is inherently harder to replicate realistic movements over extended periods. Guera et al. demonstrated very strong performance, hitting 96% accuracy on test videos when using 20 frames [26]. Others have similarly pursued multi-frame analysis. Sabir et al. demonstrated that strong performance can be achieved with fewer frames if aligning the faces [27]; however, Tariq et al. noted large-scale robustness over varied perspectives does benefit from increasing the frame count (in this case 16) [28]. This approach seems promising, but is inherently not ideal for real-time applications due to the latency in acquiring numerous frames.
To avoid this latency, researchers postulate that face-swap generative adversarial networks (GANs) leave signature traces. Rossler et al., developers of the famous FaceForensics++ data-set [29], demonstrated performance across a slew of convolutional neural networks. Their network, XceptionNet, performs very well (99% authenticity verification correctness on raw images, 81% on compressed) [29]. This has inspired numerous other high-performing models [30], where some introduce multi-stream fusion [31], multi-task learning [32] and optical flow [33].
2.3. Literature Opportunity
All of these approaches, however, have a fundamental constraint. Each detector must be trained on the specific face-swap GAN. When new GANs are developed, the detector must be re-trained on their signature traces or risk being vulnerable. The ultimate goal is to use a feature that is invariant to the tampering method and can be calculated in real time. This brings a return to fundamental camera forensics. Concurrent to this research, researchers are demonstrating that face-swaps similarly present anomalies in PRNU. Lugstein et al. published that PRNU spectral trace analysis is a reliable feature [18]. They achieved between 95% and 99% in authenticity verification across a multitude of data-sets [18]. Mohanty also corroborated this methodology by presenting a high-level framework for verifying PRNU against enrolled templates [34].
This research provides novelty over that of Lugstein and Mohanty by incorporating compressed zonal analysis into the PRNU verification. The introduction of compression is important for run-time latency; this algorithm needs to be extremely efficient or it will not be deployed on any of the real-time applications from the introduction. The compression, however, inherently reduces feature precision. The zonal analysis provides local sensitivity to retain robustness. Lastly, it is key to note that both papers were published concurrently with this research; no prior knowledge of their methods is used here.
In summary, this framework contributes through state-of-the-art efficiency. PRNU analysis is clearly becoming a respected method for determining deep-network manipulations. This research extends the field by introducing compressed verification with zonal analysis. This enables best-in-class performance in a fraction of the run-time.
3. Materials and Methods
Image tampering is broadly defined as any digital manipulation of the original source. The fundamental hypothesis is that any image modification inherently changes the underlying “noiseprint” (e.g., PRNU); hence, this methodology uses deviation in the expected “noiseprint” to detect and localize tampered segment(s).
3.1. Photo-Response Non-Uniformity Estimation
The PRNU calculation employs the methodology presented by Goljan et al. [23]. In general, an image can be described as the sum of the incident light received by the camera, artifacts introduced by camera intrinsics and temporal noise. This is expressed in Equation (1), where the incident light is represented by $I^{(0)}$, the camera noise (PRNU) is represented by $K$ and other noises (quantization, shot, dark current, temporal, etc.) are represented by $\Theta$:

$$I = I^{(0)} + I^{(0)}K + \Theta \quad (1)$$
To isolate the “noiseprint”, the image must first be filtered. This can be accomplished by applying a Wiener filter, $F$, to generate residuals $W = I - F(I)$ [35]. Note that dark current is assumed to be negligible due to having sufficient scene signal. From there, $K$ can be isolated by applying a maximum likelihood estimator over a stack of $N$ images [35]:

$$\hat{K} = \frac{\sum_{i=1}^{N} W_i I_i}{\sum_{i=1}^{N} \left(I_i\right)^2} \quad (2)$$
3.2. Peak Correlation Energy
One way to classify the camera noise source is to utilize peak correlation energy (PCE). The PCE value is computed using the MATLAB code provided by Goljan et al. [23]. This approach calculates the Pearson correlation coefficient and then identifies the maximal value for a given sliding window [23]. This is achieved by first computing the noise residuals of the hypothesis camera, $X$, and the image, $Y$, as a sum of PRNU, $K$, and secondary noises, $\Xi$:

$$W_X = X K_X + \Xi_X, \qquad W_Y = Y K_Y + \Xi_Y \quad (3)$$

The correlation is computed over shifted areas, where the maximal number of shifts is defined as the product of the differences in image dimensions $(m_Y, n_Y)$ versus fingerprint dimensions $(m_X, n_X)$ [23], that is:

$$\rho(s_1, s_2) = \mathrm{corr}\left(W_X[i, j],\; W_Y[i + s_1, j + s_2]\right) \quad (4)$$

By definition, this implies that $\rho_{peak} = \rho(s_{peak})$ is the correlation value for which the peak occurred for shift vector $s_{peak}$. In [23], Goljan et al. suggested having the local peak area, $\mathcal{N}$, be a small square pixel grid around $s_{peak}$. This is given in Equation (5):

$$\mathrm{PCE} = \frac{\rho_{peak}^{2}}{\frac{1}{mn - |\mathcal{N}|} \sum_{s \notin \mathcal{N}} \rho(s)^{2}} \quad (5)$$
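As a hedged illustration of Equation (5), the following NumPy sketch computes the cross-correlation surface over all cyclic shifts via the FFT and divides the squared peak by the mean squared correlation outside a small neighborhood around it. The 11 × 11 neighborhood, the circular-shift simplification and all names are assumptions for illustration, not the Goljan et al. MATLAB code.

```python
import numpy as np

def pce(residual, fingerprint, peak_window=11):
    """Peak correlation energy between a noise residual and a camera fingerprint.

    Both inputs are 2-D arrays of equal size. The correlation over all cyclic
    shifts is computed in the frequency domain; PCE is the squared peak divided
    by the average squared correlation outside the peak neighborhood N.
    """
    a = residual - residual.mean()
    b = fingerprint - fingerprint.mean()
    corr = np.real(np.fft.ifft2(np.fft.fft2(a) * np.conj(np.fft.fft2(b))))
    peak_r, peak_c = np.unravel_index(np.argmax(corr), corr.shape)
    peak_val = corr[peak_r, peak_c]

    # Exclude a (peak_window x peak_window) grid around the peak (with wrap-around).
    half = peak_window // 2
    rows = np.arange(peak_r - half, peak_r + half + 1) % corr.shape[0]
    cols = np.arange(peak_c - half, peak_c + half + 1) % corr.shape[1]
    mask = np.ones(corr.shape, dtype=bool)
    mask[np.ix_(rows, cols)] = False

    return (peak_val ** 2) / np.mean(corr[mask] ** 2)
```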
3.3. Tampering Score: Face-Swap Verification
Detected faces can be directly verified for tampering. This is accomplished by identifying the zone containing the face centroid, $z$, and applying a tampering score on it. The tampering score is generated by applying a tampering filter on the correlation energy of that zone, $PCE_z$. The high-pass cutoff, $\tau_{high}$, is calibrated to ignore standard noise as tampering; the low-pass cutoff, $\tau_{low}$, is calibrated to ignore images that do not match the source. This calibration is accomplished on a per-camera basis; in this case, a 1% false acceptance rate is prioritized. This face-zone-verification (FZV) score is described in Equation (6):

$$FZV(z) = \begin{cases} \text{authentic}, & PCE_z \geq \tau_{high} \\ \text{tampered}, & \tau_{low} < PCE_z < \tau_{high} \\ \text{different source}, & PCE_z \leq \tau_{low} \end{cases} \quad (6)$$
The FZV score offers a significant run-time improvement at the cost of some robustness (as only the face-zone pixels are evaluated). In theory, this can be expanded to a nearest-neighbor approach, also evaluating all zones directly adjacent to the centroid zone $z$.
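A minimal sketch of the face-zone-verification decision is given below, assuming the per-camera cutoffs tau_low and tau_high have already been calibrated (e.g., to the 1% false acceptance rate mentioned above) and that the zone's PCE has been computed with a routine such as the pce() sketch in Section 3.2. The label names and the helper mapping a centroid to its zone index are illustrative, not the authors' MATLAB code.

```python
def face_zone_index(centroid_xy, image_shape, zones_per_axis):
    """Map a face centroid (x, y) to the index of the sub-zone containing it."""
    x, y = centroid_xy
    zone_h = image_shape[0] / zones_per_axis   # image_shape = (rows, cols)
    zone_w = image_shape[1] / zones_per_axis
    return int(y // zone_h) * zones_per_axis + int(x // zone_w)

def fzv_label(pce_zone, tau_low, tau_high):
    """Band-pass tampering filter on the face-centroid zone's PCE.

    High PCE   -> fingerprint matches the enrolled sensor (authentic).
    Low PCE    -> fingerprint absent altogether (different source).
    In between -> fingerprint locally weakened, flagged as tampered.
    """
    if pce_zone >= tau_high:
        return "authentic"
    if pce_zone <= tau_low:
        return "different source"
    return "tampered"
```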
3.4. Tampering Score: Service-Denial Verification
This tampering score can be used to verify full image integrity in a zonal-expected-value (ZEV) fashion to mitigate face removal. The tampering filter can be applied on each zone, designated $PCE_z$, where a final filter is applied on the averaged output, $\overline{PCE}$, given in Equation (7):

$$\overline{PCE} = \frac{1}{Z} \sum_{z=1}^{Z} PCE_z \quad (7)$$

Each zone filter can be individually calibrated, though for pragmatism uniform cutoffs are assumed. The complete filtered score is described in Equation (8):

$$ZEV = \begin{cases} \text{authentic}, & \overline{PCE} \geq \tau_{high} \\ \text{tampered}, & \tau_{low} < \overline{PCE} < \tau_{high} \\ \text{different source}, & \overline{PCE} \leq \tau_{low} \end{cases} \quad (8)$$
If, on average, the image appears tampered across zones, it is then labeled as tampered. Otherwise, it is labeled authentic or different source, based upon whether the averaged score exceeds the high-pass cutoff or falls below the low-pass cutoff, respectively.
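A hedged sketch of this decision, consistent with Equations (7) and (8), is shown below: split the residual and fingerprint into a grid of sub-zones, compute a per-zone PCE (reusing the pce() sketch from Section 3.2), and apply the same band-pass cutoffs to the zone average. The grid-splitting helper and uniform cutoffs are assumptions drawn from the text.

```python
import numpy as np

def split_zones(array, zones_per_axis):
    """Split a 2-D array into zones_per_axis ** 2 sub-zones (row-major order)."""
    rows = np.array_split(array, zones_per_axis, axis=0)
    return [zone for band in rows
            for zone in np.array_split(band, zones_per_axis, axis=1)]

def zev_label(residual, fingerprint, zones_per_axis, tau_low, tau_high):
    """Zonal-expected-value label for full-image (service-denial) verification."""
    zone_scores = [pce(rz, fz)   # pce() from the Section 3.2 sketch
                   for rz, fz in zip(split_zones(residual, zones_per_axis),
                                     split_zones(fingerprint, zones_per_axis))]
    mean_score = float(np.mean(zone_scores))   # averaged output across zones
    if mean_score >= tau_high:
        return "authentic"
    if mean_score <= tau_low:
        return "different source"
    return "tampered"
```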
3.5. Compression via Down-Sampling
Compression is introduced in the form of down-sampling to optimize run-time performance. Down-sampling is chosen because it is computationally cheap and implicitly behaves as an averaging filter when combining neighboring pixels. For reference, the enrollment image is extracted for PRNU at full scale and then down-sampled to maximally retain features; challenge images, however, are down-sampled first then extracted for PRNU to minimize run-time.
All images are evaluated using three compression settings. These are enumerated below, with the axis sampling rate provided in the form of (row, column):
Full-Scale Resolution (1 × 1)
Quarter-Scale Resolution (1/2 × 1/2)
Sixteenth-Scale Resolution (1/4 × 1/4)
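The asymmetry between enrollment and challenge processing can be summarized in a short sketch; scipy.ndimage.zoom is used here as an illustrative down-sampler, prnu_estimate() refers back to the Section 3.1 sketch, and the scale factors mirror the list above.

```python
import numpy as np
from scipy.ndimage import zoom
from scipy.signal import wiener

# Per-axis sampling rates evaluated in this work.
SCALES = {"full": 1.0, "quarter": 0.5, "sixteenth": 0.25}

def enroll_fingerprint(images, scale):
    """Enrollment: estimate PRNU at full resolution, then down-sample the
    fingerprint so that it maximally retains the sensor features."""
    fingerprint = prnu_estimate(images)                  # Section 3.1 sketch
    return zoom(fingerprint, SCALES[scale], order=1)

def challenge_residual(image, scale):
    """Challenge: down-sample first, then extract the noise residual, so that
    the expensive filtering runs on far fewer pixels."""
    small = zoom(image.astype(np.float64), SCALES[scale], order=1)
    return small - wiener(small, mysize=3)               # W = I - F(I)
```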
4. Experimental Setup
The proposed framework is evaluated on a constructed data-set. This route is chosen in lieu of using a public data-set because the verification methodology requires knowledge of the source camera, with both authentic and tampered images. Hence, facial images are acquired using 3 Ras Pi cameras. This maintains same scene with same imaging sensor. This approach is used to minimize differences across images, e.g., when image regions are swapped between cameras, it should appear visibly identical. Note these experiments are for facial image authenticity verification, but for completeness a sensitivity analysis of compressed PRNU in source identification is also provided in the appendix (
Appendix A). In all cases, run-time is evaluated to demonstrate real-time applicability on a Dell Latitude E5570 laptop (single CPU core only). For simplicity, it is assumed that the image is already read into memory where only the PRNU calculation and the camera classification time are measured.
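As described, only the PRNU extraction and classification are timed (the image is assumed to be in memory). A minimal timing sketch is given below; the cutoff values and the helpers it calls (challenge_residual() and zev_label() from the earlier sketches) are illustrative, and absolute numbers will differ from the paper's MATLAB measurements.

```python
import time

def timed_verification(frame, fingerprint, scale="quarter", zones_per_axis=4):
    """Time only the residual extraction and the tampering classification."""
    start = time.perf_counter()
    residual = challenge_residual(frame, scale)          # Section 3.5 sketch
    label = zev_label(residual, fingerprint, zones_per_axis,
                      tau_low=1.0, tau_high=50.0)        # illustrative cutoffs only
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return label, elapsed_ms
```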
4.1. Exp 1: Direct Face-Swap Verification
Exp 1 evaluates how verification can be optimized for real-time face recognition by applying face-zone-verification (FZV). That is to say, in the images containing a face, the tampering score is only calculated on the zone containing the face’s centroid.
This experiment augments the Ras Pi Camera data-set by utilizing the same cameras to acquire 50 photos containing face-swap volunteers (per camera). Swaps are conducted in both manual and artificial intelligence fashions. The manual swap is conducted by identifying the boundary of the detected face in both images and swapping them (resizing the face as appropriate), as sketched below. The artificial intelligence approach employs a well-known online tool, Reflect [36], to perform a landmark alignment and merge the faces together.
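For illustration, the manual swap can be approximated with a few lines of OpenCV: detect a face bounding box in each image, resize one crop to the other's size and paste it over. This is a hedged approximation of the described procedure, not the exact tooling used for the data-set.

```python
import cv2

def manual_face_swap(img_src, img_dst):
    """Paste the first detected face of img_src over the first face of img_dst."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def first_face(img):
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        return faces[0]                               # (x, y, w, h); assumes a detection

    xs, ys, ws, hs = first_face(img_src)
    xd, yd, wd, hd = first_face(img_dst)
    face = cv2.resize(img_src[ys:ys + hs, xs:xs + ws], (wd, hd))
    out = img_dst.copy()
    out[yd:yd + hd, xd:xd + wd] = face                # boundary swap, resized to fit
    return out
```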
Face-swaps are conducted using faces from both the same camera and different cameras. This results in 20 swaps per swap use case, sampling rate and zone, for a total of 1440 tampered images. Figure 4 shows an example of how this is performed using two test subjects. For reference, the face-swap regions range from … to … of the image width. This data-set is designated Ras Pi—Face Swap.
The FZV tampering score is then calculated for the zone containing the centroid of the detected face in each image. Security is also evaluated from an end-to-end perspective, whereby a simple face recognizer is leveraged to validate the overall probability of spoofing. The face recognizer uses histogram of oriented gradients (HOG) features with a multi-layer perceptron (10 hidden nodes) to represent constrained, real-time applications; a rough sketch follows.
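Such a constrained recognizer can be assembled from scikit-image and scikit-learn; the HOG parameters and training calls below are illustrative defaults rather than the exact configuration used in the experiments.

```python
import numpy as np
from skimage.feature import hog
from sklearn.neural_network import MLPClassifier

def hog_features(gray_face_crops):
    """HOG descriptor per grayscale face crop (crops resized to a common shape)."""
    return np.array([hog(crop, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for crop in gray_face_crops])

# Lightweight recognizer: a single hidden layer of 10 nodes, as in the text.
recognizer = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500)
# recognizer.fit(hog_features(train_crops), train_labels)
# identities = recognizer.predict(hog_features(test_crops))
```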
4.2. Exp 2: Simulated Service-Denial Verification
After verifying that compressed PRNU can sufficiently verify face integrity, Exp 2 evaluates a simulation of service denial, where faces are removed (i.e., there is no face to detect). Worst-case analysis is conducted by tampering images from the Ras Pi Camera data-set (i.e., all cameras of the same model, where images are of matching scenes).
Face removal across the image is simulated by randomly swapping blobs using matching scenes across images. That is to say, a blob of matching background is placed, mimicking the effect of removing faces. The swap shape chosen is a circle to mimic the shape of a face. A swap example is shown in Figure 5; note that the red outlines shown only identify the swap region and are not present in the actual tampered images.
Swaps are performed in the image center, top left quadrant, top right quadrant, bottom left quadrant and bottom right quadrant. Swap size is varied by taking radii from … to … of the image width. These locations are indicated in Figure 5. For reference, this translates to a worst-case radius range of 9 to 40 pixels at the sixteenth down-sample condition.
This generates 20 images per swap location, per sample rate for a total of 1500 tampered images. Tampering score is evaluated for all images by comparing the tampered image against the expected source. Performance is defined as the mean average precision of labeling images as tampered or authentic. This data-set is designated Ras Pi—General Swap.
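The blob-swap simulation can be reproduced approximately with a circular mask copied between two registered images of the same scene; the centre coordinates and radii below are illustrative parameters chosen to mirror the locations in Figure 5.

```python
import numpy as np

def circular_blob_swap(img_a, img_b, center_xy, radius):
    """Copy a circular region of img_b into img_a (same scene, same size),
    mimicking the visually seamless removal of a face."""
    yy, xx = np.ogrid[:img_a.shape[0], :img_a.shape[1]]
    mask = (yy - center_xy[1]) ** 2 + (xx - center_xy[0]) ** 2 <= radius ** 2
    out = img_a.copy()
    out[mask] = img_b[mask]
    return out

# Example: one swap at the image centre and one in each quadrant (illustrative radius).
# h, w = img_a.shape[:2]
# centers = [(w // 2, h // 2), (w // 4, h // 4), (3 * w // 4, h // 4),
#            (w // 4, 3 * h // 4), (3 * w // 4, 3 * h // 4)]
# tampered = [circular_blob_swap(img_a, img_b, c, radius=w // 10) for c in centers]
```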
The ZEV tampering score is then calculated by computing an expected value across all zones in the image. Again, the purpose here is to simulate the detectability of face deletion across image locations (via matching blob-swaps).
4.3. Open-Source Benchmark Algorithms
The noise-verification framework is benchmarked against three open-source algorithms. The algorithms are selected because they are the best performing relevant algorithms on GitHub. Note that the FZV and ZEV algorithms are written in MATLAB, whereas the open-source algorithms are written in Python. For pragmatic reasons, the code is used as is. This has no bearing on accuracy, but it gives a slight run-time advantage to the proposed methods (MATLAB is optimized for matrix multiplication).
The first algorithm is a deep learning approach leveraging error level analysis (ELA) [37]. Gunawan et al. developed this approach by transforming images using ELA and feeding a lightweight convolutional neural network examples of authentic and spliced images from an open-source tampering data-set [38]. For this evaluation, the network is re-trained on the same experimental data. This algorithm is referred to as “ELA CNN”.
The second algorithm comes from an ensemble image forgery detection toolbox [39]. Levandoski and Lobo presented deterministic algorithms for re-compression, color filter array anomalies, noise variance and generic image duplication due to copying [39]. It is key to note that, due to run-time issues, the color filter array analysis is omitted; the algorithm takes over 30 min per image on this research machine and is not viewed as real-time. Given that there is no calibration setting, the algorithm suite is run directly on the experimental data without modification. This algorithm is referred to as “Forgery Tool”.
The third algorithm is discrete wavelet transform analysis for blind image tampering detection [40]. While a slightly older paper (2009), Mahdian and Saic applied a methodology that aligns very closely with this paper’s PRNU analysis; the key novelty here is the inclusion of a priori camera knowledge. For this evaluation, a tampering threshold is calibrated on the same experimental data. This algorithm is referred to as “DWT”.
4.4. Research Limitations
It is acknowledged that this verification approach does require a priori knowledge of the camera intrinsics. Furthermore, it is relevant to note there are real-world noise factors that this study did not introduce. For example, social media applications may employ their own proprietary compression methods to streamline data transmission; Meij et al. have started this investigation specifically for the WhatsApp communication tool [41], noting some degradation is to be expected. Additionally, real-world exterior camera applications, such as security monitoring, introduce environmental noises (e.g., ambient light, rain, dirt, dust, heat, etc.). Validating the impact of these noise sources would further justify this methodology as well as potentially identify other novel control methods.
5. Results
The two experiments are evaluated to validate the tampering score algorithms (against face-swaps and randomized blob-swaps). Classification performance is presented in the form of mean average precision in authenticity prediction. Analysis of down-sampling impact on PRNU features can be found in the appendix (Appendix A).
5.1. Exp 1: Direct Face-Swap Verification
The first experiment evaluates whether a detected face can be directly verified for authenticity. The face-zone-verification (FZV) algorithm is benchmarked on the Ras Pi—Face Swap data-set (1440 images constructed from no-swap, manual-swap and AI face-swaps [36]). The FZV tampering score is evaluated only on the zone containing the face centroid, using 1, 16 and 100 zones. The face centroid zone is estimated using the MATLAB cascade object detector [42].
Table 1 presents the benchmark face-swap-attack verification performance for the FZV and open-source algorithms. Authentic (no-swap) images are represented as “Authentic”. For readability, the results table is simplified to represent all manual face-swap-attacks as “Manual Face-Swap” and all AI face-swap-attacks as “AI Face-Swap”. A full sensitivity analysis for the FZV algorithm can be seen in the appendix (Appendix B), verifying that the difference between same-camera and different-camera swap performance is small.
These results demonstrate the FZV approach generally outperforms the open-source algorithms for mitigating face-swap-attacks. In particular, the FZV algorithm is optimized at quarter-scale with 16 sub-zones (indicated by the †). This includes 100% accuracy against manual FSAs and general robustness to AI FSAs in an efficient manner (run-time is shown in Section 5.3). Counter to intuition, isolating only the most relevant pixels for noise analysis (i.e., using more, smaller zones) seems to reduce performance. The small number of pixels instead produces a less reliable PRNU measurement, impacting tampering score precision. One possible way to improve performance with reasonable overhead is to apply a nearest-neighbor approach, including all adjacent zones.
The open-source results reinforce that AI face-swap detection is a significant challenge. At full-scale resolution, all three perform well. However, the aggressive calibration becomes problematic for down-sampled imagery. The deep learning approach (ELA CNN) [37] degrades to approximately a coin flip; the two deterministic algorithms [39,40] effectively determine every image to be tampered (arguably worse). This shows the utility of using a noise-verification approach in lieu of traditional classification.
In all cases, the compressed AI face-swaps are difficult to precisely verify. Two hypotheses are proposed. First, fewer pixels are tampered due to the intelligent blending. Second, the landmark-merged face does not resemble the target enough to fool a reasonable face recognizer. The first hypothesis is challenging to mitigate; however, the second implies the identification algorithm can provide end-to-end security. Note that the actual identification would be conducted at full-scale resolution. The benefit of the noise-verification framework is to only perform the authenticity analysis at compression; this means the full precision of the identification algorithm can be utilized.
Appendix B verifies that only detectable face-swaps pass an implemented face recognition test.
5.2. Exp 2: Simulated Service-Denial Verification
The second experiment evaluates the detectability of a simulated service-denial-attack. Service-denial-attacks imperceptibly remove faces, meaning the full image needs to be verified for authenticity. To this end, blob-swaps are performed for the sizes and locations shown in Figure 5. The full-image, zonal-expected-value (ZEV) algorithm is evaluated on the Ras Pi—General Swap data-set (1500 images constructed from no-swap and blob-swap images), using 1, 16 and 100 zones.
Table 2 presents the benchmark service-denial-attack verification performance for the ZEV and open-source algorithms. Authentic (no-swap) images are represented as “Authentic”. For readability, the results table is simplified to represent blob-swap locations as “Service-Denial”. A full sensitivity analysis for the ZEV algorithm can be seen in the appendix (Appendix B), verifying that the difference across locations is small.
The results demonstrate the ZEV approach also generally outperforms the open-source algorithms for mitigating service-denial-attacks. The ZEV performance is optimized at sixteenth-scale with 100 sub-zones (indicated by the ‡). Here, 100% accuracy is achieved over all blob-swaps and authentic images with relative efficiency (run-time shown in Section 5.3). This holds true over all blob-swap sizes and locations, with radii as small as … of the image width. The utility of the zonal analysis really presents itself here, where significant compression can be applied while retaining full robustness.
The open-source algorithms conversely show significant degradation with slight compression. The deep learning algorithm again seems to stabilize at a coin flip, but generally speaking, all are unreliable under heavy compression. There is at least less over-fitting this time, with a smaller difference between authentic and blob-swap images. This is postulated to be a result of the very small swaps, where the lack of local analysis significantly deteriorates performance. This further validates the utility of the proposed framework when using compression.
5.3. Run-Time Analysis
Authenticity-verification algorithms must fit within the primary application latency requirements (e.g., facial recognition). This is particularly true when considering high-risk use cases, where sensor encryption is minimized to reduce time. To evaluate these impacts, Table 3 quantifies the run-times of face-swap-attack and service-denial-attack verification. The noise-verification framework (NVF) algorithms (FZV and ZEV, respectively) are both referred to as NVF here.
Table 3 benchmarks the run-time of the proposed and open-source algorithms. The optimized FZV algorithm is indicated by † and the optimized ZEV algorithm is indicated by ‡. These results demonstrate the noise-verification framework can provide a notable advantage in efficiency. When it comes to face-swap-attack mitigation, the optimal open-source algorithm would be DWT at full-scale. This takes 48.2 ms, in comparison to the FZV taking 4.6 ms. This is a substantial improvement, making the FZV approach ideal for per-frame analysis with negligible overhead.
Service-denial-attack mitigation is inherently more computationally expensive. Rather than just verifying the detected face, the full image must be verified for authenticity (e.g., against face removal). Here, the ZEV algorithm again shows notable advantages in efficiency. The optimal open-source algorithm is the Forgery Tool at full scale, which takes 193.8 ms compared to the ZEV’s 106.9 ms. While an improvement, this is still too slow for per-frame verification. Instead, it is suggested to perform a periodic full-image authenticity challenge to verify a denial attack is not present.
6. Conclusions
In summary, this research presents a novel way to efficiently mitigate face-swap-attacks. Similar to spoofing attacks, face-swap-attacks enable imposters to fool authentication systems and avoid surveillance. Here, a noise-verification framework is presented to robustly verify facial image authenticity with minimal overhead. This is achieved by first characterizing the source camera for PRNU and then evaluating deviations from future images under compression. Experimental results show the proposed framework is both robust and significantly faster than the benchmarked algorithms.
The face-zone-verification algorithm can robustly mitigate face-swap-attacks. One hundred percent of images are correctly predicted as authentic or tampered at full scale. Performance is then optimized by applying quarter-scale down-sampling with 16 sub-zones; this achieves 93.5% accuracy while taking only 4.6 ms on CPU (a 99.1% reduction). In comparison, the optimal benchmark algorithm takes 48.2 ms, making this a significant improvement. This is fast enough to be conducted on a per-frame basis with negligible overhead to the face recognition application. If greater security is necessary, full-scale imagery with 100 sub-zones achieves 100% verification accuracy with a small increase in run-time.
To mitigate service-denial by removing faces (e.g., surveillance avoidance), a periodic zonal-expected-value challenge is also recommended. The optimized algorithm (sixteenth-scale with 100 sub-zones) achieves 100.0% verification accuracy while taking 106.9 ms. While notably faster than the optimal benchmark algorithm (193.8 ms), this is still too slow for per-frame analysis. Instead, this is best served as a periodic security challenge; the suggested times to run the ZEV algorithm are upon application initialization and whenever no face is detected for extended periods.
Future Directions
Our future research aims to investigate whether the developed models can be generalized to no longer require a priori knowledge of the camera. The deviation in “noiseprint” caused by performing swaps or blending images together could be identified via anomaly detection methods. This approach would likely require deep learning, where an AI system-on-chip would likely be necessary for real-time deployment. With this said, PRNU-based tampering detection that is naive to the source camera would have tremendous value in media distribution.