Media Forensics Considerations on DeepFake Detection with Hand-Crafted Features
Abstract
1. Introduction
- Using hand-crafted features for DeepFake detection and comparing their performance with that of state-of-the-art deep learning-driven approaches: We discuss three sets of hand-crafted features and three different fusion strategies to implement DeepFake detection. These features analyze the blinking behavior, the texture of the mouth region, and the degree of texture found in the image foreground. Our tests on three pre-existing reference databases show detection performances that are, under comparable test conditions, close to those of state-of-the-art methods using learned features (in our case, a maximum AUC of 0.960, compared to a maximum AUC of 0.998 for a recent approach using convolutional neural networks). Furthermore, in tests performed with different training and test sets, our approach shows generalization behavior similar to, if not better than, that of neural-network-based methods (i.e., the AUC drops from values larger than 0.9 to values smaller than 0.7). In addition to these detection performance issues, we discuss at length the main advantage that hand-crafted features have over learned features: their interpretability, and the consequences this might have for the plausibility validation of decisions made.
- Projection onto a forensic process model: With the aim of improving the maturity of pattern recognition-driven media forensics, we perform first steps of projecting our work onto an established forensic process model. For this, a derivative of the forensic process model for IT forensics published in 2011 by the German Federal Office for Information Security (BSI) is used here. This derivative, or more precisely extension, is called the Data-Centric Examination Approach (DCEA) and saw its latest major overhaul in 2020 in [7]. While it is not yet perfectly suited to the needs of media forensics analyses, our work shows first benefits of this modeling, as well as points where the DCEA would need further extension to fit those purposes.
2. Background and State of the Art
2.1. DeepFake Detection
2.2. Feature Space Design Alternatives
- (a)
- Features are especially designed (so-called hand-crafted) by domain experts for an application scenario in a process which, despite sometimes being called intuition-based feature design, usually requires strong domain knowledge. Here, the domain experts use their own experience to construct the features, encoding their knowledge about the semantics (and internal as well as external influence factors) inherent to the different pattern classes in the problem at hand. As a result, usually rather low-dimensional feature spaces are designed, which require only small sets of training data (or none at all) for the training (i.e., adaptation/calibration) to a specific application scenario. The semantic characteristics intrinsic to these feature spaces can easily be exploited to validate decisions made using such a feature space. Such features can also be the result of the transfer of features from other, related or similar pattern processing problems.
- (b)
- Feature spaces are learned by methods such as neural networks, where a structure (or architecture) for the feature space is designed (or chosen from a set of known-good architectures) and labelled training specimens are then used to train the network from scratch or to re-train an already existing network in transfer learning. The inherent characteristic of this process is that it requires very large sets of labelled, representative data for the training of the network (somewhat less so in the case of transfer learning). The resulting feature spaces and trained models usually lack the encoding of easily interpretable semantics.
- In the case of only small amounts of training data being available (which seems to be a problem encountered often in medical data analysis problems, including clinical studies where “the recruitment of a large number of patients or collection of large number of images is often impeded by patient privacy, limited number of disease cases, restricted resources, funding constraints or number of participating institutions” [29]), the classification performance of hand-crafted features (which usually show persistent detection performances even with small training datasets) outperforms that of feature spaces learned by neural networks. This is hardly astonishing, since it is well known that CNNs require a large amount of training data for reliable image classification. This situation changes with increasing training dataset sizes.
- Another advantage of hand-crafted features is interpretability. Lin et al. summarize this issue as follows: “Therefore, interpretability of [hand-crafted] features reveal why liver [magnetic resonance] images are classified as suboptimal or adequate” [29], i.e., these features allow for expert reasoning on errors, loss or uncertainty in decision making.
- Feature selection strategies help in learning about the significance and contextual relationships of hand-crafted features, while they fail to produce interpretable results for learned features. A minimal example of such an interpretable hand-crafted feature is sketched below.
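To make this contrast concrete, the following minimal Python sketch (an illustration, not the exact implementation used in this paper) computes the eye aspect ratio (EAR), a classic hand-crafted blink feature. The landmark ordering assumes the common 68-point facial landmark scheme as produced, e.g., by dlib [36]. The feature value carries directly interpretable semantics: it is large while the eye is open and drops towards zero when the eye closes.

```python
import numpy as np

def eye_aspect_ratio(eye):
    """Compute the eye aspect ratio (EAR) from six eye landmarks.

    `eye` is a (6, 2) array of (x, y) coordinates ordered as in the
    common 68-point landmark scheme: two corner points (indices 0 and 3)
    and two pairs of upper/lower lid points (indices 1/5 and 2/4).
    A human expert can directly reason about this value: an open eye
    yields a large EAR, a closed eye an EAR near 0, so thresholding the
    signal over time yields an interpretable open/closed state.
    """
    eye = np.asarray(eye, dtype=float)
    vertical = np.linalg.norm(eye[1] - eye[5]) + np.linalg.norm(eye[2] - eye[4])
    horizontal = np.linalg.norm(eye[0] - eye[3])
    return vertical / (2.0 * horizontal)
```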
2.3. A Data-Centric Examination Approach for Incident Response and Forensic Process Modeling
3. Solution Concept for DeepFake Detection with Hand-Crafted Features
- The design, implementation and empirical evaluation of features for DeepFake detection: Here, two feature spaces hand-crafted especially for DeepFake detection and a hand-crafted feature space derived from a different but similar pattern recognition problem domain (face morph detection) are implemented and evaluated. For the empirical evaluation, pre-existing reference databases containing DeepFake as well as benign ("original") face video sequences are used together with a pre-existing, out-of-the-box classification algorithm implementation. To facilitate the interpretation of the results and the comparability with other detector performances reported in the state of the art, different well-established metrics are used: detection accuracy, Cohen’s kappa, and the area under the curve of the receiver operating characteristic (ROC AUC).
- The discussion of different information fusion techniques and the empirical comparison with the detection performances of individual classifiers: Here, two different concepts are applied: feature-level fusion and decision-level fusion. For the latter, two popular choices, majority voting and weighted linear combination, are used and compared with single classifiers in terms of the classification performance achieved.
- The comparison of the detection performance of our hand-crafted features with performances of learned feature spaces from the state of the art in this field: Here, the results obtained by single classifiers as well as fusion approaches are compared in terms of detection accuracy with different approaches from the state of the art, relying on learned features.
- Attempts at validating the detectors’ decisions on the basis of the features and trained models: Some classifiers, such as the decision tree algorithm used in this paper, train models that can be read, interpreted and compared by humans. Here, we analyze the decision trees trained on different training sets to identify the most relevant features and to see how much these trees have in common and where they differ (a minimal sketch of such a tree inspection and evaluation follows after this list).
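For the model inspection and metric computation described above, the paper itself uses the WEKA toolkit with the J48 implementation of the C4.5 decision tree [40,41]. The following sketch is a rough stand-in using scikit-learn's CART-style decision tree instead (an assumption, not the original setup), with randomly generated placeholder data in place of the 37 hand-crafted features listed in Appendix A. It shows how accuracy, Cohen's kappa and ROC AUC are computed and how the trained tree can be printed as human-readable rules.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score, cohen_kappa_score, roc_auc_score

rng = np.random.default_rng(42)

# Placeholder stand-ins for the 37 hand-crafted features per video
# (see Appendix A); label 1 = DeepFake, 0 = benign.
X_train = rng.normal(size=(400, 37))
y_train = rng.integers(0, 2, size=400)
X_test = rng.normal(size=(100, 37))
y_test = rng.integers(0, 2, size=100)

# CART-style decision tree as a stand-in for WEKA's J48 (C4.5).
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)

y_pred = tree.predict(X_test)
y_score = tree.predict_proba(X_test)[:, 1]

# The three metrics used in the evaluation.
print(f"accuracy:      {accuracy_score(y_test, y_pred):.3f}")
print(f"Cohen's kappa: {cohen_kappa_score(y_test, y_pred):.3f}")
print(f"ROC AUC:       {roc_auc_score(y_test, y_score):.3f}")

# Human-readable model: the learned decision rules can be printed,
# compared across training sets, and checked for plausibility.
feature_names = [f"ID{i}fusion" for i in range(1, 38)]
print(export_text(tree, feature_names=feature_names))
```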
4. Implementation of the Individual Detectors and the Fusion Operators
4.1. Individual Detectors Using Hand-Crafted Features
4.1.1. DeepFake Detection Based on Eye Blinking
4.1.2. DeepFake Detection Based on Mouth Region
4.1.3. DeepFake Detection Based on Image Foreground
4.2. Fusion Operators
- Feature-level fusion: concatenation of all features;
- Decision-level fusion: simple majority voting (a minimal sketch of the decision-level operators follows after this list);
- Decision-level fusion: weighted, based on accuracy using TIMIT-DF for training;
- Decision-level fusion: weighted, based on accuracy using DFD for training;
- Decision-level fusion: weighted, based on accuracy using Celeb-DF for training.
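The following sketch illustrates the two decision-level operators (feature-level fusion is simply the concatenation of all feature values and is omitted). The accuracy-based weight normalization and the example numbers are illustrative assumptions, not the paper's exact weighting scheme; only the decision threshold of 0.65 is taken from the evaluation tables below.

```python
import numpy as np

def majority_vote(decisions):
    """Simple majority voting over binary detector decisions (1 = DeepFake).

    `decisions` has shape (n_detectors, n_videos); with three detectors,
    at least two must flag a video for it to be labelled a DeepFake.
    """
    decisions = np.asarray(decisions)
    return (decisions.sum(axis=0) > decisions.shape[0] / 2).astype(int)

def weighted_fusion(decisions, accuracies, threshold=0.65):
    """Weighted linear combination of detector decisions.

    Each detector's vote is weighted by its training-set accuracy,
    the weights are normalized to sum to 1 (an assumption), and the
    combined score is compared against a decision threshold (0.65 in
    the evaluation tables).
    """
    decisions = np.asarray(decisions, dtype=float)
    w = np.asarray(accuracies, dtype=float)
    w = w / w.sum()
    scores = w @ decisions
    return (scores >= threshold).astype(int)

# Toy example: decisions of the blinking, mouth and foreground detectors
# for four videos, plus placeholder accuracies per detector.
d = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 0]])
acc = [0.82, 0.94, 0.73]
print(majority_vote(d))         # -> [1 1 1 0]
print(weighted_fusion(d, acc))  # -> [1 1 0 0] for these weights
```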
5. Evaluation Results
5.1. Results for Individual Detectors
5.2. Results for Fusion Operators
6. Summary and Conclusions
6.1. Summary of the Results and Comparison with Other Approaches from the State of the Art
6.1.1. Performances and Generalization Power
6.1.2. Comparison of Feature Concepts
6.2. Comparison of Hand-Crafted and Learned Features for DeepFake Detection and Conclusions
7. Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Collection of Features Proposed in this Paper
ID (Fusion / Individual Detector) | Feature | Description |
---|---|---|
ID1fusion ID1blink | Maximum AspectRatio difference between both eyes. | The expected difference is close to 0, whereby a larger distance is suspected as an indication of a DeepFake. Additionally, the absence of winking is required for this feature. |
ID2fusion ID2blink | Absolute maximum AspectRatio rate of change for the left eye. | Based on several studies, eyelid movement varies with different aspects, e.g., age and gender [24,45]. Nevertheless, the maximum speeds, as well as the relation between opening and closing speeds, could be an indication for DeepFake detection. This rate of change is determined for each frame as the difference between the previous and the following frame. Normalization is carried out by multiplying the rate of change by the frame rate of the video, which yields the AspectRatio change every 3 s. The suitability of these features is based on the disregard of blink behavior in DeepFake synthesis. |
ID3fusion ID3blink | Maximum AspectRatio rate of change for the left eye. Maximum opening speed of the left eye. | |
ID4fusion ID4blink | Minimum AspectRatio rate of change for the left eye. Maximum closing speed of the left eye. | |
ID5fusion ID5blink | Absolute maximum AspectRatio rate of change for the right eye. | |
ID6fusion ID6blink | Maximum AspectRatio rate of change for the right eye. Maximum opening speed of the right eye. | |
ID7fusion ID7blink | Minimum AspectRatio rate of change for the right eye. Maximum closing speed of the right eye. | |
ID8fusion ID8blink | Noise count in the eye state signal. | Noise is defined as a rapid change of eye state, where one state lasts for a maximum of 0.08 seconds. A higher number of these noises is expected for DeepFakes. |
ID9fusion ID9blink | Percentage of video time at which the state open is classified. | Another feature that can be justified by studies about human blinking behavior [24,45]. Assuming a healthy person in a non-manipulated video, on average a value of about 0.9 should be expected. |
ID10fusion ID10blink | Minimum duration detected for the eye state open in seconds. | Features based on the durations of the states are again based on the knowledge of human blinking behavior. It is assumed that the eyes are open longer than they are closed. As a conclusion ID12blink < ID10blink and ID13blink < ID11blink are expected. |
ID11fusion ID11blink | Maximum duration detected for the eye state open in seconds. | |
ID12fusion ID12blink | Minimum duration detected for the eye state closed in seconds. | |
ID13fusion ID13blink | Maximum duration detected for the eye state closed in seconds. | |
ID14fusion ID1mouth | Absolute maximum rate of change in y-dimension. | This rate of change is determined for each frame as the difference between the previous and the following frame. Normalization is carried out by multiplying the rate of change by the frame rate of the video, which yields the AspectRatio change every 3 s. For these features, a maximum speed is assumed, which is determined by training the model. Exceeding this maximum speed is taken as an indication for the classification DeepFake. Limitation: this only works with videos in which the person moves their lips, e.g., when speaking. |
ID15fusion ID2mouth | Maximum rate of change in y-dimension. Lip opening movement in y-dimension. | |
ID16fusion ID3mouth | Minimum rate of change in y-dimension. Lip closing movement in y-dimension. | |
ID17fusion ID4mouth | Absolute maximum rate of change in x-dimension. | |
ID18fusion ID5mouth | Maximum rate of change in x-dimension. Lip opening movement in x-dimension. | |
ID19fusion ID6mouth | Minimum rate of change in x-dimension. Lip closing movement in x-dimension. | |
ID20fusion ID7mouth | Percentage of video time at which the state open without teeth is classified. | The assumption for feature ID7mouth is that DeepFakes are more often classified in this state compared to non-manipulated videos. The cause is the blending subprocess in the creation of DeepFakes, which leads to a loss of information and detail in the mouth region due to smoothing. As a consequence, DeepFakes are assumed to have both a comparatively low level of detail due to said blending and a comparatively high level of detail due to possible misclassification of open with teeth as open without teeth. Normalization takes place relative to the number of pixels in the TR (see Figure 4). The default value is set to −1 to be outside the considered range. |
ID21fusion ID8mouth | Maximum number of regions based on all frames of the video for state open without teeth. | |
ID22fusion ID9mouth | Maximum number of FAST keypoints based on all frames of the video for state open without teeth. | |
ID23fusion ID10mouth | Maximum number of SIFT keypoints based on all frames of the video for state open without teeth. | |
ID24fusion ID11mouth | Maximum number of Sobel edge pixels based on all frames of the video for state open without teeth. | |
ID25fusion ID12mouth | Percentage of video time at which the state open with teeth is classified. | The assumption for feature ID12mouth is that non-manipulated videos are more often classified in this state compared to DeepFakes. The cause is the blending subprocess in the creation of DeepFakes, which leads to a loss of information and detail in the mouth region due to smoothing. As a consequence, DeepFakes are assumed to have a comparatively low level of detail due to said blending. Normalization takes place relative to the number of pixels in the TR (see Figure 4). Default value is set to −1 to be outside the considered range. |
ID26fusion ID13mouth | Minimum number of regions based on all frames of the video for state open with teeth. | |
ID27fusion ID14mouth | Minimum number of FAST keypoints based on all frames of the video for state open with teeth. | |
ID28fusion ID15mouth | Minimum number of SIFT keypoints based on all frames of the video for state open with teeth. | |
ID29fusion ID16mouth | Minimum number of Sobel edge pixels based on all frames of the video for state open with teeth. | |
ID30fusion ID1foreground | Total number of frames in the video without a detectable face. | The consideration of these features is made under the assumption that DeepFake synthesis could result in artifacts, causing the face detection to fail. Normalization is relative to the number of frames of the video to ensure comparability regardless of the video length. |
ID31fusion ID2foreground | Total number of segments in the video without a detectable face. | |
ID32fusion ID3foreground | Maximum number of FAST keypoints based on all frames of the video for the image foreground. | The assumption for this set of features is that an almost constant value can be found throughout the course of the video. As a result, no significant differences between the minimum and maximum of each feature are expected; greater distances are seen as an indication of DeepFakes. Normalization is carried out on the basis of the two representations Face and ROI (see Figure 6 for reference), based on the level of detail as well as the number of pixels. In order to prevent division by 0, the default value is set to −1 to be outside the considered range. |
ID33fusion ID4foreground | Minimum number of FAST keypoints based on all frames of the video for the image foreground. | |
ID34fusion ID5foreground | Maximum number of SIFT keypoints based on all frames of the video for the image foreground. | |
ID35fusion ID6foreground | Minimum number of SIFT keypoints based on all frames of the video for the image foreground. | |
ID36fusion ID7foreground | Maximum number of Sobel edge pixel based on all frames of the video for the image foreground. | |
ID37fusion ID8foreground | Minimum number of Sobel edge pixel based on all frames of the video for the image foreground. |
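To make the blinking features in the table above concrete, the following sketch derives several of them (the rate-of-change normalization used for ID2blink–ID7blink and the state-based features ID8blink–ID13blink) from a per-frame EAR signal. The 0.08 s noise bound is taken from the table; the EAR open/closed threshold of 0.21 and all names are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def blink_features(ear, fps, open_thresh=0.21, noise_max_s=0.08):
    """Derive interpretable blink features from a per-frame EAR signal.

    `ear` is a 1-D array of eye-aspect-ratio values (one per frame).
    The 0.08 s noise bound follows ID8_blink in the feature table;
    the EAR open/closed threshold of 0.21 is an illustrative assumption.
    """
    ear = np.asarray(ear, dtype=float)

    # Rate of change per frame (difference between previous and following
    # frame), normalized by multiplying with the frame rate
    # (cf. the normalization described for ID2_blink ... ID7_blink).
    rate = np.gradient(ear) * fps

    # Binary eye-state signal: 1 = open, 0 = closed.
    state = (ear > open_thresh).astype(int)

    # Run-length encode the state signal into (state, duration_s) segments.
    change = np.flatnonzero(np.diff(state)) + 1
    bounds = np.concatenate(([0], change, [len(state)]))
    segs = [(state[a], (b - a) / fps) for a, b in zip(bounds[:-1], bounds[1:])]

    open_durs = [d for s, d in segs if s == 1]
    closed_durs = [d for s, d in segs if s == 0]
    return {
        "abs_max_rate": np.abs(rate).max(),                    # ID2_blink
        "max_rate": rate.max(),                                # ID3_blink (opening)
        "min_rate": rate.min(),                                # ID4_blink (closing)
        "noise_count": sum(d <= noise_max_s for _, d in segs), # ID8_blink
        "open_fraction": state.mean(),                         # ID9_blink (~0.9 expected)
        "min_open_s": min(open_durs, default=-1),              # ID10_blink
        "max_open_s": max(open_durs, default=-1),              # ID11_blink
        "min_closed_s": min(closed_durs, default=-1),          # ID12_blink
        "max_closed_s": max(closed_durs, default=-1),          # ID13_blink
    }
```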
References
- Chesney, R.; Citron, D. Deepfakes and the new disinformation war: The coming age of post-truth geopolitics. Foreign Aff. 2019, 98, 147.
- Vaccari, C.; Chadwick, A. Deepfakes and disinformation: Exploring the impact of synthetic political video on deception, uncertainty, and trust in news. Soc. Media Soc. 2020, 6.
- Palmer, G.L. A Road Map for Digital Forensics Research—Report from the First Digital Forensics Research Workshop (DFRWS); Technical Report DTR-T001-01 Final; Air Force Research Laboratory, Rome Research Site: Utica, NY, USA, 2001.
- Champod, C.; Vuille, J. Scientific Evidence in Europe—Admissibility, Evaluation and Equality of Arms. Int. Comment. Evid. 2011, 9.
- Krätzer, C. Statistical Pattern Recognition for Audio-Forensics—Empirical Investigations on the Application Scenarios Audio Steganalysis and Microphone Forensics. Ph.D. Thesis, Otto-von-Guericke-University, Magdeburg, Germany, 2013.
- U.S. Congress. Federal Rules of Evidence; Amended by the United States Supreme Court in 2021; Supreme Court of the United States: Washington, DC, USA, 2021.
- Kiltz, S. Data-Centric Examination Approach (DCEA) for a Qualitative Determination of Error, Loss and Uncertainty in Digital and Digitised Forensics. Ph.D. Thesis, Otto-von-Guericke-University, Magdeburg, Germany, 2020.
- Böhme, R.; Freiling, F.C.; Gloe, T.; Kirchner, M. Multimedia forensics is not computer forensics. In Computational Forensics; Geradts, Z.J.M.H., Franke, K.Y., Veenman, C.J., Eds.; Springer: Berlin/Heidelberg, Germany, 2009; pp. 90–103.
- Bondi, L.; Cannas, E.D.; Bestagini, P.; Tubaro, S. Training Strategies and Data Augmentations in CNN-based DeepFake Video Detection. arXiv 2020, arXiv:2011.07792.
- FakeApp 2.2.0. Available online: https://www.malavida.com/en/soft/fakeapp (accessed on 30 June 2021).
- Nguyen, T.T.; Nguyen, C.M.; Nguyen, D.T.; Nguyen, D.T.; Nahavandi, S. Deep Learning for Deepfakes Creation and Detection. arXiv 2021, arXiv:1909.11573.
- Rössler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; Nießner, M. FaceForensics++: Learning to Detect Manipulated Facial Images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019.
- Li, Y.; Yang, X.; Sun, P.; Qi, H.; Lyu, S. Celeb-DF: A Large-Scale Challenging Dataset for DeepFake Forensics. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 3204–3213.
- Sabir, E.; Cheng, J.; Jaiswal, A.; AbdAlmageed, W.; Masi, I.; Natarajan, P. Recurrent Convolutional Strategies for Face Manipulation Detection in Videos. arXiv 2019, arXiv:1905.00582.
- Li, Y.; Chang, M.; Lyu, S. In Ictu Oculi: Exposing AI Generated Fake Face Videos by Detecting Eye Blinking. arXiv 2018, arXiv:1806.02877.
- Korshunov, P.; Marcel, S. DeepFakes: A New Threat to Face Recognition? Assessment and Detection. arXiv 2018, arXiv:1812.08685.
- Tolosana, R.; Vera-Rodríguez, R.; Fiérrez, J.; Morales, A.; Ortega-Garcia, J. DeepFakes and Beyond: A Survey of Face Manipulation and Fake Detection. arXiv 2020, arXiv:2001.00179.
- Li, Y.; Lyu, S. Exposing DeepFake Videos By Detecting Face Warping Artifacts. arXiv 2018, arXiv:1811.00656.
- Matern, F.; Riess, C.; Stamminger, M. Exploiting Visual Artifacts to Expose Deepfakes and Face Manipulations. In Proceedings of the 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 83–92.
- Yu, P.; Xia, Z.; Fei, J.; Lu, Y. A Survey on Deepfake Video Detection. IET Biom. 2021.
- Yang, X.; Li, Y.; Lyu, S. Exposing Deep Fakes Using Inconsistent Head Poses. arXiv 2018, arXiv:1811.00661.
- McCloskey, S.; Albright, M. Detecting GAN-generated Imagery using Color Cues. arXiv 2018, arXiv:1812.08247.
- Agarwal, S.; Farid, H.; Gu, Y.; He, M.; Nagano, K.; Li, H. Protecting World Leaders Against Deep Fakes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Long Beach, CA, USA, 16–20 June 2019.
- Jung, T.; Kim, S.; Kim, K. DeepVision: Deepfakes Detection Using Human Eye Blinking Pattern. IEEE Access 2020, 8, 83144–83154.
- Ciftci, U.A.; Demir, I. FakeCatcher: Detection of Synthetic Portrait Videos using Biological Signals. arXiv 2019, arXiv:1901.02212.
- Verdoliva, L. Media Forensics and DeepFakes: An overview. arXiv 2020, arXiv:2001.06564.
- Bach, S.; Binder, A.; Montavon, G.; Klauschen, F.; Müller, K.R.; Samek, W. On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLoS ONE 2015, 10, e0130140.
- Samek, W.; Binder, A.; Montavon, G.; Lapuschkin, S.; Müller, K.R. Evaluating the Visualization of What a Deep Neural Network Has Learned. IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 2660–2673.
- Lin, W.; Hasenstab, K.; Cunha, G.M.; Schwartzman, A. Comparison of handcrafted features and convolutional neural networks for liver MR image adequacy assessment. Sci. Rep. 2020, 10, 20336.
- Sánchez-Maroño, N.; Alonso-Betanzos, A.; Tombilla-Sanromán, M. Filter Methods for Feature Selection—A Comparative Study. In Intelligent Data Engineering and Automated Learning—IDEAL 2007; Yin, H., Tino, P., Corchado, E., Byrne, W., Yao, X., Eds.; Springer: Berlin/Heidelberg, Germany, 2007; pp. 178–187.
- Law, M.; Figueiredo, M.; Jain, A. Simultaneous feature selection and clustering using mixture models. IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 1154–1166.
- Kiltz, S.; Dittmann, J.; Vielhauer, C. Supporting Forensic Design—A Course Profile to Teach Forensics. In Proceedings of the 2015 Ninth International Conference on IT Security Incident Management and IT Forensics, Magdeburg, Germany, 18–20 May 2015; pp. 85–95.
- Altschaffel, R. Computer Forensics in Cyber-Physical Systems: Applying Existing Forensic Knowledge and Procedures from Classical IT to Automation and Automotive. Ph.D. Thesis, Otto-von-Guericke-University, Magdeburg, Germany, 2020.
- Kiltz, S.; Hoppe, T.; Dittmann, J. A New Forensic Model and Its Application to the Collection, Extraction and Long Term Storage of Screen Content off a Memory Dump. In Proceedings of the 16th International Conference on Digital Signal Processing, DSP’09, Santorini, Greece, 5–7 July 2009; IEEE Press: New York, NY, USA, 2009; pp. 1135–1140.
- Sagonas, C.; Antonakos, E.; Tzimiropoulos, G.; Zafeiriou, S.; Pantic, M. 300 Faces In-The-Wild Challenge. Image Vis. Comput. 2016, 47, 3–18.
- King, D.E. Dlib-ml: A Machine Learning Toolkit. J. Mach. Learn. Res. 2009, 10, 1755–1758.
- 2d Face Sets—Utrecht ECVP. Available online: http://pics.stir.ac.uk/2D_face_sets.htm (accessed on 19 May 2021).
- Makrushin, A.; Neubert, T.; Dittmann, J. Automatic generation and detection of visually faultless facial morphs. In Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications—Volume 6: VISAPP (VISIGRAPP 2017), Porto, Portugal, 27 February–1 March 2017; SciTePress: Setubal, Portugal, 2017; pp. 39–50.
- Kraetzer, C.; Makrushin, A.; Neubert, T.; Hildebrandt, M.; Dittmann, J. Modeling Attacks on Photo-ID Documents and Applying Media Forensics for the Detection of Facial Morphing. In Proceedings of the 5th ACM Workshop on Information Hiding and Multimedia Security, IH&MMSec ’17, Philadelphia, PA, USA, 20–22 June 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 21–32.
- Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I.H. The WEKA data mining software: An update. SIGKDD Explor. 2009, 11, 10–18.
- Quinlan, J.R. C4.5: Programs for Machine Learning; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1993.
- Sanderson, C.; Lovell, B. Multi-Region Probabilistic Histograms for Robust and Scalable Identity Inference. LNCS 2009, 5558, 199–208.
- Rössler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; Nießner, M. FaceForensics: A Large-scale Video Dataset for Forgery Detection in Human Faces. arXiv 2018, arXiv:1803.09179.
- Dufour, N.; Gully, A.; Karlsson, P.; Vorbyov, A.V.; Leung, T.; Childs, J.; Bregler, C. DeepFakes Detection Dataset by Google & JigSaw. Available online: https://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html (accessed on 19 May 2021).
- Wubet, W.M. The Deepfake Challenges and Deepfake Video Detection. Int. J. Innov. Technol. Explor. Eng. 2020, 9.
- DeBruine, L.; Jones, B. Face Research Lab London Set. Available online: https://figshare.com/articles/dataset/Face_Research_Lab_London_Set/5047666/1 (accessed on 19 May 2021).
- Agarwal, S.; Farid, H.; Fried, O.; Agrawala, M. Detecting Deep-Fake Videos From Phoneme-Viseme Mismatches. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Seattle, WA, USA, 14–19 June 2020.
- Bradski, G. The OpenCV Library. Dr. Dobb’s J. Softw. Tools 2000, 120, 122–125.
- Ross, A.A.; Nandakumar, K.; Jain, A.K. Levels of Fusion in Biometrics. In Handbook of Multibiometrics; Springer: Boston, MA, USA, 2006; pp. 59–90.
- Kuncheva, L.I. Combining Pattern Classifiers: Methods and Algorithms; Wiley: Hoboken, NJ, USA, 2004.
- Rana, S.; Ridwanul, M. DeepFake Audio Detection. GitHub repository. Available online: https://github.com/dessa-oss/fake-voice-detection (accessed on 30 May 2021).
- Zhang, W.; Zhao, C.; Li, Y. A Novel Counterfeit Feature Extraction Technique for Exposing Face-Swap Images Based on Deep Learning and Error Level Analysis. Entropy 2020, 22, 249.
- Zhang, W.; Zhao, C. Exposing Face-Swap Images Based on Deep Learning and ELA Detection. Proceedings 2020, 46, 29.
- Krawetz, N. A Picture’s Worth… Digital Image Analysis and Forensics. In Proceedings of the Black Hat Briefings 2007, Las Vegas, NV, USA, 28 July–2 August 2007; pp. 1–31.
Sets of Examination Steps | Description (According to [7]) |
---|---|
Strategic preparation (SP) | Includes measures taken by the operator of an IT system and by the forensic examiners in order to support a forensic investigation prior to an incident |
Operational preparation (OP) | Includes measures of preparation for a forensic investigation after the detection of a suspected incident |
Data gathering (DG) | Includes measures to acquire and secure digital evidence |
Data investigation (DI) | Includes measures to evaluate and extract data for further investigation |
Data analysis (DA) | Includes measures for detailed analysis and correlation between digital evidence from various sources |
Documentation (DO) | Includes measures for the detailed documentation of the proceedings, also for the transformation into a different form of description for the report of the incident |
Forensic Data Type | Description (According to [7]) |
---|---|
Raw data | A sequence of bits or data streams of system components not (yet) classified |
Hardware data | Data not or only in a limited way influenced by the OS and application |
Details about data | Meta data describing other data |
Configuration data | Modify the behavior of the system and applications |
Communication protocol data | Modify the communication behavior of the system |
Process data | Data about a running process |
Session data | Data collected by a system during a session |
User data | Content created, edited or consumed by the user |
Forensic Data Type | Description (According to [7]) |
---|---|
Raw sensor data (DD1) | Digital input data from the digitalization process (e.g., scans of test samples) |
Processed signal data (DD2) | Results of transformations to raw sensor data (e.g., visibility enhanced fingerprint pattern) |
Contextual data (DD3) | Contain environmental data (e.g., spatial information, spatial relation between traces, temperature, humidity) |
Parameter data (DD4) | Contain settings and other parameters used for acquisition, investigation and analysis |
Trace characteristic feature data (DD5) | Describe trace specific investigation results (e.g., level 1/2/3 fingerprint features) |
Substrate characteristic feature data (DD6) | Describe trace carrier specific investigation results (e.g., surface type, individual surface characteristics) |
Model data (DD7) | Describe trained model data (e.g., surface specific scanner settings, reference data) |
Classification result data (DD8) | Describes classification results gained by applying machine learning and comparable approaches |
Chain of custody data (DD9) | Describe data used to ensure integrity and authenticity and process accompanying documentation (e.g., cryptographic hash sums, certificates, device identification, time stamps) |
Report data (DD10) | Describe data for the process accompanying documentation and for the final report |
Sets of Methods for the Forensic Process in Digital Forensics | Description (According to [7]) |
---|---|
Operating system (OS) | Methods that provide forensically relevant data as well as serving their main purpose of distributing computing resources |
File system (FS) | Methods that provide forensically relevant data as well as serving their main purpose of maintaining the file system |
IT application (ITA) | Methods provided by IT applications that provide forensically relevant data as well as serving their main purpose |
Explicit means of intrusion detection (EMID) | Methods that are executed autonomously on a routine basis and without a suspicion of an incident |
Scaling of methods for evidence gathering (SMG) | Methods that are unsuited for routine usage in a production environment (e.g., due to false positives or high computation power requirements) |
Data processing and evaluation (DPE) | Dedicated methods of the forensic process that display, process or document information |
Dataset | Number of Videos | Reference |
---|---|---|
VidTIMIT | 430 * | [42] |
TIMIT-DF | 640 | [16,42] |
DFD-source | 55 * | [12,44] |
DFD-DF | 55 * | [12,44] |
Celeb-YouTube | 60 * | [13] |
Celeb-real | 60 * | [13] |
Celeb-DF (v2) | 120 * | [13] |
↓ Training/Testing → | TIMIT-DF | DFD | Celeb-DF |
---|---|---|---|
TIMIT-DF | scenario 1 (S1) | scenario 2 (S2) | scenario 3 (S3) |
DFD | scenario 4 (S4) | scenario 5 (S5) | scenario 6 (S6) |
Celeb-DF | scenario 7 (S7) | scenario 8 (S8) | scenario 9 (S9) |
Detection accuracy (Cohen’s kappa in parentheses):

Training Dataset → | TIMIT-DF [16,42] | | | DFD [12,44] | | | Celeb-DF [13] | | |
---|---|---|---|---|---|---|---|---|---|
↓ Proposed Method / Test Dataset → | TIMIT-DF | DFD | Celeb-DF | TIMIT-DF | DFD | Celeb-DF | TIMIT-DF | DFD | Celeb-DF |
DeepFake detection based on eye blinking | 82.15% (0.63) | 50.00% (0.00) | 57.50% (0.15) | 58.32% (0.15) | 59.09% (0.18) | 52.92% (0.06) | 62.06% (0.07) | 58.18% (0.16) | 69.17% (0.38) |
DeepFake detection based on mouth region | 94.21% (0.88) | 76.36% (0.53) | 56.67% (0.13) | 64.95% (0.29) | 96.36% (0.93) | 53.75% (0.08) | 67.85% (0.34) | 69.09% (0.38) | 94.58% (0.89) |
DeepFake detection based on image foreground | 73.36% (0.42) | 53.64% (0.07) | 56.67% (0.13) | 58.33% (0.17) | 73.64% (0.47) | 54.02% (0.11) | 60.83% (0.13) | 54.55% (0.09) | 70.83% (0.42) |
Feature-level fusion | 97.57% (0.95) | 66.36% (0.33) | 54.58% (0.09) | 65.05% (0.30) | 97.27% (0.95) | 56.25% (0.13) | 63.74% (0.27) | 60.00% (0.20) | 94.17% (0.88) |
Decision-level fusion: simple majority voting | 91.03% (0.81) | 69.09% (0.38) | 58.75% (0.18) | 59.72% (0.24) | 61.18% (0.24) | 52.08% (0.04) | 65.42% (0.20) | 62.73% (0.25) | 86.25% (0.73) |
Decision-level fusion: weighted (threshold = 0.65) | 94.77% (0.89) | 70.91% (0.42) | 57.50% (0.15) | 67.00% (0.33) | 95.45% (0.91) | 53.75% (0.08) | 68.04% (0.34) | 65.45% (0.31) | 90.42% (0.81) |
(ROC) AUC values:

Training Dataset → | DeepFakeDetection (DFD) [12,44] | | | Celeb-DF [13] | | |
---|---|---|---|---|---|---|
↓ Fusion Method / Test Dataset → | TIMIT-DF | DFD | Celeb-DF | TIMIT-DF | DFD | Celeb-DF |
Ours: simple majority voting | 0.668 | 0.947 | 0.556 | 0.690 | 0.685 | 0.925 |
Ours: weighted based on accuracy using DFD for training | 0.685 | 0.960 | 0.556 | 0.682 | 0.712 | 0.954 |
Ours: weighted based on accuracy using Celeb-DF for training | 0.685 | 0.960 | 0.556 | 0.698 | 0.712 | 0.955 |
[9]: Baseline | - | 0.987 | 0.754 | - | 0.708 | 0.998 |
[9]: Triplet Training | - | 0.882 | 0.759 | - | 0.554 | 0.995 |
[9]: EfficientNetB4, Binary Cross Entropy with augmentation | - | 0.990 | 0.842 | - | 0.795 | 0.998 |
[9]: EfficientNetB4, Triplet Loss with augmentation | - | 0.982 | 0.809 | - | 0.604 | 0.995 |