Digital Video Tampering Detection and Localization: Review, Representations, Challenges and Algorithm

Abstract: Digital videos are now low-cost, easy to capture and easy to share on social media due to the common feature of video recording in smart phones and digital devices. However, with the advancement of video editing tools, videos can be tampered (forged) easily for propaganda or to gain illegal advantages; ultimately, the authenticity of videos shared on social media cannot be taken for granted. Over the years, significant research has been devoted to developing new techniques for detecting different types of video tampering. In this paper, we offer a detailed review of existing passive video tampering detection techniques in a systematic way. The answers to research questions prepared for this study are also elaborated. The state-of-the-art research work is analyzed extensively, highlighting the pros and cons and commonly used datasets. Limitations of existing video forensic algorithms are discussed, and we conclude with research challenges and future directions.


Introduction
The availability of sophisticated low-cost digital video cameras in mobile phones and gadgets, together with a large number of video-sharing websites such as YouTube, Facebook, and Dailymotion, plays an important role in daily life in disseminating and sharing visual information. The visual data can also serve as powerful evidence before a court of law to verify or support the testimony of a person being questioned. In the presence of sophisticated and user-friendly video-editing software, the genuineness of videos cannot be taken for granted. With advanced editing tools, information manipulation has become easy. Videos can be edited by inserting or deleting objects/events, with good or bad intentions [1].
Law enforcement agencies do not accept videos as evidence without their forensic reports. Not every instance of video tampering has equal significance, e.g., tampered footage of a pop star is not as harmful as tampered footage of a crime scene [2]. The film industry benefits from video editing technologies to add virtual reality to scenes. Video evidence is also important for news reporting, intelligence agencies, insurance companies, copyright, criminal investigations, etc. Forensic analysis of videos and images is the focus of recent research to ensure the authenticity of multimedia content [3]. Such research is never-ending due to the progressive advancement in video editing tools.
Progress in video tampering has a significant effect on our society. Although only a few digital video forgeries have been exposed, such instances have eroded public trust in video clips [4].
The objective of video tampering detection is to ensure the authenticity and to expose the potential modifications and forgeries (i.e., to verify whether a video is authentic or

Organization of This Study
The remaining part of this study is organized in the following sections. Section 2 describes this study's distinction from other survey papers. Section 3 elaborates the survey protocol of this study. Section 4 explains the types of video tampering (forgery). Section 5 provides the detail of video forensic detection approaches. Sections 6 and 7 elaborate the state-of-the-art spatial and temporal tampering detection techniques, datasets used, comparison, and limitations. Section 8 concludes the analysis, and challenges are highlighted. In Section 9, future directions are presented, and finally, Section 10 concludes this review.

Distinction from Other Surveys
Considering the fact that video tampering detection has been maturely developed and enough research work has been published on passive video detection techniques, a comprehensive analysis on proposed schemes for passive video tampering detection and localization is required to determine future research directions. To our knowledge, this is the first study of systematic literature surveys in the domain of passive video tampering detection.
Several researchers have reviewed video tampering (forgery) detection techniques. Details of the works published so far are shown in Table 1. A few papers [8,9] published in reputable journals have focused on passive video tampering detection techniques. Journal papers [4,10,11] partially discussed the video techniques and their focus was on image tampering detection techniques. Review papers [12-18] are published in conferences and less reputable journals. Rocha et al. [4] reviewed the video forgery detection and localization issues by discussing two video tampering techniques, but without highlighting the pros and cons of video tampering detection techniques. Moreover, the major emphasis of this review was on image forensics rather than video forensics. Similarly, Milani et al. [11] discussed video acquisition and compression issues only. This review is also a partial and mixed representation of image and video forensic analysis. Pandey et al. [10] presented a review of passive techniques for image and video tampering detection but only focused on techniques that are based on noise features. This survey highlighted that the video tampering detection domain is facing issues such as video acquisition, post-processing operations (compression, blurring, noise addition, geometric transformation) and robustness. Sharma et al. [17] reviewed passive techniques, but their discussion was limited to only copy-move attacks on digital videos.
Sitara et al. [9] also analyzed passive tampering detection methods and their limitations, but there is no discussion comparing the accuracy of these methods, which is an important part of our survey paper. Singh et al. in [8] reported that few video forgery detection methods have been evaluated extensively because of the lack of large-scale video forgery datasets and baselines for comparison of different video forgery detection techniques. There is a dire need for a comprehensive collection of videos for advanced evaluation of video forensic techniques; however, they highlighted the non-availability of a large-scale video forensic dataset only. Tao et al. in [18] and Mizher et al. in [19] reviewed video tampering detection in comprehensive ways, but these papers were published in 2017 and thus, several current state-of-the-art techniques are not considered.
Sharma et al. [20] reviewed existing video forgery detection techniques by their functionality. The review has strength in terms of exploring video forgery detection techniques by their functionality and datasets. Johnston et al. in [21] critically reviewed spatial video forgery detection techniques based on deep learning. Existing video tampering datasets used to evaluate video forgery detection techniques were also reviewed. The researchers highlighted the challenges and trends in video forgery detection in the spatial domain; however, the research gaps in the temporal domain of video forgery detection still need to be explored. In a recent survey, Kaur and Jindal [22] explored the current challenges and trends in the domain of image and video forensics. The review was focused on highlighting the image copy-move and splicing forgeries, and inter- and intra-frame video forgery challenges. Issues regarding benchmarking and datasets were also highlighted. This review presented both image and video forgery issues, but the major focus was on highlighting the issues in the image forensic domain, and few aspects related to video forgery forensics are elaborated. Recently, Shelke and Kasana [23] presented a comprehensive survey on passive techniques for video forgery detection based on features, types of forgeries identified, datasets and performance parameters. Pros and cons of different passive forgery detection techniques are elaborated, along with future challenges. Anti-forensics techniques, deep fake detection in videos and a brief review of existing datasets of video forgery are also included in this survey paper.

• Almost all published papers in the domain of video forgery/tampering to date are considered to show the overall picture of research contribution in the field.
• To our knowledge, this is the first systematic comprehensive survey to filter out rich research contributions in the domain.
• This survey is categorized based on the proposed methodologies for easy comparison of their performance evaluation and the selection of the most suitable technique.
• This review will be helpful to new researchers regarding the issues and challenges faced by the community in this domain. Moreover, this paper analyzes the research gaps found in the literature that will help future researchers to identify and explore new avenues in the domain of video forensics.

Survey Protocol
The objective of this systematic study is to perceive and arrange the strategies, models, methods, and tools that are used to investigate existing video tampering techniques. The procedure of systematic study helps us to analyze the available research in the subject domain. In this study, the guidelines of systematic literature survey [26] are followed and the survey protocol plan of this study is shown in Figure 1. The following subsections elaborate the research questions, search string and inclusion/exclusion criteria, extract data and present their analysis.

Research Questions
The first step of the systematic survey is to define the research questions. Various research questions were formulated to conduct this survey. The answers to the first, second, third and fifth questions are explained in Sections 4-7. The answer to question 4 is elaborated in Section 8.

Search Strategy
An efficient search strategy is required to extract the appropriate information and filter out inappropriate studies from the research area. For this purpose, a dynamic search string was prepared, based on research questions, keywords, and alternate words for major keywords. The search string is a combination of "OR" and "AND" Boolean operator, given below. {

Research Inclusion/Exclusion Criteria
Firstly, search criteria were set to extract the maximum number of publications from the selected sources. The publication years were limited to 2007 through 2021. In order to gather the most relevant papers, the selection process was divided into three stages. In the first stage, the titles of the papers were checked to remove duplicate and irrelevant papers. In the second stage, we read the abstracts of the papers retained from the first stage to select those relevant to the focus area. In the last stage, we read each paper in detail and finalized the papers for this study. A total of 122 papers were selected as the most relevant papers in the domain of passive video forgery. Similarly, a total of 99 research papers were selected for the primary analysis. These papers were selected because they are published in reputable journals or conferences and have more citations. The year-wise details of these papers published in conferences, journals, and others (books and theses) are shown in Figure 2, which depicts an overall pictorial representation of published papers, books, and theses in the past 15 years on video tampering detection using blind or passive techniques. It highlights that in recent years, passive techniques for video forgery detection are drawing significant attention in the research community. There is a demand in many areas such as judicial forensics, the insurance industry, information security, etc., to develop robust, standard, and economically feasible techniques for the detection of a wide variety of tampering in digital videos to overcome these challenges related to passive video forgery detection. Much progress has been achieved over the past few years, but certain important milestones still remain unmet. This is because the wide range of possible alterations that can be applied to digital content makes tampered content practically indistinguishable from genuine content.
The absence of a universally applicable solution to this problem has gained the attention of the scientific community and researchers. Table 2 represents the published papers on passive (blind) forgery detection techniques that are categorized according to standard journals such as IEEE, Springer, Elsevier and others, or well-known conferences. Concrete and category-wise discussions on the papers presented in Table 2 are given in Sections 6 and 7.

Types of Video Tampering (Forgery)
Videos are usually tampered in the following ways: (a) tampering in the spatial domain, (b) tampering in the temporal domain, (c) spatio-temporal tampering and (d) re-projection [2,122]. Details of spatial, temporal and spatio-temporal tampering are highlighted in Figure 3. In this figure, F_i represents the ith frame, where i = 1, 2, ..., n, P_HW is the pixel intensity, and H and W are frame height and width, respectively. F'_i is the manipulated ith frame and P'_HW is the manipulated pixel intensity. A forger can tamper source videos spatially (i.e., spatial forgery) by manipulating a block of pixels within a video frame or in adjacent video frames, as shown in Figure 3b. Furthermore, as presented in Figure 3c, source videos can be tampered with respect to time (i.e., temporal forgery) by disturbing the frame sequence through replacement, reordering, addition, and removal of video frames. Lastly, Figure 3d shows video tampering by combining both spatial and temporal domains (i.e., spatio-temporal forgery). Re-projection means recording a movie from the theatre screen, by which the forger violates copyright law.
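To make the spatial and temporal operations described above concrete, the following is a minimal sketch on a synthetic grayscale video stored as a NumPy array of shape (n, H, W). The function names and the toy video (frame i filled with the value i) are illustrative assumptions, not part of any surveyed method or dataset.

```python
import numpy as np

def make_video(n_frames=6, height=4, width=4):
    """Synthetic grayscale video: every pixel of frame i has the value i."""
    return np.stack([np.full((height, width), i, dtype=np.uint8)
                     for i in range(n_frames)])

def spatial_tamper(video, frame_idx, top, left, patch):
    """Spatial forgery: overwrite a block of pixels inside one frame."""
    forged = video.copy()
    h, w = patch.shape
    forged[frame_idx, top:top + h, left:left + w] = patch
    return forged

def remove_frames(video, start, stop):
    """Temporal forgery: drop frames start .. stop-1."""
    return np.concatenate([video[:start], video[stop:]])

def insert_frames(video, position, new_frames):
    """Temporal forgery: add frames before index `position`."""
    return np.concatenate([video[:position], new_frames, video[position:]])

def replace_frames(video, start, new_frames):
    """Temporal forgery: overwrite frames beginning at index `start`."""
    forged = video.copy()
    forged[start:start + len(new_frames)] = new_frames
    return forged

def reorder_frames(video, order):
    """Temporal forgery: rearrange frames according to `order`."""
    return video[list(order)]
```

A spatio-temporal forgery, in this toy setting, is simply a composition of `spatial_tamper` with one of the temporal functions.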

Video Tampering Detection
Video tampering detection approaches can be broadly classified into active and passive (blind) [4,11,13-16], as shown in Figure 4 and described in the following subsections.

Active Approaches
The active approaches can be further divided into two categories: approaches based on watermarks and approaches based on digital signatures [123]. There are several kinds of watermarks. Fragile and semi-fragile watermarks are used to detect video forgery [124,125]. Fragile watermarking works by inserting invisible information into the video. If an attempt is made to modify the contents of the video, that invisible information (watermark) is also altered, and hence, forgery is detected. Semi-fragile watermarking is less sensitive to change as compared to fragile watermarking. For both the fragile and semi-fragile techniques, a watermark must be inserted when the video is recorded, which makes active techniques dependent on both algorithmic and hardware implementation [2]. Not all capturing devices have the capability to embed digital signatures or watermarks. If this information is embedded intentionally in videos after the acquisition phase, this method may fail in situations where tampering is carried out before inserting the signature or watermark. Since most of the videos reported in datasets for the experimental evaluation of video forgery detection and localization carry no prior information about their watermark or signature, our survey is focused on passive techniques instead of active techniques, which are highlighted in the red dotted box in Figure 4.
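The fragile watermarking principle described above can be illustrated with a minimal sketch. It assumes a simple least-significant-bit (LSB) scheme with a key-seeded pseudo-random bit pattern; this is a textbook-style illustration chosen for brevity, not the scheme of [124,125], and the function names are hypothetical.

```python
import numpy as np

def _watermark_bits(shape, key):
    """Key-dependent pseudo-random bit pattern (assumed shared secret)."""
    rng = np.random.default_rng(key)
    return rng.integers(0, 2, size=shape, dtype=np.uint8)

def embed_fragile_watermark(frame, key):
    """Replace the least significant bit of every pixel with a watermark bit."""
    bits = _watermark_bits(frame.shape, key)
    return (frame & np.uint8(0xFE)) | bits

def tampered_mask(frame, key):
    """True wherever the stored LSB no longer matches the expected bit."""
    bits = _watermark_bits(frame.shape, key)
    return (frame & np.uint8(1)) != bits
```

Because an arbitrary edit changes each LSB only about half of the time, practical schemes of this kind usually evaluate the mismatch rate per block rather than per pixel; the sketch also shows why the watermark has to be embedded before any tampering takes place.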

Passive Approaches
Passive video tampering detection techniques do not require any prior information that is embedded in videos, such as digital watermarks or signatures. These techniques work by exploiting traces that tampering leaves in the frames of the video and that cannot be seen with the naked eye. However, the statistical properties are changed during the tampering process. Due to the change in statistics, the inconsistencies of different features such as noise, residues, texture, abnormalities in optical flow (OF), etc., can be used in passive approaches. Furthermore, when forensic analysis of a video is required, the source video is typically not available and forensic experts must make decisions based on the current (under observation) video. In this case, active techniques are not workable and passive techniques are the best choice. Passive approaches are further divided into spatial and temporal tampering detection techniques, which are discussed in Sections 6 and 7, respectively.

Review of Spatial (Intra-Frame) Video Tampering Detection Techniques
Different types of information (artifacts or footprints) are available to forensic experts for the detection of spatial tampering and localization. According to this information, the methods are categorized into the following categories, shown in Figure 5: (i) methods based on deep learning, (ii) methods based on camera source features, (iii) methods based on pixels and texture features, (iv) methods based on SVD (Singular Value Decomposition), (v) methods based on compression features and (vi) methods based on statistical features. These categories are discussed in the following subsections.

Methods Based on Deep Learning
Deep learning is a sub-domain of machine learning based on neural networks. Problem-specific, complex, high-dimensional features can be extracted with the help of deep learning techniques, which are helpful for classification tasks. Zampoglou et al. [106] applied Q4 and Cobalt forensic filters with pre-trained ResNet and GoogLeNet networks for the detection of spatial video forgery. Two datasets, Dev1 and Dev2, are used to evaluate the method.
The Dev1 dataset contains 30 authentic and 30 tampered videos while Dev2 contains 86 pairs of videos having 44 k and 134 k frames. The accuracy achieved on the union of Dev1 and Dev2 is 85.09%, and mean average precision is 93.69%. Yao et al. [79] used a CNN (Convolutional Neural Network) to extract complex high-dimensional features. The absolute difference between consecutive frames is used to reduce the temporal redundancy, a max pooling layer is introduced to reduce the computational complexity, and a high-pass filter layer is placed to boost the residual left by the tampering process. One hundred authentic and one hundred forged videos are used to train and test the method. This method achieved forged frame accuracy (FFACC), pristine frame accuracy (PFACC), frame accuracy, precision, recall and F1 scores of 89.90%, 98.45%, 96.79%, 97.31%, 91.05% and 94.07%, respectively. Kono et al. [94] combined a CNN and a recurrent neural network to detect video forgery. The authors also developed their own dataset of 89 forged videos, named Inpainting-CDnet2014, and a dataset of 34 forged videos, named Modification Database. The method obtained an area under curve (AUC) of 0.977 and an equal error rate (EER) of 0.061. Avino et al. [81] performed detection using auto-encoders and a recurrent neural network. The authors used only 10 videos for experiments. The receiver operating characteristic (ROC) curve was obtained to investigate the performance of the method. Kaur et al. [116] developed an inter-frame forgery detection method based on a Deep Convolutional Neural Network (DCNN). The method classifies the forged and authentic video frames on the basis of correlation. The system was evaluated on the REWIND and GRIP video datasets and achieved 98% accuracy. The method has significant accuracy; however, there is a need for cross validation to ensure generalization. Aditi et al. [114] developed a spatiotemporal video forgery detection and localization technique based on CNN. Video frames are detected as tampered or authentic using a temporal CNN; subsequently, the forgery in video frames is located using a spatial CNN. Motion residual is used to train the model. The method was evaluated on the SYSU-OBJFORG dataset and achieved comparable results. Although the method has significant performance, there is still a need for cross-dataset validation.
The algorithms of this class give high-dimensional features and achieve suitable accuracy; however, small tampered regions cannot be detected by the algorithms developed so far.
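The three pre-processing ideas attributed to Yao et al. [79] above (absolute frame difference, max pooling, high-pass filtering) can be sketched in plain NumPy as follows. This is only an illustration of the operations themselves: the 2 × 2 pooling window and the 3 × 3 zero-sum kernel are assumptions, the CNN classifier is omitted, and nothing here reproduces the authors' actual layer parameters.

```python
import numpy as np

# Assumed 3x3 zero-sum kernel; the filter used in [79] may differ.
HIGH_PASS = np.array([[-1, -1, -1],
                      [-1,  8, -1],
                      [-1, -1, -1]], dtype=np.float64)

def abs_frame_difference(video):
    """|F_{i+1} - F_i| for consecutive frames; suppresses static content."""
    v = video.astype(np.int16)          # avoid uint8 wrap-around
    return np.abs(v[1:] - v[:-1])

def max_pool(frame, size=2):
    """Non-overlapping size x size max pooling (trailing rows/cols dropped)."""
    h2, w2 = frame.shape[0] // size, frame.shape[1] // size
    trimmed = frame[:h2 * size, :w2 * size]
    return trimmed.reshape(h2, size, w2, size).max(axis=(1, 3))

def high_pass(frame, kernel=HIGH_PASS):
    """Valid-mode 2D correlation with a high-pass kernel."""
    kh, kw = kernel.shape
    out_h, out_w = frame.shape[0] - kh + 1, frame.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * frame[i:i + out_h, j:j + out_w]
    return out

def preprocess(video):
    """Frame difference -> max pooling -> high-pass residual, per frame pair."""
    diffs = abs_frame_difference(video)
    return np.stack([high_pass(max_pool(d).astype(np.float64)) for d in diffs])
```

For a perfectly static scene the output is all zeros, so any non-zero response marks a region that changed between frames and may deserve closer inspection by the classifier.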

Methods Based on Camera Source
During court proceedings, when a video is presented as proof, there is a need to identify the camera that recorded the presented video. In such situations, source camera-based features are used by forensic experts. If a source camera exists, then active forgery detection techniques will be used. Otherwise, camera-based features such as fixed pattern noise (FPN), photo response non-uniformity noise (PRNU) and sensor pattern noise (SPN) are calculated from the presented video and used for forgery detection. Different authors have used camera noise characteristics to detect spatial (object-based) forgeries [29,33,34]. Hsu et al. [29] detected forgery by calculating the noise residual from high-frequency bands, wavelet coefficients and Bayesian classifiers. The authors used three videos; each one had 200 frames with a still background and each was captured using a JVC GZ-MG50TW digital camcorder. The frame rate is 30 fps. The frame resolution and bitrate are 720 × 480 pixels and 8.5 Mbps, respectively. The recall, precision, miss rate and false positive rates are 96%, 55%, 32% and 4%, respectively. This study did not localize the forged region, the dataset is limited, and the videos were prepared under a controlled environment. Kobayashi et al. [33] detected forged regions by determining the inconsistencies between noise characteristics of different video frames. In this work, a Point Grey Flea digital camera was used with 128 grayscale frames. The frame rate and resolution are 30 fps and 640 × 480 pixels, respectively. During recording, the camera and object are stationary. The recall and precision are 94% and 75%, respectively. Furthermore, the authors used a limited video dataset to test the proposed technique and did not detect temporal forgery. The proposed algorithm worked only for grayscale videos. For detection of region tampering in videos, a technique based on utilization of extrinsic camera parameters was developed by Hu, Ni et al. in [126]. In the first step, each frame of the video is divided into different regions, followed by the computation of extrinsic parameters from these regions of frames. Then differences between these parameters are calculated. Lastly, a threshold is selected to identify the tampering. Fayyaz et al. [113] developed a video tampering detection method based on sensor noise patterns of video frames. The noise patterns were extracted by denoising video frames; subsequently, the noise patterns were averaged to obtain sensor noise patterns. Locally adaptive DCT (Discrete Cosine Transform) was used to determine the sensor noise patterns. Finally, the correlation of noise residues of different video frames was computed to classify the video as authentic or forged. The method was evaluated using a noise pattern-based dataset and achieved suitable results, but these results depend upon the physical properties of the source device.
Although the algorithms of this category perform well, they are dependent on the hardware.
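The general idea behind the noise-residue methods above (denoise each frame, average the residuals into a reference pattern, then correlate) can be sketched as follows. The 3 × 3 mean filter is an assumed stand-in for the locally adaptive DCT denoiser mentioned for [113], and the function names are hypothetical; the sketch shows the principle, not any published implementation.

```python
import numpy as np

def mean_filter(frame):
    """3x3 mean filter with edge replication (assumed simple denoiser)."""
    padded = np.pad(frame, 1, mode="edge")
    h, w = frame.shape
    acc = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            acc += padded[i:i + h, j:j + w]
    return acc / 9.0

def noise_residual(frame):
    """Residual = frame minus its denoised version."""
    f = frame.astype(np.float64)
    return f - mean_filter(f)

def normalized_correlation(a, b):
    """Zero-mean normalized correlation between two residual maps."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def frame_correlations(video):
    """Correlate each frame's residual with the averaged reference pattern."""
    residuals = np.stack([noise_residual(f) for f in video])
    reference = residuals.mean(axis=0)
    return [normalized_correlation(r, reference) for r in residuals]
```

Frames whose correlation with the reference falls well below that of the others are candidates for having come from a different source or having been altered; the decision threshold would have to be tuned per device, which reflects the hardware dependence noted above.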

Methods Based on Pixels and Texture Features
A basic element of the frame (image) is called a pixel. The color model of the frame (image) is defined based on the number of bits per pixel. Various color models are used in digital media, such as RGB (Red-Green-Blue), YCbCr (Y is the luminance, and Cb and Cr are the blue and red chroma components, respectively), HSI (Hue-Saturation-Intensity), CMY (Cyan-Magenta-Yellow), etc. Different types of information (such as color, gamma, intensity, hue, contrast, etc.) can be calculated mathematically from these color models. Several types of pixel-based features, such as HOG (Histogram of Oriented Gradients) and LBP (Local Binary Pattern), can be calculated to detect passive forgery [52]. Subramanyam et al. [41] exploited compression features and HOG to detect spatial forgery. In this approach, the authors used 6000 frames from 15 different videos for spatial forgery and 150 GOPs (Groups of Pictures) of size 12 frames each for temporal forgery. The original video is compressed at 9 Mbps using the MPEG-2 video codec. Spatial tampering is carried out by copying and pasting regions of size 40 × 40 pixels, 60 × 60 pixels and 80 × 80 pixels in the same and different frames. Detection accuracy (DA) is 80%, 94% and 89% for 40 × 40 pixels, 60 × 60 pixels and 80 × 80 pixels blocks, respectively. This technique detected spatial forgery with better accuracy, but training and testing are performed on a small dataset. There are certain limitations of this algorithm, i.e., it failed to detect forgery when post-processing operations such as scaling and rotation were applied to forged regions. Moreover, this technique was unable to localize the forged regions. Al-Sanjary et al. [107] exploited inconsistency in optical flow to detect and localize the copy-move forged region. This study used nine videos to test the method and achieved 96% accuracy. The performance of the method is not sufficient in high-resolution videos.
The algorithms of this class are simple, and length of feature vectors is small. However, these algorithms do not perform well when various post-processing operations are applied to hide the forgery.
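As an illustration of why gradient-histogram features match copied blocks but can break under rotation, the sketch below computes a simplified single-cell HOG descriptor. It is an assumption-laden reduction (one cell per block, nine unsigned orientation bins, L2 normalization) and not the full HOG pipeline or the method of [41]; the function names are hypothetical.

```python
import numpy as np

def hog_descriptor(block, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations."""
    b = block.astype(np.float64)
    gy, gx = np.gradient(b)                      # derivatives along rows, cols
    magnitude = np.hypot(gx, gy)
    angle = np.mod(np.arctan2(gy, gx), np.pi)    # orientation in [0, pi)
    bins = np.minimum((angle / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=magnitude.ravel(),
                       minlength=n_bins)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

def descriptor_distance(block_a, block_b):
    """Euclidean distance between the descriptors of two blocks."""
    return float(np.linalg.norm(hog_descriptor(block_a) - hog_descriptor(block_b)))
```

An exact copy of a block yields a distance of zero, whereas rotating the block by 90 degrees moves the gradient energy into a different orientation bin, which is consistent with the reported failures when scaling or rotation is applied to the forged region.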

Methods Based on SVD
SVD is a factorization technique that extracts geometric features. This algorithm is widely used to detect copy-move tampering because of its invariance to scaling and rotation. Su et al. [64] extracted features from a difference between frames using the K-SVD (K-Singular Value Decomposition) algorithm. Features are then randomly projected to reduce their dimension. K-means clustering is applied to the reduced features to detect spatial forgery. In total, 700 videos were prepared using SONY DSC-P10 at 25 fps and acquired at 3 Mbps for experimentation. Videos were forged using the Mokey 4.1.4 tool. The accuracy, precision and recall rates for this approach are 89.6%, 89.9% and 90.6%, respectively. This approach did not localize the forged regions.
Although the algorithms of this category have simple and small feature vectors, they cannot cope with all types of post-processing operations.
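The invariance property that motivates SVD-based features can be checked with a short sketch. It uses the plain singular values of a pixel block, which is an assumption made for illustration and is not the K-SVD dictionary learning used in [64]. Exact invariance holds for flips and 90-degree rotations; normalizing by the sum of singular values additionally removes a uniform intensity scaling. Rotation by arbitrary angles involves interpolation and would hold only approximately.

```python
import numpy as np

def svd_feature(block, k=4):
    """First k singular values of a block, normalized to sum to one."""
    s = np.linalg.svd(block.astype(np.float64), compute_uv=False)
    total = s.sum()
    if total > 0:
        s = s / total
    return s[:k]
```

Two blocks whose feature vectors nearly coincide are copy-move candidates even if one of them has been flipped, rotated by a multiple of 90 degrees, or uniformly brightened.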

Methods Based on Compression
Storage space requirements can be optimized with the compression of videos. During the compression process, different types of artifacts are acquired, such as quantization, properties of a group of pictures (GOP), motion vector, etc. These artifacts can also be used for the detection of spatial forgery. Labartino et al. [46] explored video frames using Double Quantization (DQ) to detect spatial forgery. This method works on the assumption that the video is forged (by changing the contents of a group of frames) before the second compression takes place. In [69], Tan et al. developed an approach for automatic identification of object-based forgery in videos encoded with advanced video encoding standards based on their GOP structure. Video clips of two categories are used; one category is pristine frames and the second is double compressed frames, which have undergone re-compression after manipulation. A CC-PEV feature extractor extracts features that are used by an ensemble classifier to classify each frame as pristine or forged based on double compression. The final decision rule is that if all I- and P/B-frames of at least one GOP are forged, the video clip is considered forged. The evaluation was performed on the SYSU-OBJFORG dataset, but this dataset is not publicly accessible to the research community. The proposed approach achieved 80% accuracy.
Bakas et al. in [95] presented a forensic solution to detect and localize double compression-based forgery in MPEG videos by exploiting its I-frames. They introduced a CNN architecture that exploits the fact that double compression introduces specific artifacts in the DCT coefficients of the I-frames of an MPEG video. The model was tested on 20 YUV sequences in CIF of size 352 × 288 pixels taken from the video TRACE library, available online at http://trace.eas.asu.edu/yuv (accessed on 20 November 2021). They achieved detection and localization accuracy of 90% and 70%, respectively. This method has high computational complexity.
This class of algorithms depends upon the inherent attributes of cameras instead of estimating the actual inconsistencies and discontinuities in tampered videos that occurred during the forgery.

Methods Based on Statistical Features
Tone, context, and texture are the major parts of any frame (image). Texture is always present in a frame, and spatial video tampering changes it. Statistical features can be used to describe this texture [127,128]. Many researchers have detected object-based video forgery [27,49,53] with statistical features. Richao et al. [53] employed statistical features for the detection of spatial tampering. First, four moments of the wavelet and the average gradient of each color channel are calculated. These features are fed to an SVM, which is trained to classify videos as forged or original. A set of twenty videos with a resolution of 200 × 240 pixels was used for the experiment. The accuracy and AUC attained are 95% and 0.948, respectively. An outcome of 85.45% is represented by the ROC curve. These results were obtained on a limited dataset, and no experiment was performed on videos having different compression ratios. Su et al. [86] detected duplicated regions by using exponential Fourier moments (EFMs), and tampered regions were localized by utilizing the adaptive parameter-based compression tracking algorithm. This method achieved a detection accuracy of 93.1%.
The algorithms of this class are based on statistical features. Their feature vectors are short compared to those of other categories of algorithms, but these methods are unable to detect forgery in the presence of different types of post-processing operations. A summary of different spatial forgery techniques is shown in Table 3.
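To make the idea of a short statistical feature vector concrete, the following sketch computes four moments and an average gradient for one color channel. It is only an illustration: the moments are taken directly from pixel values rather than from wavelet coefficients, and the function names and the forward-difference gradient definition are assumptions, not the exact features of [53].

```python
import math
from typing import List, Sequence


def four_moments(values: Sequence[float]) -> List[float]:
    """Mean, variance, skewness and kurtosis (population definitions)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0:
        return [mean, 0.0, 0.0, 0.0]
    std = math.sqrt(var)
    skew = sum(((v - mean) / std) ** 3 for v in values) / n
    kurt = sum(((v - mean) / std) ** 4 for v in values) / n
    return [mean, var, skew, kurt]


def average_gradient(channel: List[List[float]]) -> float:
    """Mean magnitude of forward differences over a 2-D channel (>= 2x2)."""
    h, w = len(channel), len(channel[0])
    total = 0.0
    for y in range(h - 1):
        for x in range(w - 1):
            dx = channel[y][x + 1] - channel[y][x]
            dy = channel[y + 1][x] - channel[y][x]
            total += math.sqrt((dx * dx + dy * dy) / 2.0)
    return total / ((h - 1) * (w - 1))


def channel_features(channel: List[List[float]]) -> List[float]:
    """Five-element feature vector for one channel of one frame."""
    flat = [p for row in channel for p in row]
    return four_moments(flat) + [average_gradient(channel)]


print(channel_features([[0, 1], [2, 3]]))
```

Concatenating such vectors over the three color channels gives 15 values per frame, which could then be passed to any standard classifier such as an SVM.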

Discussion and Analysis of Spatial Video Tampering Detection Techniques
Working with videos is not as easy as working with images because videos have their own set of complexities. A main limitation of many state-of-the-art approaches is the lack of cross-dataset validation or validation on realistically forged videos. Every technique presented in the literature is designed to deal with one type of forgery. As of now, there is no universal tool for video tampering detection. Hence, to provide a practically applicable solution to forgery detection and localization challenges, a comprehensive, economically feasible and versatile forensic system is needed. Such a system would combine different kinds of video forgery detection techniques, with each specialized technique responsible for detecting the types of forgery it was developed to tackle. In comparison to the image forensic domain, the video forensic domain is seriously underdeveloped, and research in this field is required.

Review of Temporal (Inter-Frame) Video Tampering Detection Techniques
Forgers tamper with a video temporally by inserting, duplicating, deleting or swapping frames. State-of-the-art temporal tampering (forgery) detection algorithms have been proposed [12,28,31,34,40,43,44,51,53-56,65,66,129]. These methods are analyzed in this review. The algorithms used to detect temporal forgery can be divided into the following categories, as shown in Figure 6: (i) methods based on statistical features, (ii) methods based on a frequency domain, (iii) methods based on residual and optical flow, (iv) methods based on pixels and texture features, (v) methods based on deep learning and (vi) others. Each category is explained in detail in the following subsections.

Methods Based on Statistical Features
When a forger tampers with a video, its statistical properties are disturbed, and the tampering can be detected by investigating these properties. Wang et al. [28] used the correlation between frames of a video to detect duplicated frames, with accuracy and false positive rate as evaluation measures. The algorithm was evaluated using only two videos recorded by a SONY-HDR-HC3, with 10,000 frames each. One video sequence was recorded with the camera placed on a tripod and kept stationary throughout the recording, and the second was recorded with a hand-held moving camera. Average detection accuracies of 85.7% and 95.2% were achieved for the stationary and moving cameras, respectively, while the false positive rates were 0.06 and zero, respectively. The algorithm was evaluated on a very small dataset and is unable to detect videos forged by frame insertion or deletion. Wang et al. [54] identified forgery by calculating the Consistency of Correlation Coefficients of Gray Values (CCCoGV) between frames and used an SVM for classification. This technique did not localize the forged region, and the video dataset is also limited. No results were reported for different compression rates. The accuracy is 96.21% for the insertion and deletion of 25 frames and 95.83% for the insertion and deletion of 100 frames. Singh et al. [98] exploited the mean of each DCT vector of every frame and correlation coefficients to detect duplicated frames and duplicated regions. Accuracies of 96.6% and 99.5% were achieved for the detection of duplicated regions and frames, respectively. This method requires high computational time and is not able to detect small numbers of duplicated frames or small duplicated regions.
Huang et al. [117] proposed the Triangular Polarity Feature Classification (TPFC) framework to detect frame insertion and deletion forgeries in videos. The input video is divided into small overlapping groups of frames. Each frame is divided into blocks, and the Block-Wise Variance Descriptor (BBVD) is then applied to groups of frames to compute the ratio of BBVD. Finally, to classify a video as authentic or forged, gross error detection from probability theory is employed. The framework was evaluated on 100 videos and achieved 98.26% recall and 95.76% precision. The framework also achieved 91.21% localization accuracy. Although the framework has reasonable results, cross validation is not explored, which is the ultimate way to expose the strengths and weaknesses of any video forgery detection system.
The algorithms of this class are based on statistical features and have a feature vector of small length, but they are not able to detect forgery in the presence of different types of compressions.
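The correlation-based idea behind duplicate-frame detection can be sketched as follows. This is a simplified illustration, not the algorithm of [28] or [98]: frames are assumed to be flattened lists of gray values, every pair of frames is compared exhaustively, and the threshold of 0.999 is an arbitrary assumption.

```python
import math
from typing import List, Sequence, Tuple


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant frame carries no correlation information
    return cov / math.sqrt(var_a * var_b)


def find_duplicate_pairs(
    frames: List[Sequence[float]], threshold: float = 0.999, min_gap: int = 1
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), j >= i + min_gap, whose correlation is very high."""
    pairs = []
    for i in range(len(frames)):
        for j in range(i + min_gap, len(frames)):
            if pearson(frames[i], frames[j]) >= threshold:
                pairs.append((i, j))
    return pairs


frames = [
    [10, 20, 30, 40],
    [12, 25, 28, 45],
    [40, 10, 35, 5],
    [10, 20, 30, 40],  # copy of frame 0
]
print(find_duplicate_pairs(frames))
```

The exhaustive comparison is quadratic in the number of frames, which is consistent with the high computational time reported for correlation-based methods. In static scenes, genuine neighbouring frames can also correlate very strongly, so a larger `min_gap` or an additional check would be needed in practice.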

Methods Based on Frequency Domain Features
Discrete Cosine Transformation (DCT), Discrete Wavelet Transformation (DWT) and Fast Fourier Transformation (FFT) are widely used to transform into frequency domain before extraction of features. These techniques are used to verify the small changes. Su et al. [31] utilized Motion-Compensated Edge Artifact (MCEA) and DCT on GOP for detection of video forgery by means of frame deletion. In this research work, five videos, "Bus", "Stefan", "Foreman", "Mother-daughter" and "Flower" were used. TM5 (Test Model 5) was selected as the standard MPEG-2 codec. Consecutive frames in the range of 3, 6 and 9 are deleted from the original video sequences. Videos sequences are encoded on a constant bit-rate ranging from 3 Mbits/s to 9 Mbits/s. Dong et al. [40] also used MCEA to detect the frame deletion based forgery. FFT spikes were used after double MPEG compression. In this study, four videos, "carphone", "container", "hall" and "mobile" with CIF and QCIF format were used. The third, sixth, ninth, twelfth and fifteenth frames are deleted and saved with 15 GOPs. The dataset used in this study is limited in size and the localization of the deleted frames was not exercised. Jaiswal et al. [12] extracted features through DCT, DFT and DWT from Prediction Error Sequence (PES) techniques and classification is performed through SVM and Ensemble-based classifier. This algorithm is unable to detect which frames underwent post-processing operations, such as geometrical transformations. Huang et al. [89] fused audio channels for video forgery detection, where discrete packet decomposition and analysis of singularity points of audio are used to locate forged points. Features are extracted by perceptual hash and Quaternion Discrete Cosine Transform (QDCT) to locate the forgery position in the video. 
The proposed technique is evaluated by creating a database of forged videos, which are taken from SULFA (Surrey University Library for Forensic Analysis), Open Video Project digital video collection (OV) and self-recorded videos. Precision and recall rates without fine detection were 0.83 and 0.80, respectively, and with fine detection, these rates were 0.9876 and 0.9867, respectively. The restriction is that an audio file is required with the video, which is not always available. Wang et al. [115] proposed a video forgery detection method based on Electric Network Frequency (ENF). A cubic spline was used to generate suitable datapoints of the ENF signal. The forgery in a video was located using phase continuity interruption, which was observed using the correlation between adjacent datapoints of the ENF signal. The method performs adequately in detecting frame deletion, duplication, and insertion. The method was evaluated on a limited dataset.
The algorithms based on frequency domain features, i.e., DCT, FFT and DWT, are simple, and the length of the feature vector is small. However, these algorithms are hardware-dependent because the noise is used as a clue for forgery.
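As an illustration of how a frequency-domain view exposes periodic artifacts, the sketch below takes a per-frame measurement sequence and finds its dominant period with a plain discrete Fourier transform. The input sequence is synthetic, and treating it as a stand-in for a per-frame artifact measure such as MCEA or prediction error is an assumption; none of the reviewed methods is reproduced here.

```python
import cmath
import math
from typing import List, Sequence


def dft_magnitudes(seq: Sequence[float]) -> List[float]:
    """Magnitudes of DFT bins 0..n/2 of the mean-removed sequence."""
    n = len(seq)
    mean = sum(seq) / n
    centered = [s - mean for s in seq]
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        mags.append(abs(acc))
    return mags


def dominant_period(seq: Sequence[float]) -> float:
    """Period (in frames) of the strongest non-DC spectral component."""
    mags = dft_magnitudes(seq)
    if len(mags) < 2:
        raise ValueError("sequence too short for spectral analysis")
    k = max(range(1, len(mags)), key=lambda i: mags[i])
    return len(seq) / k


# Synthetic per-frame measurements with a spike on every third frame.
sequence = [1.0 if t % 3 == 0 else 0.1 for t in range(24)]
print(dominant_period(sequence))
```

A detector built on this idea would compare the strength of such a peak against the rest of the spectrum; the decision rule and its threshold are left open here.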

Methods Based on Residual and Optical Flow
Optical flow is a technique that can be calculated by estimating the apparent velocities of movement of brightness patterns from a frame of videos. Similarly, motion residual can also be calculated to estimate the motion in a video [130]. These characteristics can also be useful to detect modifications in a video. Shanableh et al. [44] extracted features based on prediction residuals, a percentage of intra-coded macro-blocks, quantization scales and reconstruction quality of a video. Feature dimension is reduced using Spectral Regression Discriminant Analysis (SRDA). K-Nearest Neighbor (KNN), Support Vector Machines (SVM) and Logistic Regression are used to detect the accuracy of the algorithm. The author used 36 video sequences for testing the proposed work with deletion of 1 to 10 frames. The true positive rates of 94% and 95.4% were claimed using SVM classifier with CBR and VBR, respectively, and false positive rates of 5.5% and 8.2% were achieved by using SVM classifier with CBR and VBR, respectively. The algorithm was tested on limited compression rates. Chao et al. [43] detected frame insertion and deletion by using the fluctuation characteristics of optical flow. In this study, test videos are taken from KTH database and TRECVID Content-Based Copy Detection (CBCD) scripts are used for insertion of frames. Similarly, the CBCD script is used for the deletion of frames. This research detected both types of forgery but has not been tested on different compression ratios. The recall and precision are 95.43% and 95.34%, respectively. Feng et al. [55] proposed an algorithm based on the total motion residual of video frames to detect the frame deletion point. The algorithm is tested on 130 raw YUV tampered videos and made with 5, 10, 15, 20, 25 and 30 deleted frames. True positive and true negative rates were 90% and 0.8%, respectively. The algorithm localized the deletion point but did not consider different compression ratios. 
Fluctuation features were developed by Feng et al. [70] based on frame motion residual to identify frame deletion points (FDP). Post-processing is used to eliminate minor interferences (sudden lighting change, focus vibration, frame jitter). The proposed technique is evaluated on quick and slow-motion videos to detect frame deletion. The TPR (true positive rate) is 90% if 30 or more frames are deleted. Performance decreases if fewer frames are deleted. This approach is not effective for videos with slow-motion content. Kingra et al. [76] proposed a hybrid technique capable of detecting frame insertion, deletion and duplication individually. Multiple features generated by optical flow (OF) and prediction residual (PR) are combined to identify frame-based tampering under some threshold. The proposed algorithm was tested on surveillance videos having a static background and on self-recorded mobile videos. The detection and localization accuracies were 83% and 80%, respectively. This technique can deal individually with frame insertion, deletion, duplication and localization, but did not give satisfactory performance for video sequences that have high illumination. Thorough analysis revealed certain drawbacks. First, this technique was developed for videos having a fixed GOP structure, and it fails when a whole GOP or its multiples undergo some tampering attack. Second, it depends on a number of thresholds that were selected empirically, so it lacks flexibility. Third, the model was tested on self-created video sequences that were not sufficient to provide a precise estimation of the applicability of this technique in real scenarios. Jia et al. [85] also used optical flow sum consistency for the detection of duplicated frames in a video. This study used 115 videos, tampered with 10, 20 and 40 duplicated frames, to test the proposed algorithm. The method performs poorly on videos made by a static camera. Joshi et al. [99] exploited frame prediction error and optical flow to classify authentic and forged videos. Although this method achieved an accuracy of 87.5%, it does not work well for videos shorter than 7 s.
The algorithms of this class are also simple, and the length of feature vector is small; however, they are not able to work on different types of compression rates.
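The intuition behind residual-based detection of a frame deletion point can be shown with a small sketch. It uses the mean absolute difference between consecutive frames as a crude motion residual and flags positions where the residual is much larger than the median of its neighbours. The window size and ratio are arbitrary assumptions, and the sketch is not the method of [55] or [70].

```python
from statistics import median
from typing import List, Sequence


def motion_residuals(frames: List[Sequence[float]]) -> List[float]:
    """Mean absolute difference between each pair of consecutive frames."""
    return [
        sum(abs(a - b) for a, b in zip(f0, f1)) / len(f0)
        for f0, f1 in zip(frames, frames[1:])
    ]


def suspect_points(
    residuals: List[float], window: int = 2, ratio: float = 3.0
) -> List[int]:
    """Indices whose residual exceeds `ratio` times the local median."""
    hits = []
    for i, r in enumerate(residuals):
        lo, hi = max(0, i - window), min(len(residuals), i + window + 1)
        neighbours = residuals[lo:i] + residuals[i + 1:hi]
        if not neighbours:
            continue
        base = median(neighbours)
        if base > 0 and r > ratio * base:
            hits.append(i)
    return hits


# Synthetic clip: brightness rises by 1 per frame; frames 5..9 are then deleted.
original = [[t, t + 1, t + 2] for t in range(12)]
tampered = original[:5] + original[10:]
residuals = motion_residuals(tampered)
print(residuals, suspect_points(residuals))
```

A hit at index i points to the transition between frames i and i + 1 of the tampered clip. Slow-motion content or a static scene yields a near-zero baseline, which is one way to see why such detectors struggle when only a few frames are removed.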

Methods Based on Pixel and Texture
Texture is an important property of images that can be used for different types of classification and identification problems. For texture analysis, pixels are the basic unit. Various texture descriptors are available in the literature that can be used for various tasks. The tampering process also disturbs the texture of the frames of a video, and several authors have used texture features to detect tampering in a video. Zhang et al. [66] used quotients of correlation coefficients among sequential Local Binary Pattern (LBP)-coded frames as features to detect the insertion and deletion of frames. This approach can detect whether forgeries exist, but it cannot differentiate between frame deletion and insertion forgery. Performance degrades if small numbers of frames are inserted or deleted. Additionally, the forged region is not localized. The precision and recall rates are 88.16% and 85.80%, respectively. The proposed work was not tested on videos compressed at different compression rates. Liao and Huang [48] extracted Tamura texture features, which are based on the contrast, orientation and roughness of a video frame, and combined them into a 3D feature vector. The Euclidean distance between the feature vectors of all the frames of a video is calculated to find duplicate frames. The method was tested on 10 videos captured using stationary and moving hand-held cameras having a resolution of 640 × 480 pixels and a frame rate of 25-30 fps. The method obtained a precision of 99.6%. This method is weak at detecting highly similar and duplicated frames having slow sharpness changes. Zhao et al. [88] proposed an algorithm that is divided into two stages. In the first stage, HSV (Hue-Saturation-Value) color histograms are calculated for each frame in a video shot, and similarities between histograms are compared for the detection and localization of tampered frames.
Once the forged position is obtained, in the second stage, the candidate frames are double checked by extracting features through SURF (Speeded Up Robust Features) and FLANN (Fast Library for Approximate Nearest Neighbors) matching as a similarity analysis. This method used 10 video shots of different lengths. The precision, recall and accuracy are used as evaluation measures. The method gives suitable results, but only on a small dataset of 10 shots, and it does not work on grayscale videos. Bakas et al. [100] used Haralick features of a gray-level co-occurrence matrix (GLCM) for the detection of insertion, duplication and deletion of frames. This study used 30 videos tampered with the insertion, deletion and duplication of 10, 20, 30, 40 and 50 frames. Precision, recall and F1 score are used to evaluate the method. The main benefit of the proposed approach is that it does not depend on the size/structure of the GOP or the number of deleted frames. However, this method requires a high execution time and cannot detect frame shuffling forgery. Furthermore, it does not work well in the presence of different compression ratios.
Kharat et al. [112] proposed a video forgery detection and localization method based on motion vector, Scale Invariant Feature Transform (SIFT). The forged video frames were identified using motion vector. SIFT features were computed to compare forged frames. Lastly, RANSAC was utilized to localize the forged region. This method was evaluated both on compressed and uncompressed videos. The method achieved overall 99.8% detection accuracy (DA), which is better as compared to other methods. The method was evaluated on 20 videos downloaded from YouTube. It has reasonable performance on duplicate frame detection and localization; however, the method was evaluated on limited authentic and forged videos. Fadl et al. [111] proposed a framework to detect duplicated and shuffled frames based on temporal average and gray-level co-occurrence matrix. The framework achieved 99% precision even in the presence of post-processing operations with high false positives due to weak boundaries of duplicated frames. The method was evaluated on SULFA and LASIESTA datasets. Shelke and Kasana [120] proposed a passive algorithm that utilizes entropy-based texture features, correlation consistency between entropy coded frames and abnormal point detection to detect as well as localize multiple inter-frame forgeries. A dataset of 30 original and 30 forged videos was prepared by using original videos from SULFA, REWIND and VTL. This dataset is not publicly available. Although detection and localization accuracies are 97% and 96.6% in the case of multiple forgeries, this accuracy is attained on a small dataset of 60 videos.
The techniques in this category produce suitable results; however, their feature vectors are long and their computational complexity is high.
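Since several of the methods above operate on LBP-coded frames, a minimal sketch of the basic 3 × 3 LBP operator may be helpful. The clockwise bit order and the choice to skip border pixels are conventions assumed here; other variants (uniform, rotation-invariant, multi-resolution) exist and are not shown.

```python
from typing import List


def lbp_codes(img: List[List[int]]) -> List[List[int]]:
    """Basic 8-neighbour LBP code for every interior pixel of a gray image."""
    h, w = len(img), len(img[0])
    # Neighbours visited clockwise, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    out = []
    for y in range(1, h - 1):
        row = []
        for x in range(1, w - 1):
            center = img[y][x]
            code = 0
            for bit, (dy, dx) in enumerate(offsets):
                if img[y + dy][x + dx] >= center:
                    code |= 1 << bit
            row.append(code)
        out.append(row)
    return out


def lbp_histogram(img: List[List[int]]) -> List[float]:
    """Normalized 256-bin histogram of the LBP codes of one frame."""
    codes = [c for row in lbp_codes(img) for c in row]
    hist = [0.0] * 256
    for c in codes:
        hist[c] += 1.0 / len(codes)
    return hist


frame = [
    [1, 2, 3],
    [8, 5, 4],
    [7, 6, 9],
]
print(lbp_codes(frame))
```

The LBP-coded frames, or their histograms, can then be compared across consecutive frames, for example with correlation coefficients, to look for discontinuities introduced by inserted or deleted frames. The 256-bin histogram per frame also illustrates why texture-based feature vectors tend to be long.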

Methods Based on Deep Learning
The use of deep learning in the domain of computer vision encourages researchers and scientists to employ deep learning and machine learning models in the domain of video forensics.
In the past few years, deep learning-based methods such as CNN have attained great success in the domain of image processing and computer vision. The reason is that deep neural networks are capable of extracting problem-specific and complex high-dimensional features to efficiently represent the information needed. Deep learning-based approaches have been used recently in many fields, such as camera model identification [131], steganalysis [132], image manipulation detection [133], image copy-move forgery detection [134] and so on. I3D and Siamese(Resnet152) are used for feature extraction, frame duplication detection and localization in videos by Long et al. [109]. Duplicated frames are distinguished from original frames by an inconsistency detector using I3D. Evaluation was performed on self-recorded iPhone videos, VIRAT [135], and Media Forensics Challenge dataset (MFC18), which is not publicly available. Accuracy of 81% and 84% is obtained in case of iPhone and VIRAT videos while the MCC (Matthews Correlation Coefficient) scores for MFC-dev and MFC-eval set were 0.66 and 0.36, respectively. This technique is capable of detecting just one type of temporal tampering; other manipulation tasks are not carried out, such as frame dropping, frame shuffling, frame rate variations, and effect of various video codecs on algorithm accuracy. Zampoglou et al. [106] explored the potential of two novel filters based on DCT and video requantization error. The output of these filters is used to train deep learning model CNN to discriminate authentic videos from tampered. The model is evaluated on two datasets, one is provided by the NIST 2018 Media Forensics Challenge, and the second is InVID Fake Video Corpus. The accuracy is 85% when training and testing are performed on the same MFC dataset and 60% when testing is performed on the videos of the FVC dataset. Availability of annotated data is one major requirement in this approach, and localization is not addressed. 
Johnston et al. [136] developed a framework using a CNN for tampering detection which extracted features from authentic content and utilized them to localize the tampered frames and regions. The CNN was trained to estimate quantization parameters, deblock setting and intra/inter mode of pixel patches from an H.264/AVC sequence with suitable accuracy. These features are used for localization of tampered regions in singly and doubly compressed videos having different bitrates. Fadl et al. [118] proposed a system for inter-frame forgery detection where a video is divided into video shots, and then spatial and temporal information is fused to create a single image of each shot. A pre-trained 2D-CNN model is used for efficient spatiotemporal feature extraction. Then, the structural similarity index (SSIM) is applied to produce deep learning features of a whole video. Finally, they used 2D-CNN and RBF Multiclass Support Vector Machine (RBF-MSVM) to detect temporal tampering in the video. To evaluate the performance of the proposed model, they created their own dataset of 13,135 videos containing three types of forged videos under different conditions, using original videos from the VIRAT, SULFA, LASIESTA and IVY datasets. They achieved TPRs of 0.987, 0.999 and 0.985 for the detection of frame deletion, insertion, and duplication, respectively. Techniques based on deep learning are data-driven (i.e., requiring a large volume of data), and they have the capability to automatically learn the high-dimensional features required to detect tampering in a video.

Others
Some other techniques have also been proposed that do not fit into the above categories. Patel et al. [65] detected temporal forgery based on the EXIF (Exchangeable Image File Format) image tag. By analyzing the difference between consecutive frames of the video, the authors successfully identified the tampered region by using the EXIF tag. Although this method localized the forged region, a large database of EXIF tags is required. Gironi et al. [56] used the Variation of Prediction Footprint (VPF) tool with some changes for detecting frame insertion and deletion. VPF tools are also used for detecting whether the video is encoded or not [42]. This method works for different compression ratios, but it cannot detect frame manipulations when the attacker deletes/inserts a whole group of pictures (GOP). Moreover, the accuracy is 91%, but the dataset for training and testing is limited. To overcome the false detections caused by optical flow features and video jitter noise in inter-frame forgery, Pu et al. [119] proposed a novel framework for the detection of inter-frame forgery in videos with severe brightness changes and jitter noise. A new OF algorithm was introduced to extract stable features of texture changes. It was based on intensity normalization to reduce the impact of illumination noise, and on motion entropy to detect jitter noise. Different thresholds are defined for motion entropy to determine whether a video is jittery or not. Experiments were performed on 200 videos taken from three publicly available datasets: SULFA, the CDNET video library and VFDD video lab. An accuracy of 89% was obtained.
Huang et al. [121] proposed a novel cross-modal system that can detect and localize forgery attacks in each frame of live surveillance videos. They prepared their own dataset by collecting multimodal data of half an hour in total. For intra-frame attack, Faster-RCNN is used to detect and crop a human object out and then replace it with the corresponding blank background segment. Forgery detection accuracy of 95% was found on their test data. No cross-dataset validation has been carried out. The algorithms discussed in this section used different methods for feature extraction and classification. Significant temporal forgery techniques in the literature are summarized in Table 4.

Discussion and Analysis of Temporal Video Tampering Detection Techniques
There exist many models that exploit unique features in videos, such as motion features, noise features, video compression and coding features, color models and GLCM-based features. The current strategies have a few limitations, which open doors for future researchers to overcome these constraints. The existing models are exclusively designed to identify specific types of temporal tampering and operate with some assumptions on selected data. Therefore, the methods developed for a specific type of tampering are incapable of addressing real practical applications due to the diversity in traces left by each type of tampering. There is a serious lack of an efficient approach for the detection of all kinds of video tampering in this domain. Moreover, existing methods are unable to detect tampering if a video has undergone multiple types of tampering attacks.
Many investigators have performed experiments on synthetically doctored videos. While many temporal tampering detection techniques work well on a selected set of videos, they fail to achieve such performances on other unknown video datasets. Moreover, we could not compare the accuracy of these methods because they are evaluated on their own custom-built datasets that satisfy their research assumptions and constraints. In most studies, the efficiency is not reported. Therefore, developing a robust technique for video temporal tampering detection which is capable of detecting all types of temporal tampering and localizing the tampered region is still a cutting-edge research area of video forensics.

Research Challenges
Given our analyses of the existing literature on passive video tampering techniques, this field of research faces the following challenges.

Benchmark Dataset
Performance of every recognition system depends on its training, testing and evaluation. The dataset is the key for proper training, testing and evaluation for any proposed algorithm. To the best of our knowledge, existing video forgery datasets are not appropriate due to being small in size and lacking post-processing operations such as rotation, scaling, blurring, compression, etc. [137]. The details of existing datasets for video forensic analysis are presented in Table 5. Many researchers have developed their own datasets [70,73,85] to conduct experiments for inter-frame forgery detection, but these datasets are not available for other communities/researchers to evaluate the performance of the proposed algorithms. This portrays video tampering detection as a solved problem on specific, self-created, small datasets with high accuracy, which may discourage other researchers from publishing their work with less accuracy. In this regard, a great effort has been made for image forensics and source device identification [138]. On the contrary, no benchmark dataset is available for video forensics. To prepare tampered videos manually is a highly time-consuming process, so many authors used synthetically doctored videos for their experiments, such as Panchal et al. in [139].
Therefore, a benchmark dataset for proper training and testing needs to be developed that could give an unbiased and neutral platform for comparison of various techniques with existing state-of-the-art video tampering (forgery) detection techniques.

Performance and Evaluation
Most video forgery algorithms are based on camera source identification; therefore, the results can be negatively affected by increasing the number of cameras. Moreover, camera source identification methods depend on intrinsic camera hardware features, such as lens and charge-coupled device (CCD) sensor characteristics, which can degrade the performance of the algorithm. Video double compression artifacts add difficulty to the localization of video forgery, especially when the video being analyzed is compressed with a low quality factor, as is seen in most of the recent methods. Similarly, video forgery detection is affected by post-processing operations such as edge blurring, compression, noise, scaling, rotation, etc., which can cause high false positive rates. Most of the existing methods for video forgery detection have no resistance to such post-processing operations. All these aspects degrade the performance of the techniques.
The existing methods are evaluated with different metrics, so they cannot be compared directly with each other. Thus, there is a need for standard evaluation measures based on inconsistent lighting and correlation between pixels, so that comparisons can be easily carried out between different algorithms.
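The metrics that recur in the reviewed studies (accuracy, precision, recall, F1 score, false positive rate and the Matthews Correlation Coefficient) can all be derived from one confusion matrix, which is one way a common reporting format could be obtained. The sketch below uses the standard definitions; the function name and the convention of returning 0.0 for an undefined ratio are assumptions.

```python
import math
from typing import Dict


def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary-detection metrics, with 'forged' as the positive class."""

    def safe(num: float, den: float) -> float:
        return num / den if den else 0.0

    precision = safe(tp, tp + fp)
    recall = safe(tp, tp + fn)  # also called the true positive rate
    return {
        "accuracy": safe(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe(2 * precision * recall, precision + recall),
        "fpr": safe(fp, fp + tn),
        "mcc": safe(
            tp * tn - fp * fn,
            math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        ),
    }


# Hypothetical counts: 55 forged and 45 authentic test videos.
for name, value in detection_metrics(tp=45, fp=5, tn=40, fn=10).items():
    print(f"{name}: {value:.4f}")
```

Reporting the four raw counts alongside any derived score would let later studies recompute whichever metric they need for comparison.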

Automation
Existing methods of video forgery detection and localization are not fully automated and require human interpretation, which results in poor accuracy.

Localization
Video forgery detection tells a user whether a video is authentic or not, but the trustworthiness of forgery detection systems increases when the user also knows which part of the video is forged. Determining the accurate location of video tampering is another big challenge. Some of the developed approaches are capable of localizing the tampered region in a video, but their accuracy rates were inadequate; furthermore, in many studies, little attention has been paid to localizing the tampered region. Moreover, no remarkable results have been observed in existing methods for localizing the traces of forged regions in tampered videos, because existing methods have not properly modeled the structural changes that occur in videos after spatial forgery. For these reasons, accurately localizing the forged region is still a challenge.

Robustness
An algorithm is considered robust if it detects and localizes every type of forgery in general and not only on a certain dataset. Most of the reported algorithms achieve high accuracy on the particular datasets on which they are evaluated but not in general, which makes it difficult to perform comparative analyses among existing techniques. An important limitation of existing methods is the lack of sufficient validation on standardized datasets. Thus, there is a need to establish benchmarks for the detection and localization of all types of forgery in videos, ensuring accuracy high enough for deployment in real practical applications.

Future Directions
A standard dataset may be developed to help the research community train, test and evaluate their algorithms. Video forgery may be detected and localized in the following ways. The whole process of video tampering detection and localization is elaborated in Figure 7. Initially, features can be extracted through different multi-resolution techniques, namely, local binary pattern (LBP) [143], Weber's law descriptor (WLD) [144] and discriminative robust local binary pattern (DRLBP) [145]. Complementary features from these techniques can then be integrated to obtain more discriminative features. Principal component analysis (PCA) can be used to select the most suitable or unique features out of the extracted features [146]. These selected features can then be passed to an SVM to classify the video as forged or authentic [147].
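The first two stages of this pipeline can be sketched as follows, under several assumptions: the basic 3x3 LBP operator stands in for the multi-resolution descriptors, PCA is computed with a plain singular value decomposition, and the frames are random arrays rather than real video. The function names `lbp_histogram` and `pca_project` are hypothetical, and the SVM stage is left to whichever SVM implementation is preferred.

```python
import numpy as np


def lbp_histogram(gray):
    """Normalized 256-bin histogram of basic 3x3 LBP codes for one frame.

    Assumes `gray` is a 2-D grayscale array of at least 3x3 pixels.
    """
    g = np.asarray(gray, dtype=np.int32)
    h, w = g.shape
    center = g[1:-1, 1:-1]
    # Eight neighbors, clockwise from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dr, dc) in enumerate(offsets):
        neighbor = g[1 + dr:h - 1 + dr, 1 + dc:w - 1 + dc]
        # Set this bit wherever the neighbor is at least as bright as the center.
        codes |= (neighbor >= center).astype(np.int32) << bit
    hist = np.bincount(codes.ravel(), minlength=256).astype(np.float64)
    return hist / hist.sum()


def pca_project(features, n_components):
    """Project row-vector features onto their top principal components."""
    x = np.asarray(features, dtype=np.float64)
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Hypothetical input: six random 16x16 "frames" instead of decoded video.
rng = np.random.default_rng(0)
frames = rng.integers(0, 256, size=(6, 16, 16))
features = np.stack([lbp_histogram(f) for f in frames])
reduced = pca_project(features, n_components=2)
```

The rows of `reduced` are the per-frame feature vectors that would be handed to the classifier together with forged/authentic labels.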
Edges carry tampering artifacts and give a better representation of objects. The edge irregularity caused by tampering can be noticed in the chrominance channels. The YCbCr color model was used by Muhammad et al. in [148] as a pre-processing step to extract features from the Cb and Cr channels to represent the structural changes. The reason for extracting features from the Cb and Cr components is to gather discriminative features that represent the edge information caused by tampering, because edges appear sharply in the Cb or Cr channel. Although LBP gives texture information, it fails to capture edge information. Since DRLBP and WLD contain both edge and texture information and produce discriminative features to represent the clues of forgery, they are expected to give more accurate results than LBP in detecting video tampering in the spatial domain. Similarly, the spatial/temporal forged region can be localized by using either block-based or cluster-based techniques.
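A minimal sketch of the chrominance pre-processing step is given below. It assumes the full-range ITU-R BT.601 conversion used by JPEG/JFIF; the exact conversion used in [148] is not stated here, and the helper names `rgb_to_ycbcr` and `chroma_planes` are hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to (Y, Cb, Cr), full-range BT.601.

    Values are returned as floats without rounding or clipping.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def chroma_planes(frame):
    """Split a frame (rows of (R, G, B) tuples) into Cb and Cr planes."""
    cb_plane, cr_plane = [], []
    for row in frame:
        converted = [rgb_to_ycbcr(*pixel) for pixel in row]
        cb_plane.append([c[1] for c in converted])
        cr_plane.append([c[2] for c in converted])
    return cb_plane, cr_plane


# Hypothetical 1x2 frame: a neutral gray pixel next to a pure blue one.
cb, cr = chroma_planes([[(128, 128, 128), (0, 0, 255)]])
```

Neutral gray maps to the chroma midpoint of 128 in both planes, so any structure that remains in Cb or Cr comes from color transitions rather than brightness, which is what the texture descriptors are then applied to.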
Efficiency is another major concern due to the high volume of video frames under observation. For better accuracy and efficiency, deep learning (DL) models such as convolutional neural networks (CNNs), autoencoders or deep belief networks (DBNs) can also be evaluated [149], given their success in artificial intelligence (AI) tasks such as image recognition [150], speech recognition [151] and natural language processing (NLP) [152].
Deep learning [153] has also been applied to predict the activity of potential drug molecules [154], reconstruct brain circuits [155], detect particles online [156], predict the effects of mutations in non-coding DNA on gene expression and disease [157], and many other tasks. CNNs [158] are easier to train than networks with full connectivity between adjacent layers. Major technology companies including Google, Facebook, Yahoo!, Twitter, Microsoft, and IBM have used CNN-based algorithms.
Training CNNs on a large scale is slow; therefore, CNN-oriented hardware chips have been developed by NVIDIA, Mobileye, Intel, Qualcomm, and Samsung to reduce the training time. For better efficiency, the extreme learning machine (ELM) should also be considered. ELM not only achieves state-of-the-art results but also shortens the training time from days (spent by deep learning) to several minutes without sacrificing accuracy. Extreme learning has been successfully applied to tasks such as soft-sensing in complex chemical processes [159], face recognition [160] and many more.
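The short training time of ELM follows from its structure: the hidden-layer weights are drawn at random and never updated, and only the output weights are fitted, in closed form, by least squares. The sketch below illustrates this on a toy XOR problem. The hidden-layer size, the tanh activation and the names `elm_train` and `elm_predict` are assumptions for illustration, not settings taken from [159,160].

```python
import numpy as np


def elm_train(x, targets, n_hidden=20, seed=0):
    """Fit a single-hidden-layer ELM; returns (W, b, beta)."""
    rng = np.random.default_rng(seed)
    # Random input weights and biases stay fixed after initialization.
    w = rng.normal(size=(x.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    hidden = np.tanh(x @ w + b)
    # Output weights: least-squares solution via the Moore-Penrose pseudo-inverse.
    beta = np.linalg.pinv(hidden) @ targets
    return w, b, beta


def elm_predict(x, w, b, beta):
    """Forward pass through the fixed hidden layer and fitted output weights."""
    return np.tanh(x @ w + b) @ beta


# Toy data: XOR, which a purely linear model cannot fit.
x = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
t = np.array([0.0, 1.0, 1.0, 0.0])
w, b, beta = elm_train(x, t)
labels = (elm_predict(x, w, b, beta) > 0.5).astype(int)
```

No gradient descent or iteration is involved; the cost is dominated by one pseudo-inverse of the hidden-layer output matrix, which is why training scales so differently from backpropagation-based deep networks.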
Transfer learning [161,162] is another topic of ongoing interest in the machine learning community. It improves learning in a new task, where training data are limited, by transferring knowledge from a related task that has already been learned. This shortage of training data can be due to several reasons, such as data being rare, costly to collect and label, or unavailable. Transfer learning has been successfully applied in many machine learning applications, including image classification [163], human activity classification [164], event classification from video [165], software prediction [166], multi-language text classification [167] and many others. Since benchmark forged video datasets are not available, a learning system for video tampering analysis can be developed through transfer learning techniques by using existing partially or closely related learning models.

Conclusions
Digital video forensics is still in its infancy, and the reliability of digital video as evidence in court is questionable due to tampering (forgery). Numerous video editing tools, such as Adobe Premiere and After Effects, GIMP, and Vegas, are readily available to tamper with videos. Several techniques have been proposed in the literature to detect tampering, and they all suffer from their share of limitations. In this study, we carried out a systematic review of digital video forgery detection techniques and provided answers to the research questions guiding this work. The existing passive video forgery detection and localization techniques are categorized into spatial and temporal techniques. These spatial and temporal techniques are further categorized based on their features. We performed in-depth investigations of the methods, their comparative analysis and the merits and demerits of each category, and we discussed challenges extracted from the video forensics literature. The review of related work illustrates that various features can be exploited to detect and localize forgery. LBP, frame motion residual, noise features, SURF and optical flow give suitable detection accuracy, but their performance is reduced by illumination changes, static scenes, tampering of a small number of frames, video quality and variable GOP sizes. Firstly, even though techniques based on deep learning are convincing, few researchers have adopted them due to the unavailability of large video forgery datasets. Secondly, the detection of inter-frame forgeries has been addressed exclusively, highlighting the need to establish benchmarks for detection and localization of all kinds of temporal tampering in videos by ensuring high accuracy. Thirdly, to the best of our knowledge, no work is available in the public domain that can detect tampering if a video has undergone multiple types of tampering attacks. The detection of multiple types of tampering in a video is an area of research that needs to be explored. Fourthly, manually producing tampered videos is a very time-consuming task, which is why most researchers performed their experiments on synthetically doctored video sequences. Finally, an important limitation of existing methods is the lack of sufficient validation on standardized datasets.

Conflicts of Interest:
There is no conflict of interest with respect to the research.