An Automated Approach for Electric Network Frequency Estimation in Static and Non-Static Digital Video Recordings

Electric Network Frequency (ENF) is embedded in multimedia recordings if the recordings are captured with a device connected to power mains or placed near the power mains. It is exploited as a tool for multimedia authentication. ENF fluctuates stochastically around its nominal frequency at 50/60 Hz. In indoor environments, luminance variations captured by video recordings can also be exploited for ENF estimation. However, the various textures and different levels of shadow and luminance hinder ENF estimation in static and non-static video, making it a non-trivial problem. To address this problem, a novel automated approach is proposed for ENF estimation in static and non-static digital video recordings. The proposed approach is based on the exploitation of areas with similar characteristics in each video frame. These areas, called superpixels, have a mean intensity that exceeds a specific threshold. The performance of the proposed approach is tested on various videos of real-life scenarios that resemble surveillance from security cameras. These videos are of escalating difficulty and span recordings from static ones to recordings, which exhibit continuous motion. The maximum correlation coefficient is employed to measure the accuracy of ENF estimation against the ground truth signal. Experimental results show that the proposed approach improves ENF estimation against the state-of-the-art, yielding statistically significant accuracy improvements.


Introduction
The vast amount of information contained in multimedia content, i.e., audio, image, and video recordings, has prompted perpetrators to commit forgery attacks distorting the digital content. Digital forensics advancements have experienced an exponential growth in the last decades, as digital manipulation methods are constantly evolving and affecting various aspects of social and economic life. To this end, emphasis has been put on advancing emerging technologies in the field of digital forensics, which can efficiently verify the authenticity of multimedia content and cope with multimedia forgeries. A comprehensive survey of image forensics techniques can be found in [1].
In recent years, the Electric Network Frequency (ENF) has been employed as a tool in forensic applications. The ENF is a time-varying signal, which fluctuates around its nominal frequency, i.e., 50 Hz in Europe and 60 Hz in the United States. These fluctuations are due to the instantaneous load differences of the power network (i.e., the power grid). They exhibit an identical trend within the same interconnected network. The ENF is a non-periodic signal, which can act as a fingerprint for digital forensics applications [2]. It can be embedded in digital audio recorded by devices plugged into the power mains or by devices placed near the electric outlets and power cables. The ENF can be captured in video recorded in indoor environments due to fluorescent light. Illumination intensity variations resemble ENF variations in the power grid [3]. Thus, ENF estimation can be exploited for multimedia authentication, timestamp verification, and forgery detection in audio and video recordings. Until recently, the research has mainly been focused on audio recordings, where many advances have been achieved.
To begin with, let us briefly survey ENF estimation in audio recordings, because the same ENF estimation methods are also applied to a one-dimensional (1D) time-series extracted from video recordings. A comprehensive study addressing the ENF detection problem was presented in [4], where many practical detectors were introduced. The detectors were shown to have a reliable performance in relatively short recordings, enabling accurate ENF detection in real-world forensic applications. An alternative to the conventional Short-Time Fourier Transform (STFT) is advanced spectral estimation [5], offering high resolution at the expense of increased computational complexity. For example, an iterative adaptive approach combined with dynamic programming was applied to frequency tracking. An optimized maximum-likelihood estimator for ENF estimation was proposed by employing a multi-tone harmonic model [6]. Multiple harmonics were combined to provide a more accurate estimation of the ENF signal and the Cramér-Rao bound was used to bound the variance of the proposed estimator. Following the same reasoning, a spectral estimation approach was presented in [7], combining the ENF at multiple harmonics. Each harmonic was weighted depending on its signal-to-noise ratio (SNR). A pre-processing approach was proposed in [8] that was based on robust principal component analysis to reduce noise interference and to enable accurate ENF estimation. There, a weighted linear prediction approach was also employed for ENF estimation. In [9], a lag window was designed to offer an optimal trade-off between smearing and leakage by maximizing the relative energy in the main lobe of the window. It was incorporated in the Blackman-Tukey method, offering accurate ENF estimation with low computational requirements. A Fourier-based algorithm for high-resolution frequency estimation was introduced in [10]. Specific spectral lines were taken into consideration instead of the entire frequency band.
In [11], a comprehensive study of the parameters that affect ENF estimation accuracy was undertaken. In the pre-processing stage, signal filtering and temporal window choice were found to be critical in delivering accurate estimation results. A fast version of Capon spectral estimator based on Gohberg-Semencul factorization was presented in [12]. That method along with the use of a Parzen temporal window led to accurate ENF estimation. To address the problem of noise and interference, frequency demodulation was employed for ENF estimation [13]. Several high-resolution frequency estimation methods were discussed in [14]. That work aimed to achieve high performance and to maintain low computational complexity by using as few samples per frame as possible. An integrated and automated scheme for ENF estimation was developed in [15]. A framework for ENF estimation from real-world audio recordings was presented in [16]. First, signal enhancement was proposed, which was based on harmonic filtering. Second, graph-based harmonic selection was elaborated. In [17], a unified approach was proposed to detect multiple weak frequency components under low SNR conditions. Iterative dynamic programming and adaptive trace compensation were employed to identify the frequency components. A multi-tone model for ENF detection applied prior to ENF estimation was presented in [18].
The ENF can also be exploited to detect tampering in multimedia recordings. An edit detection approach taking advantage of the time-varying nature of ENF was proposed in [19]. Multimedia authentication was formulated as a problem of phase change analysis employing the Fourier Transform in [20]. An audio verification system for tamper detection and timestamp verification was proposed in [21]. The system employed absolute-error maps. A tamper detection framework based on support vector machines was introduced in [22]. That framework exploited abnormal ENF variations caused by tampered regions.
In [23], it was demonstrated that the ENF can be exploited to determine the location of recordings even if they are captured within the same interconnected grid. A multi-class machine learning system was proposed to identify region-of-recordings in [24]. It took advantage of features related to ENF differences among power grids without the need for a reference ENF signal. A convolution neural network system was tested for identifying audio recordings that have been recaptured in [25]. The system worked properly for very short audio clips and was able to combine both the fundamental ENF and its harmonics. To cope with noise interference, a filtering algorithm was introduced in [26]. It employed a kernel function to create a time-frequency representation facilitating ENF estimation. The existence of reliable ENF reference databases is critical for multimedia authentication applications. A method to create ENF reference databases based on geographical information systems (GIS) was presented in [27]. Recently, ENF was explored as a tool for device identification [28]. The proposed method was based on the analysis of harmonic amplitude coefficients, which were employed to deliver an accurate identification of acquisition devices. The ENF is a stochastic signal and its values depend on various exogenous and endogenous factors. In [29], a study was carried out on the factors affecting the capture of ENF in audio recordings as well as on the impact of the audio characteristics.
Although significant attention has been paid to ENF estimation in audio recordings, it was found that the ENF can also be traced in video recordings. The ENF can be estimated in videos captured under the illumination of fluorescent bulbs in indoor environments [3]. ENF variations caused by power grid networks affect the illumination intensity, and each frame captures a time-snapshot of ENF. ENF video estimation approaches can be divided into two categories based on the recording sensor type. The first category consists of videos captured by charge-coupled device (CCD) sensors, which employ a global shutter mechanism. This type of sensor instantly captures all pixels of a frame. Thus, each frame depicts a specific time snapshot. When CCD sensors are used, the state-of-the-art approach for ENF estimation is based on averaging all pixels in each frame of static videos [3]. For non-static videos, state-of-the-art ENF estimation suggests averaging all steady pixels in each video frame. The second category consists of videos captured by complementary metal oxide semiconductor (CMOS) sensors. Such sensors employ a rolling shutter mechanism, which acquires a row at a time in each video frame [3,30]. A comprehensive analysis of the rolling shutter effect was conducted in [31]. An analytical model for videos captured using a rolling shutter mechanism was developed, demonstrating the relation between ENF variations and the idle period length. ENF-based video forensics is not trivial, especially for non-static video recordings. ENF presence detection based on superpixels (i.e., multiple pixels) was proposed in [32]. The proposed approach could be applied to static and non-static videos captured by both CCD and CMOS camera sensors. Recently, a method for ENF estimation in non-static videos was presented in [33]. This method could be accurately utilized in video recordings whose frame rate is unknown. The ENF was applied to video recordings for camera identification in [34].
Video synchronization can be efficiently achieved by employing the ENF. Video synchronization methods were developed in [35,36] that were based on ENF signal alignment. A forgery detection algorithm based on ENF signal was proposed in [37] without needing any ground truth signal. A technique to detect false frame injection attacks in video recordings using the ENF was discussed in [38]. ENF was employed to authenticate video feeds from surveillance cameras. ENF estimation and detection in single images captured by CMOS camera sensors constitutes a challenging task. Novel investigations taking into consideration the ENF strength were described in [39]. ENF estimation in videos with a rolling shutter mechanism was presented in [40]. Both parametric and non-parametric spectral estimation methods were combined for accurate ENF estimation.
In this paper, inspired by [32], an automated approach is proposed for ENF estimation from CCD video recordings based on Simple Linear Iterative Clustering (SLIC) [41]. Areas of common characteristics that include superpixels are generated using the SLIC algorithm. The proposed approach takes into consideration only the superpixels whose average intensity exceeds a predefined threshold. It is shown that within these areas, the embedded ENF is not hindered by any interference, resulting in more accurate estimation regardless of whether the video recording is static or not. The novelty of the proposed approach lies in (1) the creation of areas with similar characteristics and (2) the estimation of ENF exploiting only these areas in contrast to what has been achieved for ENF estimation in videos so far. The motivation for the development of the proposed approach is to mitigate the interference and noise caused by textures, shadows, and brightness that are present in real-life applications, such as surveillance videos. By doing so, we advance the related literature, where static videos are mostly used, such as the "white wall" recordings. From a practical point of view, the proposed approach enables automated ENF estimation regardless of whether the video recording is static or non-static. Thus, it can be applied to practical forensics applications, such as multimedia content authentication, indicating the place where a recording was captured, and revealing the time the recording was made. It is worth noting that the proposed approach is tested on real-world static and non-static videos of escalating difficulty in order to simulate real conditions. The maximum correlation coefficient (MCC) between the estimated ENF and the reference signal is employed to measure ENF estimation accuracy. Moreover, hypothesis testing is performed to assess the statistical significance of the improvements delivered by the proposed approach.
The remainder of the paper is organized as follows. Section 2 details ENF fundamentals, and Section 3 presents the proposed approach; Section 4 describes the dataset and discusses the derived results; conclusions, limitations, and future research are drawn in Section 5.

ENF Fundamentals
The ENF was initially introduced by C. Grigoras [2,42] to attest to the authenticity of digital recordings, to determine the time they were recorded, and to indicate the area they were captured. In particular, when it comes to video recordings, ENF estimation can determine whether the multimedia content has undergone major alterations. Moreover, ENF can reveal the area where the indoor video was recorded. When the estimated ENF is compared against a reference ground truth, the time the video was recorded is revealed. The proposed approach aims at improving ENF estimation, whose practical applications fall into forensic science. The importance of ENF is due to its unique properties, which makes it a powerful tool in forensic applications. Once the ENF signal has been estimated, a comparison against a reference ENF database should be made in order to assess estimation accuracy.
The most remarkable properties of the ENF signal are summarized as follows: • The ENF is a non-periodic signal randomly fluctuating around the fundamental frequency. • ENF fluctuations are identical within the same interconnected network. • The ENF signal can also be found in higher harmonics [43].
Many approaches have been proposed to efficiently estimate ENF depending on the particularities of each recording.

ENF Estimation
The ENF is embedded in the electric light signal. Assuming stationarity within short-time segments of the signal, the ENF is modeled as
$$s(t) = A \cos(2\pi f t + \phi), \tag{1}$$
where $f$ is the fluctuating frequency representing the ENF component, $A$ is the signal magnitude, and $\phi$ corresponds to the signal phase. There are more complex ENF models, such as that proposed in [44]. It has been shown recently that ENF traces can be embedded in video recordings due to light intensity variations. Such recordings are captured in the presence of fluorescent light or the light emitted by incandescent bulbs [35]. The light intensity is directly connected to the electric current and its nominal frequency is influenced by the ENF signal, fluctuating at twice the nominal frequency of ENF, i.e., 100 Hz in Europe, and 120 Hz in the United States. The lower temporal sampling rate of cameras capturing video recordings compared to frequency components in light flickering results in a significant aliasing of ENF signals. Thus, ENF is present at different frequencies than those appearing in audio recordings.
These frequencies can be derived by applying the sampling theorem [45]. In addition to the fundamental frequency of power mains, it is the frame rate of the video camera that influences the aliased base frequency of ENF in video recordings [3]. The aliased frequency $f_E$ arising from fluorescent illumination is given as follows [46]:
$$f_E = |f_l - \gamma f_s| < \frac{f_s}{2}, \tag{2}$$
where $f_s$ denotes the sampling frequency of the camera, $f_l$ denotes the frequency of light source illumination, and $\gamma$ denotes an integer. Aliased frequencies of ENF based on different camera frame rates and power main frequencies are listed in Table 1. The ENF estimation procedure in video recordings differs slightly from that employed in audio ones. The difference is in the pre-processing stage. Two cases are examined depending on whether the video recordings are static or non-static. Regarding static videos, the state-of-the-art [3] suggests computing the mean intensity of each frame, transforming the two-dimensional (2D) images into a 1D time-series. It is worth noting that the majority of experiments conducted so far employ static recordings of white wall videos. Here, we employ a variety of static recordings different than white wall videos, as detailed in Section 4.1. Regarding non-static videos, the current practice is to compute the mean intensity of relatively stationary areas of each frame. In both categories, a 1D time-series is formed and the estimation procedure follows that employed for audio recordings. This time-series is treated as a raw signal that is passed through a zero-phase bandpass filter around the frequencies where ENF appears. Specifically, the bandpass edges of the filter are set at 9.9 and 10.1 Hz when the nominal frame rate is 30 Hz despite the fact that the nominal frame rate was claimed to be 29.97 Hz in [33]. The bandpass edges employed herein accommodate also the aliased base frequency, which corresponds to a nominal frame rate of 29.97 Hz.
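The aliasing relation above can be checked numerically. The following is a minimal sketch, assuming that the integer γ is the one nearest to f_l / f_s, which is the choice that keeps the result within half the frame rate; the function name `aliased_frequency` is hypothetical and not part of the original work.

```python
def aliased_frequency(f_light, f_s):
    """Return (f_E, gamma) with f_E = |f_light - gamma * f_s| <= f_s / 2.

    f_light: light flicker frequency in Hz (twice the nominal mains frequency).
    f_s:     camera frame rate in Hz.
    gamma is taken as the integer nearest to f_light / f_s (an assumption).
    """
    gamma = round(f_light / f_s)
    return abs(f_light - gamma * f_s), gamma


if __name__ == "__main__":
    for mains in (50.0, 60.0):
        for frame_rate in (29.97, 30.0):
            f_e, gamma = aliased_frequency(2.0 * mains, frame_rate)
            print(f"mains {mains:g} Hz, {frame_rate:g} fps: "
                  f"gamma = {gamma}, f_E = {f_e:.2f} Hz")
```

For a 50 Hz grid, both 30 fps and 29.97 fps give an aliased base frequency inside the 9.9 to 10.1 Hz passband mentioned above, which is consistent with the filter edges chosen in the text.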
The filtering procedure is of crucial importance in ENF estimation [11]. Subsequently, the signal is split into V overlapping segments of L samples size. Each segment is shifted by S s from its immediate predecessor and is multiplied by an L-size rectangular window. Any temporal window can be employed in the pre-processing procedure. Afterward, the prevalent frequency of each segment is estimated by spectral estimation. Frequently, a quadratic interpolation is used to overcome the interference that hinders the entire procedure and results in more precise ENF estimation [5,9]. Here, the estimated ENF signal f is calculated by employing shifts of 1 s (i.e., S = 1).
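The segmentation and peak-picking steps described above can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: it assumes NumPy, uses a rectangular window, omits the bandpass filtering step, and zero-pads each segment by a factor of 4 (an assumption chosen to give a denser frequency grid). The function names are hypothetical.

```python
import numpy as np


def segment_signal(x, fs, duration_s, shift_s=1.0):
    """Split x into overlapping segments of duration_s seconds, shifted by shift_s seconds."""
    seg_len = int(round(duration_s * fs))
    hop = int(round(shift_s * fs))
    return [x[start:start + seg_len]
            for start in range(0, len(x) - seg_len + 1, hop)]


def peak_frequency(segment, fs, pad_factor=4):
    """Periodogram peak of one segment, refined by quadratic interpolation."""
    n_fft = pad_factor * len(segment)
    power = np.abs(np.fft.rfft(segment, n=n_fft)) ** 2
    # Search interior bins only, so that both neighbours of the peak exist.
    k = int(np.argmax(power[1:-1])) + 1
    a, b, c = power[k - 1], power[k], power[k + 1]
    denom = a - 2.0 * b + c
    delta = 0.0 if denom == 0 else 0.5 * (a - c) / denom  # parabola vertex offset in bins
    return (k + delta) * fs / n_fft


def estimate_enf(x, fs, duration_s, shift_s=1.0):
    """One frequency estimate per segment, i.e., one per shift of shift_s seconds."""
    return np.array([peak_frequency(seg, fs)
                     for seg in segment_signal(x, fs, duration_s, shift_s)])
```

With a shift of 1 s, a 60 s series analysed with 20 s segments yields 41 estimates. Longer segments give a finer frequency grid at the cost of temporal resolution, which is the trade-off explored in the experiments.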

Proposed Method
Here, a video ENF estimation approach for static and non-static video recordings is proposed. It is based on the SLIC algorithm for image segmentation. The SLIC algorithm generates superpixels, which are regions of similar characteristics. The idea behind the proposed approach is that in regions having high luminance levels and not hindered by shadows or dark areas, light source variations can easily be detected, and thus, the ENF signal can be estimated more accurately. The first step of the proposed approach generates $N$ regions with similar characteristics in the first frame of a video recording. Afterward, the mean intensity values $\zeta_n(1)$, $n = 1, 2, \ldots, N$ of all regions in the first frame are computed and only those exceeding a predefined threshold $\tau$ are retained. Let $\zeta(1)$ be the vector with elements $\zeta_n(1)$. If $\tilde{N} = |\{n : \zeta_n(1) > \tau\}|$ denotes the number of regions whose mean intensity exceeds the threshold, then the mean intensity value for the first frame is given as follows:
$$x(1) = \frac{1}{\tilde{N}} \sum_{n=1}^{N} \zeta_n(1)\, u(\zeta_n(1) - \tau), \tag{3}$$
where $u(\zeta_n(1) - \tau)$ denotes the Heaviside function.
In the next step, the generated regions from the first frame are located in all Λ frames of the video recording. For a video recording with a duration of 12 min, Λ = 21,600 frames. Employing these regions, the mean intensity values of the regions are computed and, then, the mean intensity value in each frame is calculated, as in (3). In this way, each video frame is represented by an intensity value x(t), t = 1, 2, . . . , Λ.
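The per-frame computation in (3) can be illustrated with a short sketch. It assumes NumPy and that a label map assigning each pixel to a region is already available (for instance from a SLIC implementation, which is not shown here); the function name `frame_intensity` and the toy values are hypothetical.

```python
import numpy as np


def frame_intensity(frame, labels, tau):
    """Average of the region mean intensities that exceed tau, as in (3).

    frame:  2D array of pixel intensities for one video frame.
    labels: 2D array of non-negative integer region indices, same shape as frame.
    tau:    intensity threshold.
    """
    frame = np.asarray(frame, dtype=float)
    labels = np.asarray(labels)
    counts = np.bincount(labels.ravel())
    sums = np.bincount(labels.ravel(), weights=frame.ravel())
    present = counts > 0                      # ignore label values that never occur
    zeta = sums[present] / counts[present]    # mean intensity of each region
    kept = zeta[zeta > tau]                   # Heaviside selection u(zeta - tau)
    if kept.size == 0:
        raise ValueError("no region exceeds the threshold")
    return float(kept.mean())


if __name__ == "__main__":
    frame = [[10, 10, 200, 200],
             [10, 10, 100, 100]]
    labels = [[0, 0, 1, 1],
              [0, 0, 2, 2]]
    # The dark region 0 (mean 10) is discarded; regions 1 and 2 are averaged.
    print(frame_intensity(frame, labels, tau=50))
```

Applying this function to every frame with the same label map gives the 1D time-series x(t) that is subsequently filtered and analysed.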
A non-parametric method, namely the STFT, and a parametric method, i.e., the Estimation of Signal Parameters via Rotational Invariance Techniques (ESPRIT), were employed for ENF estimation. Hereafter, the frames, indexed by t, will be referred to as samples.
The STFT is one of the most common methods in time-frequency analysis of signals. Assuming stationarity within the short-time segments of the signal, the discrete-time Fourier transform is computed for each time segment [47]:
$$X_l(\omega) = \sum_{t} x(t)\, w(t - lG)\, e^{-j\omega t},$$
where $w(t)$ denotes a window function of length $L$, $X_l(\omega)$ is the discrete-time Fourier transform of the windowed data centered around $lG$, and $G = S f_s$ is the hop size in samples. The proper selection of the window function constitutes a very important issue in STFT and, generally, in the majority of time-frequency analysis methods. This is because an optimal trade-off between time and frequency resolution is sought. Let $\hat{\phi}_l(\omega_\kappa) \propto |X_l(\omega_\kappa)|^2$ be the periodogram of the $L = D f_s$ samples long $l$th segment, where $\omega_\kappa$, $\kappa = 0, 1, \ldots, Q - 1$ are the frequency samples with $Q = 4L$. Specifically, the frequency sample $\omega_\kappa$ that corresponds to the maximum periodogram value is extracted as a first ENF estimate. Afterward, a quadratic interpolation is employed to obtain a refined ENF estimate.
ESPRIT is also employed to estimate the ENF signal. Let $\hat{R}$ be the sample covariance matrix
$$\hat{R} = \frac{1}{L} \sum_{t=m}^{L} \tilde{x}(t)\, \tilde{x}^{\top}(t),$$
where $\top$ stands for transposition and $\tilde{x}(t) = [x(t), x(t-1), \ldots, x(t-m+1)]^{\top}$. Let $\hat{S}$ be the subspace spanned by the $W$ principal eigenvectors of $\hat{R}$. Let $v_\kappa$, $\kappa = 1, 2, \ldots, W$, be the eigenvalues of the estimated matrix $\hat{\phi}$ [48]:
$$\hat{\phi} = (\hat{S}_1^{\top} \hat{S}_1)^{-1} \hat{S}_1^{\top} \hat{S}_2,$$
where $\hat{S}_1 = [I_{m-1}\ \mathbf{0}]\, \hat{S}$ and $\hat{S}_2 = [\mathbf{0}\ I_{m-1}]\, \hat{S}$ consist of the first and the last $m - 1$ rows of $\hat{S}$, respectively. The frequency $-\frac{1}{2\pi} \arg(v_\kappa) f_s$ (in Hz), which is closest to the aliased base frequency, is the ENF estimate. Here, $m = 10$ and $W = 3$.
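The ESPRIT step can be sketched as follows. This is an illustrative sketch of the textbook covariance-based ESPRIT formulation, assuming NumPy and a backward-ordered snapshot vector; it is not claimed to reproduce the authors' MATLAB code. The function name `esprit_frequency` and its `target_hz` argument (the aliased base frequency used to select among candidate frequencies) are hypothetical.

```python
import numpy as np


def esprit_frequency(x, fs, target_hz, m=10, W=3):
    """Estimate the frequency (Hz) closest to target_hz in one segment x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # Snapshot vectors [x(t), x(t-1), ..., x(t-m+1)], one per row.
    snapshots = np.stack([x[t - m + 1:t + 1][::-1] for t in range(m - 1, n)])
    R = snapshots.T @ snapshots / n           # m x m sample covariance matrix
    _, eigvecs = np.linalg.eigh(R)            # eigenvalues in ascending order
    S = eigvecs[:, -W:]                       # W principal eigenvectors
    S1, S2 = S[:-1, :], S[1:, :]              # first and last m-1 rows of S
    # Least-squares solution of S1 @ phi = S2, i.e., (S1^T S1)^{-1} S1^T S2.
    phi, *_ = np.linalg.lstsq(S1, S2, rcond=None)
    v = np.linalg.eigvals(phi)
    freqs = -np.angle(v) * fs / (2.0 * np.pi)
    return float(freqs[np.argmin(np.abs(freqs - target_hz))])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fs = 30.0
    t = np.arange(600) / fs
    x = np.cos(2 * np.pi * 10.03 * t + 0.4) + 1e-3 * rng.standard_normal(t.size)
    print(esprit_frequency(x, fs, target_hz=10.0))
```

A real-valued sinusoid contributes a conjugate pair of eigenvalues, so candidates appear at both positive and negative frequencies; selecting the candidate nearest to the aliased base frequency resolves this ambiguity.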
The proposed approach combines the generation of the mean intensity time-series x(t) with either the ESPRIT or the STFT method. An outlook of the proposed approach is depicted in Algorithm 1.

Algorithm 1 Proposed SLIC-based approach for ENF estimation in video recordings.
Inputs: Number of video frames Λ, number of superpixels N, threshold τ, cut-off frequencies, segment duration L, number of overlapping segments V, ESPRIT parameters m and W, and reference ground truth.
Output: Estimated ENF vector f.
1. Perform SLIC in the first frame of the video recording to generate N regions of similar characteristics and luminance, i.e., superpixels.
2. Compute the mean intensity value ζ n (1) of each generated region.
3. Employ only the mean intensity values of regions exceeding threshold τ in the computation of x(1).
4. Locate the generated regions in the Λ − 1 remaining frames and repeat steps 2-3 to compute x(t), t = 2, 3, . . . , Λ.
5. Having computed the 1D time-series x(t), bandpass filter x(t) using the cut-off frequencies described in Section 2.1.
6. Split the filtered signal into V overlapping segments. Each segment is obtained by multiplying the filtered signal with an L-size rectangular window. Each segment is shifted from its immediate predecessor by S s.
7. In each segment, employ the prevalent frequency derived by the ESPRIT method as the ENF estimate. In the case of STFT, extract the frequency that corresponds to the maximum periodogram value as the ENF estimate.
8. Compute the MCC between the estimated ENF and the reference ground truth.

Evaluation Metric
Having estimated ENF, a matching procedure is applied in order to objectively assess estimation accuracy. Having calculated the reference ENF captured by power mains, the MCC [49] is used to compare the estimated ENF from video recordings against the reference one. Let $f = [f_1, f_2, \ldots, f_K]$ be the estimated ENF signal at each second. Let also $g = [g_1, g_2, \ldots, g_{\tilde{K}}]$ for $\tilde{K} > K$ be the reference ground truth ENF, which is known, and $\tilde{g}(p) = [g_p, g_{p+1}, \ldots, g_{p+K-1}]$ be a segment of $g$ starting at $p$. The following index is determined:
$$p_{\mathrm{opt}} = \arg\max_{p}\, c(p),$$
where $p = 1, 2, \ldots, \tilde{K} - K + 1$ and $c(p)$ is the sample correlation coefficient between $f$ and $\tilde{g}(p)$ defined as:
$$c(p) = \frac{\sum_{k=1}^{K} (f_k - \bar{f})\,(g_{p+k-1} - \bar{g}(p))}{\sqrt{\sum_{k=1}^{K} (f_k - \bar{f})^2}\; \sqrt{\sum_{k=1}^{K} (g_{p+k-1} - \bar{g}(p))^2}},$$
with $\bar{f}$ and $\bar{g}(p)$ denoting the sample means of $f$ and $\tilde{g}(p)$, respectively. The MCC is the value $c(p_{\mathrm{opt}})$. In Section 4.9, Fisher's transformation was employed to assess whether the pairwise differences between the MCC delivered by the proposed approach and that of the state-of-the-art one are statistically significant at a significance level of 5%.
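The matching procedure can be sketched as a sliding search over the reference signal. This is a minimal sketch assuming NumPy and the mean-removed (Pearson) sample correlation coefficient; the function name `max_correlation` is hypothetical and offsets are 0-based, unlike the 1-based index p in the text.

```python
import numpy as np


def max_correlation(f, g):
    """Return (best_offset, mcc) for estimated ENF f against reference g."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    K = f.size
    if g.size < K:
        raise ValueError("the reference must be at least as long as the estimate")
    best_p, best_c = 0, -np.inf
    for p in range(g.size - K + 1):
        c = np.corrcoef(f, g[p:p + K])[0, 1]  # sample correlation coefficient
        if c > best_c:                        # NaN (constant segment) never wins
            best_p, best_c = p, c
    return best_p, float(best_c)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g = 50.0 + np.cumsum(rng.normal(0.0, 1e-3, 500))  # synthetic reference ENF
    f = 2.0 * g[120:220] - 90.0                       # synthetic aliased estimate
    print(max_correlation(f, g))
```

Because the correlation coefficient is invariant to offset and positive scaling, an estimate fluctuating around the aliased base frequency can be compared directly against a reference recorded around the nominal mains frequency.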

Results
The estimation of the ENF signal is significantly affected by the nature of video recordings. In static videos, ENF presence is not affected and, thus, estimation accuracy is much higher than that in non-static videos. There, continuous motion hinders ENF estimation accuracy. Many approaches aim at overcoming this difficulty. For this reason, the state-of-the-art approach for ENF estimation in video [3], which employs intensity averaging with the Multiple Signal Classification (MUSIC) method, examines whether the video to be analyzed is a static or a non-static one. For brevity, from now on, the state-of-the-art [3] approach for both static and non-static videos will be referred to as MUSIC. The proposed approach employs either ESPRIT or STFT after SLIC. The novelty of the proposed approach lies in the fact that CCD sensors capture a time snapshot using a global shutter mechanism, which makes the distinction between static and non-static video obsolete. Thus, the proposed approach is applied regardless of whether the video recording is a static or a non-static one. It is tested on six video recordings of escalating difficulty from the publicly available dataset [50]. These recordings are either static or non-static ones. A reference ground truth signal is also available. The results are compared to those obtained by MUSIC [3]. The video recordings of the dataset employed in the paper are publicly available (https://zenodo.org/record/3549379#.YUIK7bgzaUl, accessed on 8 September 2021).

Dataset
Six different video recordings were captured in Vigo, Spain, at a nominal ENF of 50 Hz. Two different cameras were employed, namely, a GOPRO Hero 4 Black and an NK AC3061-4KN without an anti-flicker filter [50]. The video recordings are named mov i , i = 1, 2, 3, 4, 5, 6, and their types are listed in Table 2.

Table 2. Types of six video recordings employed for ENF estimation.
Video Name    Video Type
mov 1         static
mov 2         static
mov 3         non-static
mov 4         non-static
mov 5         non-static
mov 6         non-static

Recording mov 1 is closer to what is known as "white wall" video in the literature. Going a step further, it depicts a flat colored wall of low brightness. This kind of recording can be exploited to evaluate whether ENF variations can be embedded and, subsequently, estimated in such a static and seemingly noise-free environment. mov 2 is also a static video, which contains regions with different textures, brightness, and shadows. This video is more challenging than mov 1. mov 3 can be categorized as a non-static video. It starts showing a white wall and a wooden table. Then, an object is placed on the table and a human hand rapidly shakes white papers at regular intervals on the right region of the recording. mov 4 is a non-static video, where human movement appears. It is a complex recording and consists of several textures. It takes place within an office, where a human is constantly moving. Both the background wall and the floor are captured. mov 5 constitutes one of the most challenging recordings, which resembles a real-life scene captured by a security camera. It is recorded within the complex environment of a room. The scene contains several objects with different colors and textures. The most significant challenge of mov 5 is that the movement affects the majority of the frames and more than 50% of the pixels of each frame. mov 6 represents another challenging video recording, which contains a constant movement of a person inside a room. The movement takes place close to the camera, affecting most pixels in each frame. In all cases, the camera is fixed. Sample frames of the video recordings are depicted in Figure 1. The estimated ENF signal is compared against a reference ground truth obtained from power mains.

Experimental Evaluation
The approach detailed in Section 2.1 was applied to the six video recordings and the estimated ENF was compared against that delivered by the MUSIC [3] for static and non-static videos. Particularly, for static videos, the state-of-the-art approach [3] suggests averaging intensity values in each frame, while for non-static videos, intensity values are averaged within relatively static regions of each frame. In all comparisons, a rectangular temporal window was employed. The predefined threshold τ was set at MV/3, where MV is the median of the N average intensity values within the generated regions in each frame. All approaches were implemented in MATLAB 2016a. A 64-bit operating system with an Intel(R) Core(TM) i7-5930K CPU at 3.5 GHz was used in the experiments conducted.
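The threshold rule stated above, τ = MV/3, can be illustrated with a few lines. This is a minimal sketch using only the Python standard library; the function name and the toy region means are hypothetical.

```python
from statistics import median


def threshold_from_region_means(region_means):
    """Return tau = MV / 3, where MV is the median of the region mean intensities."""
    return median(region_means) / 3.0


if __name__ == "__main__":
    region_means = [12.0, 30.0, 90.0, 150.0, 210.0]   # toy values for one frame
    tau = threshold_from_region_means(region_means)   # median 90 gives tau = 30
    retained = [z for z in region_means if z > tau]
    print(tau, retained)
```

Tying the threshold to the median makes it adapt to the overall brightness of each frame, so only regions that are dark relative to the rest of the scene are discarded.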

ENF Estimation in Static Video mov 1
The ESPRIT method was tested for ENF estimation in mov 1. The static nature of mov 1 enables an accurate ENF estimation. The proposed approach, which employs the SLIC-based segmentation and intensity averaging, resulted in an MCC of 0.9926, outperforming the MUSIC [3], where the MCC was measured to be 0.9658. When STFT was employed, the MCC was found to be 0.8662. Different segment durations in ENF estimation affect the results obtained. The MCC was computed for various segment durations D, as depicted in Figure 2. When a segment duration of 1 s was employed, the proposed approach using the ESPRIT worked satisfactorily, yielding an MCC of about 0.79, while the MCC was measured to be about 0.5 when the MUSIC [3] was used. The performance of ENF estimation depends also on the filter order ν of the bandpass filter. The MCC is plotted versus various filter orders in Figure 3. The top performance of the proposed approach, employing the ESPRIT, is achieved when ν = 111. Although mov 1 is a trivial recording, the proposed approach offers significant improvements in ENF estimation accuracy against the method in [3].

ENF Estimation in Static Video mov 2
The static recording mov 2 is more challenging than mov 1 due to different textures and various levels of luminance. The STFT was employed for ENF estimation yielding an MCC of 0.9704. The MUSIC [3] resulted in an MCC of 0.9466. The ESPRIT method achieved an MCC of 0.9526. In this case, there is a strong correlation between the proposed approach and the method in [3] w.r.t. segment duration. Smaller segment durations resulted in lower MCCs in both approaches. For longer segment durations, both approaches yielded a higher MCC, as shown in Figure 4. Similar behavior was noticed when different filter orders were employed. When the bandpass filter order ν = 81 was used, the top performance was observed. The MCC of the proposed approach employing SLIC+STFT for various values of bandpass filter order and segment duration is plotted in Figure 5. The proposed approach employing SLIC+STFT required about 627.2 s. The computational time of the MUSIC [3] one was approximately 704.7 s.

ENF Estimation in Non-Static Video mov 3
The STFT method was employed for ENF estimation. mov 3 is a challenging video depicting movements and different textures. Thus, ENF estimation is a non-trivial task. The STFT achieved an MCC of 0.9877, outperforming the method in [3], which reached an MCC of 0.9191. The ESPRIT method resulted in an MCC of 0.7271. As can be seen in Figure 6, the longer the segment duration, the more accurate the ENF estimation. The top result w.r.t. the MCC was measured for bandpass filter order ν = 51. In mov 3 , improper values of filter order can lead to a significant reduction in MCC. Increasing the segment duration usually results in a more accurate ENF estimation w.r.t. the MCC. In this experiment, it has been noticed that when a large value of bandpass filter order is employed, increasing segment duration deteriorates estimation accuracy. The impact of filter order in MCC is demonstrated in Figure 7. The computational time of the proposed approach employing SLIC+STFT was about 468.5 s, while the MUSIC [3] required 531.4 s.

ENF Estimation in Non-Static Video mov 4
The non-static video mov 4 captures a much more complex scene, where human presence and movement are closer to real-life applications than in the previous videos. Here, the STFT was employed for ENF estimation. The STFT yielded an MCC of 0.9837, outperforming the MUSIC [3], which attained 0.8700. When the ESPRIT method was used, an MCC of 0.7605 was measured. The top performance was achieved for ν = 51. The MCC of the proposed approach employing SLIC+STFT for various segment durations is shown in Figure 8. MCC values for different segment durations and various bandpass filter orders are plotted in Figure 9. The computational time required by the proposed approach employing SLIC+STFT was about 423.3 s, while the MUSIC [3] required 487.2 s.

ENF Estimation in Non-Static Video mov 5
Video mov 5 is one of the most challenging recordings. It resembles a scene captured by a security camera. Here, the STFT was employed for ENF estimation. The STFT achieved an MCC of 0.9432, outperforming the MUSIC [3], whose MCC was measured to be 0.8441. When the ESPRIT was employed, the MCC reached 0.8959. The MCC of the STFT is plotted for various segment durations against that of the MUSIC [3] in Figure 10. When different values of bandpass filter order were employed, a longer segment duration was found to yield an increase in MCC, as can be seen in Figure 11. However, for segment durations longer than or equal to 40 s, a plateau is noticed. The top MCC was achieved for a bandpass filter order of ν = 511. The proposed approach employing SLIC+STFT required 523.4 s, while the computational time of the MUSIC [3] was about 602.6 s.

ENF Estimation in Non-Static Video mov 6
Similarly to video mov 5, mov 6 constitutes a challenging real-world indoor recording. This recording resembles a scene captured by a hidden camera under special conditions, which could hinder ENF estimation accuracy. Nevertheless, the proposed approach employing STFT resulted in an MCC of 0.9309, outperforming the MUSIC [3], whose MCC was measured to be 0.9115. The MCC of SLIC+STFT is plotted for various segment durations against that of the MUSIC [3] in Figure 12. The proposed approach performs better than the MUSIC [3] for a segment duration of about 85 s. For shorter segment durations, the MUSIC [3] demonstrates a stable performance, outperforming the proposed SLIC+STFT. For all values of bandpass filter order tested, increasing the segment duration increases the MCC, as can be seen in Figure 13. The top MCC was achieved for a bandpass filter order of ν = 111. The proposed approach required 572.5 s, while the MUSIC [3] required 639.5 s.

Assessment of MCC Differences
In order to assess whether the improvements in MCC of the proposed approach, employing SLIC and either STFT or ESPRIT, against the MUSIC [3] are statistically significant, hypothesis testing was applied to all six recordings. The null hypothesis, $H_0: c_1 = c_2$, states that the MCCs are equal, and the alternative one, $H_1: c_1 \neq c_2$, states the opposite.
For each video recording, the MCCs of the proposed approach and the MUSIC [3], denoted by $c_1$ and $c_2$, undergo Fisher's z transformation [51]:

$$z_i = \frac{1}{2} \ln\left(\frac{1 + c_i}{1 - c_i}\right), \quad i = 1, 2.$$

The test statistic is given by:

$$q_F = \frac{z_1 - z_2}{\sqrt{\dfrac{2}{K - 3}}},$$

where $K$ denotes the number of ENF samples. The test statistic $q_F$ is distributed as Gaussian with zero mean value and unit variance, for large $K$. It is checked whether the test statistic $q_F$ falls within the region of acceptance for a significance level of 5%. If it does so, the null hypothesis $H_0$ cannot be rejected and, thus, the differences between the MCCs are not statistically significant. On the other hand, if $q_F$ falls outside the region of acceptance (i.e., $|q_F| > 1.96$), the alternative hypothesis $H_1$ is accepted, indicating that the MCC differences are statistically significant. Statistical tests constitute an important contribution of the paper, offering a mechanism for making quantitative decisions, which can lead to accurate ENF estimation in practical forensic applications. The top MCC value of the proposed approach employing SLIC and either STFT or ESPRIT and that of the MUSIC [3] for each recording, together with the filter order employed, are summarized in Table 3. In all cases in Table 3, $q_F$ was calculated and found to be outside the region of acceptance for a significance level of 5%. Consequently, there is sufficient evidence to warrant the rejection of the null hypothesis. Therefore, the differences between the MCCs are statistically significant and the proposed approach yields statistically significant improvements in ENF estimation accuracy against the MUSIC [3].
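The test described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard two-sample Fisher z test in which both correlation coefficients are computed from the same number of samples K. The function names and the sample count K = 600 are hypothetical, since the number of ENF samples per recording is not restated here. The two MCC values are those reported above for mov 3.

```python
import math


def fisher_z(c):
    """Fisher's z transformation of a correlation coefficient c in (-1, 1)."""
    return 0.5 * math.log((1.0 + c) / (1.0 - c))


def mcc_difference_test(c1, c2, k, critical=1.96):
    """Two-sided test of H0: c1 = c2 at the 5% level.

    Assumes both coefficients were computed from k samples, so that the
    difference z1 - z2 has variance 2 / (k - 3) under H0.
    Returns the test statistic and whether H0 is rejected.
    """
    if k <= 3:
        raise ValueError("k must be larger than 3")
    q_f = (fisher_z(c1) - fisher_z(c2)) / math.sqrt(2.0 / (k - 3))
    return q_f, abs(q_f) > critical


# MCCs reported for mov 3 (SLIC+STFT vs. MUSIC [3]); K = 600 is an assumed value.
q_f, significant = mcc_difference_test(0.9877, 0.9191, k=600)
```

Under these assumptions the statistic lies far outside the region of acceptance, so the null hypothesis is rejected. Because the denominator shrinks as K grows, the same MCC gap becomes easier to detect with more ENF samples.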

Conclusions, Limitations, and Future Research
ENF estimation in static and non-static videos is a non-trivial task, especially for complex environments comprising different objects, textures, and moving people. A novel automated approach has been proposed for ENF estimation in static and non-static videos recorded with CCD sensors. It is based on the SLIC algorithm for the generation of regions that share similar characteristics, especially luminance, where ENF variations can be precisely revealed. It has been demonstrated that the proposed approach, which applies either STFT or ESPRIT to a time-series created after SLIC, performs better than the MUSIC [3] in ENF estimation with respect to the maximum correlation coefficient. Moreover, the impact of two factors, namely, the segment duration and the bandpass filter order, on ENF estimation accuracy has been studied. Statistical tests have been conducted, attesting that the improvements in maximum correlation coefficient achieved by the proposed approach are statistically significant against the state-of-the-art approach, which employs the MUSIC method.
In this work, we have explored multiple videos recorded by a fixed camera. A scenario with a moving camera would possibly raise additional difficulties in finding areas of similar characteristics, which are employed in the proposed approach. Consequently, difficulties in accurately estimating the ENF would be anticipated. In addition, although the recordings were of escalating difficulty, there was no more than one person present in the scene. It is difficult to predict whether the proposed approach would perform equally well in an unconstrained environment with a moving camera and scenes with many moving persons.
Future work will aim to extend this work by considering recordings that are captured by the rolling shutter mechanism of CMOS cameras. We are also interested in ENF estimation when non-static cameras are employed, a scenario that is very common in real-life applications due to the widespread use of mobile phones. Another challenging research direction is ENF estimation when multiple persons are recorded in the video.