Deterring Deepfake Attacks with an Electrical Network Frequency Fingerprints Approach

Abstract: With the fast development of Fifth-/Sixth-Generation (5G/6G) communications and the Internet of Video Things (IoVT), a broad range of mega-scale data applications emerge (e.g., all-weather all-time video). These network-based applications highly depend on reliable, secure, and real-time audio and/or video streams (AVSs), which consequently become a target for attackers. While modern Artificial Intelligence (AI) technology is integrated with many multimedia applications to enhance them, the development of Generative Adversarial Networks (GANs) also leads to deepfake attacks that enable manipulation of audio or video streams to mimic any targeted person. Deepfake attacks are highly disturbing and can mislead the public, raising further challenges in policy, technology, social, and legal aspects. Instead of engaging in an endless AI arms race “fighting fire with fire”, where new Deep Learning (DL) algorithms keep making fake AVS more realistic, this paper proposes a novel approach that tackles the challenging problem of detecting deepfaked AVS data by leveraging Electrical Network Frequency (ENF) signals embedded in the AVS data as a fingerprint. Under low Signal-to-Noise Ratio (SNR) conditions, Short-Time Fourier Transform (STFT) and Multiple Signal Classification (MUSIC) spectrum estimation techniques are investigated to detect the Instantaneous Frequency (IF) of interest. For reliable authentication, we enhanced the ENF signal embedded through an artificial power source in a noisy environment using the spectral combination technique and a Robust Filtering Algorithm (RFA). The proposed signal estimation workflow was deployed on a continuous audio/video input for resilience against frame manipulation attacks. A Singular Spectrum Analysis (SSA) approach was selected to minimize the false positive rate of signal correlations. An extensive experimental analysis of reliable, edge-based ENF estimation in deepfaked multimedia recordings is provided to support the need to distinguish artificially altered media content.


Introduction
Modern Artificial Intelligence (AI)/Machine Learning (ML) technology is widely integrated into multimedia applications, and Generative Adversarial Networks (GANs) enable the seamless manipulation of audio or video streams based on the probability distribution of each dataset class [1]. Since GANs were first introduced in 2014, the development of their generator and discriminator modules has led to deepfaked images that are indistinguishable from real images [2]. Such high-resolution and accurate image generation has found many applications in modern media. The potential applications of deepfakes include the e-health/medical field, commercial applications, and privacy protection in media. With the capability to generate feature audio-video channels, the manipulation in the ENF signal with respect to time is treated as the manipulation or modification of the multimedia recordings [35-37]. The ENF signal is also used for forensic analysis of digital evidence, time of recording estimation [38], media synchronization among multiple channels [39], and geographical tagging of the recording [40].
Although the ENF signal is present in the audio and video channels, some challenges exist when using the ENF as a fingerprint mechanism. Because the ENF occupies a low frequency range, the Signal-to-Noise Ratio (SNR) required for reliable ENF estimation must be addressed. In typical deepfake videos, the instantaneous frequency must be estimated, and its accuracy depends on the spectral estimation technique used. In order to adopt the ENF as a fingerprint technique, a solution is also needed to address the redundant ENF reference database for comparing the estimated ENF. This paper analyzes ENF estimation against deepfake audio and video recordings using different spectral estimation techniques, together with robust and reliable estimation in low-SNR recordings. Our contributions in this paper are as follows:
• Designing an effective spectral estimation technique using both parametric and non-parametric methods for IF detection.
• Utilizing a Robust Filtering Algorithm (RFA) over a weighted SNR to identify the harmonic ENF embedded in media recordings and to enhance the ENF signal estimation in the identified harmonic.
• Implementing an effective detection technique against deepfake attacks and an integrated Singular Spectrum Analysis (SSA) based on the correlation coefficient values to reduce the number of false positives in a real-time video broadcasting scenario.
• Demonstrating experimental analysis of video and audio deepfake attack detection using the RFA technique and comparing its effectiveness against traditional spectral estimation techniques.
The rest of the paper is organized as follows. Section 2 discusses the background and related work in deepfake detection technologies and ENF fingerprint applications. Section 3 discusses the spectral estimation techniques compared in this work and the ENF enhancement techniques for low-SNR recordings. Section 4 presents the Singular Spectrum Analysis (SSA) approach used to minimize the false positive rate caused by correlation outliers. Section 5 reports the experimental evaluation of the spectral techniques discussed in Section 3 and the performance evaluation of SSA on edge-based devices. Finally, Section 6 discusses the limitations along with alternate strategies, and Section 7 concludes the paper.

Deepfake Detection Using Traditional and Trained Models
Deepfake detection has become a critical problem in digital media authentication. With advanced computational power and the developments in GANs, the resulting media output is very realistic [2]. Alongside this development, many early detection techniques were proposed that leverage the artifacts introduced in deepfakes. Artifacts such as eye blinking [41], facial distortion, facial symmetry construction [42], and motion artifacts can be visually inspected and identified [43]. Machine-learning-based models were also trained to identify the artifacts. However, these artifacts stem from limited training data and shrink as the GAN architecture improves; with more data, the artifacts can be reduced and more realistic images can be created, rendering visual-artifact-based detectors obsolete.
Hidden features such as GAN fingerprints are unique to the deepfake model architecture [44], and biometric signatures such as heartbeat detection through the skin do not depend on visual artifacts [45]. The signatures can be reliable when the visual artifacts are removed by better training. The GAN also introduces frequency-level artifacts due to the upsampling method in the GAN pipeline [46], and the modified frames can be identified by frequency analysis and studying the compression map [47,48]. Noiseprint is one such fingerprint extracted by suppressing the high-level scene content and leveraging the in-camera processes for unique fingerprints [49]. Noiseprint is applied to reliably localize the frame modification with high performance. Other camera-based fingerprint techniques such as Photo Response Non-Uniformity (PRNU) sensor noise and JPEG compression artifacts were also used in detecting frame-level forgeries due to their dependence on the source device [50,51]. However, these unique artifact-based detectors can also be spoofed using a GAN-based approach where camera traces are inserted into the synthetic images [52]. Along with the reliability of the unique fingerprint for its detection capability, it is also essential that the fingerprint be less prone to forgeries. Hence, we adopted the ENF-based environmental fingerprint where the fluctuations are a random process and signal manipulation in media recordings leaves modification traces.

ENF Applications in Digital Multimedia
The ENF was initially introduced as a forensic verification technique for law enforcement applications to verify the authenticity of audio recordings [27]. Due to electromagnetic induction, audio recorders directly connected to the power grid embed the ENF fluctuations in the audio recordings [28]. The applications were limited to devices connected directly to the power grid until the presence of the ENF was verified in battery-powered devices, where it is captured through the background hum generated by surrounding electrical appliances connected to the grid. This finding extended the range of devices to which the technique applies [31].
Along with audio, video recordings were also discovered to carry ENF fluctuations in the form of illumination frequency [33,34]. The captured photons from artificial light have similar fluctuations, and the estimation method for video recordings depends on the imaging sensor used in the capture device. Complementary Metal-Oxide Semiconductor (CMOS) sensors and Charge-Coupled Devices (CCDs) are the most commonly used imaging sensors, and they use different shutter mechanisms [38]. CCD sensors use a global shutter mechanism in which the whole sensor grid is exposed to photon capture at one instant, so the number of ENF samples captured per second equals the number of frames per second. In CMOS sensors, however, a rolling shutter mechanism captures one ENF sample per row of the sensor grid, which vastly increases the number of captured samples [34]. Due to the limited samples in the CCD sensor, an alternative aliasing frequency technique can be used to estimate the ENF fluctuations [33]; however, it is prone to signal noise. Most commercial-grade camera devices use CMOS sensors because they are cost-effective, which makes ENF estimation from video recordings an effective solution.
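The sampling argument above can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the original experiments: the frame rate of 29.97 FPS and the 1080 rows per frame are hypothetical values, and the rolling-shutter figure ignores the idle period between frames, so it is only an upper bound.

```python
def aliased_frequency(f_signal, f_sample):
    """Apparent frequency of a tone at f_signal after sampling at f_sample."""
    folded = f_signal % f_sample            # fold into [0, f_sample)
    return min(folded, f_sample - folded)   # reflect into [0, f_sample / 2]


def rolling_shutter_rate(fps, rows_per_frame):
    """Upper bound on ENF samples per second when each sensor row is one sample."""
    return fps * rows_per_frame


if __name__ == "__main__":
    # Illumination frequency is twice the nominal grid frequency.
    for nominal in (50.0, 60.0):
        illumination = 2.0 * nominal
        alias = aliased_frequency(illumination, 29.97)  # global shutter: one sample per frame
        print(f"{nominal:.0f} Hz grid -> {illumination:.0f} Hz flicker -> alias at {alias:.2f} Hz")
    print("rolling-shutter samples per second:", rolling_shutter_rate(30, 1080))
```

With a global shutter, the 120 Hz flicker of a 60 Hz grid lands very close to 0 Hz at 29.97 FPS, which illustrates why the aliasing approach is sensitive to noise, whereas the per-row sampling of a rolling shutter comfortably exceeds the illumination frequency.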
The presence of the ENF signal in audio and video recordings has increased its viable applications in identifying the recording time due to its unique fluctuation nature. Although the fluctuations in the ENF are similar throughout the power grid interconnect, the propagation delay can be used to identify the geographical location of the recording within the grid, essentially enabling the ENF technology with the geotagging feature [53]. ENF presence in audio and video recordings can be used to synchronize the media recordings from multiple recorders in commercial applications [39]. Smart grid infrastructure relies on ENF fluctuations to analyze power consumption, create a feedback loop for power outages, and prevent grid-level blackouts [30].

ENF-Based Digital Media Authentication
With its forensic capabilities, the ENF signal can be used to detect both audio and video forgeries. Modifications such as copy and move, frame replay, spatial modifications, and the insertion of external recordings can be identified using ENF inconsistencies [36,37]. Many ENF estimation techniques have already been proposed using multiple spectrum estimation techniques and phase identification. In this work, we focus on studying the effects of deepfake generation on the embedded ENF signal, deploy multiple spectral estimation techniques and verify their effectiveness, and analyze robust, ENF-preserving techniques that increase the likelihood of efficient ENF-based authentication.

Robust ENF Estimation Techniques
ENF signal estimation primarily depends on a reliable Instantaneous Frequency (IF) estimation from the source recording. Due to the harmonics embedded with the nominal frequency, some harmonic frequencies have a higher SNR and can result in a reliable signal [54,55]. However, noise in some harmonics can also completely mask the signal. With deepfake videos, the primary obstacle to extracting a reliable signal from the video is the moving subjects [56]. The signal estimation is more efficient for a static background, but for a moving subject, there is additional noise embedded due to the variation in pixel intensities [35]. Other challenges include the duration of the audio and video recordings used. The duration is not a problem in a continuous stream of multimedia since the window can be larger. However, in the case of a limited recording length, the spectrum estimation for reliable frequency extraction becomes challenging. We aim to test the effects of parametric and non-parametric spectrum estimation techniques against deepfake videos for this scenario [57]. For non-parametric spectrum estimation, the Short-Time Fourier Transform (STFT) was used to estimate the ENF, and in the case of parametric methods [27], we used the Multiple Signal Classification (MUSIC) algorithm [57]. Each method has its own merits in terms of computational power, estimation reliability, and the amount of data required.

Non-Parametric Spectral Estimation Techniques
Non-parametric approaches do not assume that a model generated the data. The typical approach in this method is to use Fourier analysis, which can result in a higher computational cost. We utilized the Short-Time Fourier Transform (STFT)-based approach in this work. The ENF signal fluctuations are represented as $f_{ENF} = f_o + f_\Delta$, where $f_o$ is the nominal frequency and $f_\Delta$ is the instantaneous signal fluctuation. With the Fourier transform of the input signal $x(n)$, the Power Spectral Density (PSD) is calculated from the spectrogram to estimate the spectral band from the harmonic frequency band ($B$) of interest. From the spectral band, the instantaneous frequency in each frame window is estimated as the location of the maximum value in the corresponding power density vector for that time instant. To improve the frequency estimation accuracy, quadratic interpolation can be used, where the spectral values at the neighboring frequency bins are used to refine the spectral peak. The peak location is given as

$$\Delta = \frac{1}{2}\,\frac{\alpha - \gamma}{\alpha - 2\beta + \gamma},$$

where $\alpha$ is the spectral value at the bin before the maximum spectral bin, $\beta$ is the maximum spectral peak, and $\gamma$ is the spectral value at the next bin. If $k^*$ is the bin number of the highest spectral sample, then the resulting frequency estimate bin is adjusted by $\Delta$, and the final interpolated frequency estimate is

$$\hat{f}_{ENF} = (k^* + \Delta)\,\frac{f_s}{N}.$$

Here, $f_s$ is the sampling frequency of the input signal and $N$ is the number of FFT bins used. Because the input signal data length is not limited in a continuous input stream, such as a surveillance system audio/video feed, the number of FFT points ($FFT_j$, where $j = 1, \ldots, J$) can be increased for higher accuracy at the cost of increased computational resources. With known nominal frequency bounds, the ENF estimate from this technique can be accurate, but if the energy peak is not within the bounds, the estimate is susceptible to outliers.
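The frame-wise peak picking and quadratic interpolation described above can be sketched as follows. This is an illustrative sketch rather than the implementation used in the experiments: the function names, the Hann window, the 4 s window with a 1 s hop, the ±1 Hz search band, and the use of dB-scaled power for the three interpolation points are all assumptions.

```python
import numpy as np


def quadratic_peak_offset(alpha, beta, gamma):
    """Fractional bin offset of a parabola through three neighbouring spectral values."""
    denom = alpha - 2.0 * beta + gamma
    return 0.0 if denom == 0 else 0.5 * (alpha - gamma) / denom


def stft_enf(x, fs, f_nominal=60.0, half_band=1.0, win_sec=4.0, hop_sec=1.0, nfft=1 << 15):
    """Estimate one instantaneous frequency per frame inside f_nominal +/- half_band."""
    x = np.asarray(x, dtype=float)
    win_len, hop = int(win_sec * fs), int(hop_sec * fs)
    window = np.hanning(win_len)
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    band = np.flatnonzero(np.abs(freqs - f_nominal) <= half_band)
    estimates = []
    for start in range(0, len(x) - win_len + 1, hop):
        frame = x[start:start + win_len] * window
        power_db = 10.0 * np.log10(np.abs(np.fft.rfft(frame, n=nfft)) ** 2 + 1e-20)
        k = band[np.argmax(power_db[band])]            # k*: strongest bin in the band
        alpha, beta, gamma = power_db[k - 1:k + 2]     # previous, peak, and next bins
        delta = quadratic_peak_offset(alpha, beta, gamma)
        estimates.append((k + delta) * fs / nfft)      # (k* + delta) * fs / N
    return np.array(estimates)


if __name__ == "__main__":
    fs = 1000
    t = np.arange(20 * fs) / fs
    print(stft_enf(np.cos(2 * np.pi * 60.03 * t), fs)[:3])
```

Restricting the search to a band around the nominal frequency reflects the remark about known frequency bounds: a strong interferer outside the band cannot be selected, but an in-band interferer still produces an outlier.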

Parametric Spectral Estimation Techniques
Parametric methods produce a higher-quality spectrum estimate and require less data than non-parametric methods. However, it is essential that the model parameters be estimated appropriately; otherwise, the estimated model could give wrong or misleading estimates. Among the parametric methods, in this paper, we adopted the Multiple Signal Classification (MUSIC) technique based on the subspace analysis of the signal and noise model [57].
MUSIC is a subspace-based frequency estimation model depending on the eigenanalysis of the observed discrete time signal data. For this algorithm, let the discrete time signal $v(n)$ of finite length $L$ with $K$ sinusoidal components be represented as

$$v(n) = \sum_{k=1}^{K} A_k e^{j\omega_k n} + w(n), \quad n = 0, 1, \ldots, L-1,$$

where $A_k = |A_k| e^{j\varphi_k}$ is the complex magnitude of the $k$th harmonic signal component, with the phase $\varphi_k$ being unknown and assumed to be uniformly distributed in $[-\pi, \pi]$, and $w(n)$ is the noise.
For a data sequence of length $M$, let $\hat{R}_v$ denote the estimated $M \times M$ autocorrelation matrix of $v(n)$. With eigen-analysis on $\hat{R}_v$, the eigenvectors $q_1, q_2, \ldots, q_K$ corresponding to the $K$ largest eigenvalues span the signal subspace ($U_s$), and the remaining eigenvectors $q_{K+1}, q_{K+2}, \ldots, q_M$ span the noise subspace ($U_n$). A signal vector $e$ must be orthogonal to the noise subspace eigenvectors, $e \perp q_i$ for $\{q_i\}_{i=K+1}^{M}$, where

$$e(\omega_l) = [1, e^{j\omega_l}, e^{j2\omega_l}, \ldots, e^{j(M-1)\omega_l}]^T, \quad l = 1, 2, \ldots, K.$$

The MUSIC algorithm defines a squared norm function:

$$d^2(\omega) = \left\| U_n^H e(\omega) \right\|^2 = \sum_{i=K+1}^{M} \left| e^H(\omega)\, q_i \right|^2.$$

If the vector $e$ belongs to the signal subspace, then $d^2 = 0$ due to the orthogonality condition. The reciprocal of the squared norm, $1/d^2(\omega)$, therefore results in sharp peaks at the desired signal frequencies.
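A compact way to see the subspace idea at work is the following sketch, which scans the reciprocal of the squared norm over a narrow frequency grid. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the function name, the forward-only sample autocorrelation estimate, the subspace dimension M = 64, and the search grid are hypothetical choices, and K = 2 is used because a real sinusoid consists of two complex exponentials.

```python
import numpy as np


def music_peak(x, fs, K=2, M=64, f_lo=59.5, f_hi=60.5, n_grid=2001):
    """Return the frequency in [f_lo, f_hi] where the MUSIC pseudo-spectrum peaks."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    # Rows are overlapping length-M snapshots of the signal.
    idx = np.arange(len(x) - M + 1)[:, None] + np.arange(M)[None, :]
    snapshots = x[idx]
    R = snapshots.T @ snapshots / snapshots.shape[0]   # M x M sample autocorrelation
    _, Q = np.linalg.eigh(R)                           # eigenvalues in ascending order
    U_n = Q[:, :M - K]                                 # noise subspace: M - K smallest
    freqs = np.linspace(f_lo, f_hi, n_grid)
    omega = 2.0 * np.pi * freqs / fs
    E = np.exp(1j * np.outer(np.arange(M), omega))     # steering vectors e(omega)
    d2 = np.sum(np.abs(U_n.conj().T @ E) ** 2, axis=0) # squared norm d^2(omega)
    pseudo = 1.0 / (d2 + 1e-12)                        # sharp peak where d^2 -> 0
    return freqs[np.argmax(pseudo)], freqs, pseudo


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fs = 1000
    t = np.arange(2 * fs) / fs
    x = np.cos(2 * np.pi * 60.03 * t + 0.3) + 0.01 * rng.standard_normal(t.size)
    print(music_peak(x, fs)[0])
```

Only two seconds of data are used in the usage example, which is in line with the observation that parametric methods need less data; the price is that K and M must be chosen sensibly, since a wrong model order shifts signal components into the noise subspace.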

Robust ENF Enhancement Techniques
Multimedia recordings are often susceptible to noise interference. For reliable estimation of the ENF signal from the source recording, robust measures are needed. The following discusses the reliable techniques used for robust ENF estimation in a noisy environment.

Weighted Harmonics Combination
The ENF signal is embedded in multiple harmonics depending on the nominal frequency. For audio recordings, the ENF is present in either even or odd harmonics depending on the type of microphone used. Similarly, in video recordings, the harmonics are multiples of the illumination frequency, which is twice the nominal frequency. When noise interferes with the targeted frequencies, other harmonics can be leveraged to obtain accurate fluctuations. Therefore, a weighted combination of the harmonics' spectral bins can result in a noise-tolerant spectrum for reliable frequency estimation [54]. The SNR values are computed as the ratio of the PSD ($s(f)$) in the ENF fluctuation range ($f_c$) to the PSD in the spectral band of interest ($f_v$). The typical ENF fluctuation range in the U.S. is ±0.02 Hz.
Here, L is the maximum number of harmonics that carry ENF fluctuations and can be combined. Using the weights computed for the windowed spectral bins, the final spectrum $S(f)$ is evaluated, and quadratic interpolation can be used to estimate the spectral peaks and frequency fluctuations.
In our approach, we used weighted estimation from multiple harmonic bins to identify the frequency with the highest SNR compared to other harmonics. For audio recordings, due to the nature of the microphone used, the ENF is either embedded in the even harmonics or odd harmonics. Similarly, for video recordings, the embedded ENF depends on the Frames Per Second (FPS), the illumination frequency, and the type of artificial light used. With the weight matrix, the ENF harmonic with the highest SNR is identified, and then, the following filtering algorithm is used to enhance the ENF in that frequency range.
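The SNR-weighted combination of harmonics can be illustrated with the sketch below. It should be read as one plausible instance of the idea, not as the exact weighting used in [54] or in this work: normalizing the weights so that they sum to one, scaling each harmonic band by its maximum before combining, and the parameter values (f_c = 0.1 Hz, f_v = 1 Hz, a Hann window) are assumptions made for the example.

```python
import numpy as np


def combine_harmonics(x, fs, f0=60.0, harmonics=(1, 2), fc=0.1, fv=1.0, nfft=1 << 17):
    """Weight each harmonic band by its SNR and sum the bands on a common axis."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    psd = np.abs(np.fft.rfft(x * np.hanning(len(x)), n=nfft)) ** 2
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    base = np.linspace(f0 - fv, f0 + fv, 2001)   # axis referred to the fundamental
    inner = np.abs(base - f0) <= fc              # expected ENF fluctuation range
    bands, snr = [], []
    for k in harmonics:
        band = np.interp(k * base, freqs, psd)   # k-th harmonic mapped onto base axis
        bands.append(band / band.max())
        snr.append(band[inner].mean() / band[~inner].mean())
    weights = np.array(snr) / np.sum(snr)        # assumed normalization
    combined = sum(w * b for w, b in zip(weights, bands))
    return weights, base, combined


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    fs = 1000
    t = np.arange(8 * fs) / fs
    # Weak fundamental, strong second harmonic, broadband noise.
    x = (0.1 * np.cos(2 * np.pi * 60.03 * t) + np.cos(2 * np.pi * 120.06 * t)
         + 0.5 * rng.standard_normal(t.size))
    w, base, spectrum = combine_harmonics(x, fs)
    print(w, base[np.argmax(spectrum)])
```

In the usage example the second harmonic dominates the weights, so the combined spectrum follows the cleaner harmonic, which mirrors the role of the weight matrix in selecting the harmonic with the highest SNR.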

Robust Filtering Algorithm
The RFA was proposed to improve ENF estimation in noisy interference [58]. Instead of reducing the noise after the spectrum is computed, the RFA approach enhances the estimation accuracy by improving the SNR and minimizing in-band noise prior to the ENF estimation.
In the RFA [58], a time-domain preprocessed input signal is encoded into the Instantaneous Frequency (IF) of the Sinusoid-Frequency-Modulated (SFM) signal. A kernel function is utilized to generate the Sinusoidal Time-Frequency Distribution (STFD) of the encoded signal, where the peaks correspond to the denoised ENF. For an optimal selection of the kernel function and the signal encoding, we refer readers to [58] for a detailed description of the algorithm. With the help of the RFA, the ENF can be reliably estimated under a −20 dB noise level. For deepfake videos, the underlying ENF, although captured by the imaging sensor from the artificial light, is corrupted by pixel noise and subject movement [56]. Therefore, the RFA technique can be used to minimize the noise, and then, a suitable spectrum estimation technique can be used to estimate the ENF.

ENF Similarity Verification Using the Correlation Coefficient
ENF-carrying multimedia can be authenticated only by verifying that the estimated ENF fluctuations have not been tampered with or modified. For this verification, we adopted the correlation coefficient as a measure of similarity to compare the ENF signal estimated from the recording ($P_{ENF}$) with the ground truth ENF ($G_{ENF}$) collected directly from the power grid. The value of the correlation coefficient ($\rho$) lies in $[-1, 1]$, where 1 represents the highest similarity and values toward −1 indicate dissimilarity:

$$\rho(l) = \frac{\sum_{n=1}^{N-l} \left[ P_{ENF}(n) - \mu_P \right] \left[ G_{ENF}(n+l) - \mu_G \right]}{\sqrt{\sum_{n=1}^{N-l} \left[ P_{ENF}(n) - \mu_P \right]^2 \, \sum_{n=1}^{N-l} \left[ G_{ENF}(n+l) - \mu_G \right]^2}},$$

where $l$ represents the lag measure, $N$ is the signal length, and $\mu$ is the mean of the corresponding signal. Although relying on a reference ENF collected from the power grid is an overhead for efficiently deploying this authentication scheme, we later discuss a distributed authentication system that relies on a consensus mechanism designed using the ENF fluctuations. Based on the consensus, the networked multimedia devices broadcast their estimated ENF, and a pseudo ground truth ENF is selected, which is used for authentication.
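The similarity check can be sketched as a lagged Pearson correlation evaluated over a sliding window. This is a minimal sketch under assumptions: the function names, the window of 30 ENF samples, and the synthetic random-walk reference used in the usage example are hypothetical and only serve to show the behavior.

```python
import numpy as np


def correlation_coefficient(p_enf, g_enf, lag=0):
    """Pearson correlation between P(n) and G(n + lag) over their overlap."""
    p = np.asarray(p_enf, dtype=float)
    g = np.asarray(g_enf, dtype=float)
    if lag >= 0:
        g = g[lag:]
    else:
        p = p[-lag:]
    n = min(len(p), len(g))
    p, g = p[:n] - p[:n].mean(), g[:n] - g[:n].mean()
    denom = np.sqrt(np.sum(p ** 2) * np.sum(g ** 2))
    return 0.0 if denom == 0 else float(np.sum(p * g) / denom)


def sliding_correlation(p_enf, g_enf, window=30, step=1):
    """Correlation coefficient for each window position along the stream."""
    last = min(len(p_enf), len(g_enf)) - window
    return np.array([correlation_coefficient(p_enf[s:s + window], g_enf[s:s + window])
                     for s in range(0, last + 1, step)])


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    reference = 60.0 + np.cumsum(0.005 * rng.standard_normal(300))   # synthetic ground truth
    estimate = reference + 0.0005 * rng.standard_normal(300)
    estimate[200:270] = 60.0 + 0.01 * rng.standard_normal(70)        # simulated tampering
    rho = sliding_correlation(estimate, reference)
    print(rho[:5], rho[200:205])
```

Windows that overlap the tampered segment show a drop in the correlation coefficient, which is the cue used to flag and localize a modification; isolated drops caused by noise are what the anomaly detection stage is meant to suppress.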

SSA for Anomaly Detection
With a reliable ENF signal estimated using the appropriate spectral estimation techniques, we integrated an anomaly detection scheme to analyze the correlation vector and detect major deviations from the historical trend. Figure 1 represents the algorithm flow including the signal estimation process for anomaly detection. The estimated ENF signal and ground truth reference signal were compared using the sliding window algorithm for continuous monitoring of the input media stream. Similarly, the generated correlation coefficients were analyzed for outlier detection and media forgery. The SSA algorithm decomposes the time series vector and performs Singular-Value Decomposition (SVD) for change-point detection [59] or future trend prediction [60]. The following section discusses the algorithm based on the correlation coefficient values for change-point detection analysis.

SSA Algorithm
The correlation coefficient, as a similarity measure between the ENF estimated from the multimedia recordings and the ground truth ENF, can generate false positives due to signal noise. We integrated a Singular Spectrum Analysis (SSA) technique to regulate the fluctuations in the correlation coefficient and to perform change-point detection [59,60]. The correlation coefficient samples are non-periodic in nature, so a larger window size is required to apply the SSA algorithm. Let $\rho_{n+1}, \rho_{n+2}, \rho_{n+3}, \ldots, \rho_{n+N}, \rho_{n+N+1}, \ldots, \rho_{n+N+Q}$ be the non-periodic correlation coefficient samples collected from the online ENF comparison and $N$, $M$, $l$, $p$, and $q$ be fixed integers, where $n$ iterates over new correlation coefficient values, $N$ is the window size for the base matrix, $Q$ is the window size for the test matrix with $Q = q - p$, and $l < M \le N/2$. For each $n = 0, 1, \ldots$, the following algorithm is executed:

1. Create the base matrix of size $(M \times K)$ using the initial correlation coefficient values, with $K = N - M + 1$:

$$X_B^n = \begin{pmatrix} \rho_{n+1} & \rho_{n+2} & \cdots & \rho_{n+K} \\ \rho_{n+2} & \rho_{n+3} & \cdots & \rho_{n+K+1} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{n+M} & \rho_{n+M+1} & \cdots & \rho_{n+N} \end{pmatrix}$$

2. Using the base matrix, also known as the Hankel matrix, compute $R = X_B^n (X_B^n)^T$. The Singular-Value Decomposition (SVD) of the matrix $R$ results in $M$ eigenvectors and eigenvalues. The eigenvalues are arranged in descending order, and the $l < M$ eigenvectors corresponding to the top $l$ eigenvalues are selected to form a group $I$. These eigenvectors span an $l$-dimensional subspace $L_{n,I}$ of the $M$-dimensional space $\mathbb{R}^M$, such that $L_{n,I}$ captures the main features of the base matrix.

3. With the base matrix established, construct a test matrix of size $(M \times Q)$ with a lag $p$ from the base matrix and $Q = q - p$:

$$X_T^n = \begin{pmatrix} \rho_{n+p+1} & \rho_{n+p+2} & \cdots & \rho_{n+q} \\ \rho_{n+p+2} & \rho_{n+p+3} & \cdots & \rho_{n+q+1} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{n+p+M} & \rho_{n+p+M+1} & \cdots & \rho_{n+q+M-1} \end{pmatrix}$$

4. With the test matrix $X_T^n$ and the $l$-dimensional subspace $L_{n,I}$, calculate the detection statistic of abnormal fluctuations in the input values as the sum of the squared Euclidean distances between the column vectors of $X_T^n$ and the subspace $L_{n,I}$. Let the basis vectors of $L_{n,I}$ be $U_{i_1}, U_{i_2}, \ldots, U_{i_l}$, collected in the matrix $U = [U_{i_1}, \ldots, U_{i_l}]$, and let $X_j^n$ denote the column of the test matrix that starts at $\rho_{n+j}$. The detection statistic $D_{n,I,p,q}$ for $n$ iterating over $\{0, 1, \ldots\}$ is given as

$$D_{n,I,p,q} = \sum_{j=p+1}^{q} \left( (X_j^n)^T X_j^n - (X_j^n)^T U U^T X_j^n \right).$$

5. With the iterating values, normalize the detection scores as $S_n = D_{n,I,p,q} / \mu_{n,I}$.

6. Calculate the Cumulative Sum of deviations (CUSUM) of the detection statistics to eliminate false positives and seek major changes in the input values. A threshold $h$ is used to detect the fluctuations in the correlation coefficient of the ENF values. The CUSUM detection score $W_n$ is defined using the operator $(a)^+ = \max(0, a)$.
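The steps above can be sketched end to end as follows. Several details are assumptions rather than the authors' settings: the lag is fixed to p = K so that the test window directly follows the base window, the normalizer is taken as the mean of the detection statistic over an initial attack-free reference period, and the CUSUM recursion W_{n+1} = (W_n + S_{n+1} − S_n − κ)^+ with κ = 1/(3√(MQ)) follows one common form from the SSA change-point literature. Function and parameter names are hypothetical.

```python
import numpy as np


def lagged_matrix(series, start, rows, cols):
    """Hankel-structured matrix whose j-th column is series[start + j : start + j + rows]."""
    idx = start + np.arange(rows)[:, None] + np.arange(cols)[None, :]
    return series[idx]


def ssa_cusum(rho, N=40, M=20, l=3, Q=10, kappa=None, n_ref=50):
    """Return normalized detection scores S_n and CUSUM scores W_n."""
    rho = np.asarray(rho, dtype=float)
    K = N - M + 1
    p, q = K, K + Q                                   # assumed: test window follows base window
    n_iter = len(rho) - (q + M - 1) + 1
    D = np.empty(n_iter)
    for n in range(n_iter):
        base = lagged_matrix(rho, n, M, K)            # M x K base (Hankel) matrix
        # Left singular vectors of the base matrix are the eigenvectors of R = X_B X_B^T.
        U = np.linalg.svd(base, full_matrices=False)[0][:, :l]
        test = lagged_matrix(rho, n + p, M, Q)        # M x Q test matrix
        residual = test - U @ (U.T @ test)            # component outside the subspace
        D[n] = np.sum(residual ** 2)                  # summed squared distances
    S = D / D[:n_ref].mean()                          # assumed normalizer
    kappa = 1.0 / (3.0 * np.sqrt(M * Q)) if kappa is None else kappa
    W = np.empty_like(S)
    W[0] = S[0]
    for n in range(1, len(S)):
        W[n] = max(0.0, W[n - 1] + S[n] - S[n - 1] - kappa)
    return S, W


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rho = 0.9 + 0.02 * rng.standard_normal(300)
    rho[200:240] = 0.2 + 0.05 * rng.standard_normal(40)   # simulated forged segment
    S, W = ssa_cusum(rho)
    print(W[:100].max(), W[150:180].max())
```

A structural change is flagged when W_n exceeds the threshold h; in this sketch h is left to the caller, since its value depends on the chosen M, Q, and the desired false alarm rate.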

Prototype Implementation
In our experiments, the DeepFaceLab software was adopted to create video deepfakes [61], and Descript was used to create audio deepfakes [62]. The DeepFaceLab software is capable of generating deepfakes in real time using face swapping and mapping to the original face by modifying the surrounding pixels. For audio deepfakes, as little as ten minutes of target audio is enough to train a deepfake voice that mimics the targeted actor. Because such software is easily available and requires almost no expertise to use, it is likely to increase the generation of fake media. In this paper, we study the effects of multiple spectrum analysis techniques against deepfake modification and use a signal enhancement technique to estimate a reliable signal and localize the forgery. Figure 2 presents the overall architecture of the prototype implementation consisting of multiple edge clients and an edge server. The computational load and resource allotments are shared across multiple devices for enhanced ENF estimation. Each client collects real-time streams from cameras and then extracts ENF fingerprints; spectral estimation is performed at the edge devices and SSA detection at the more powerful edge server. The deployment of our proposed approach was analyzed on both edge-based devices (Raspberry Pi) and an edge server (desktop), and a detailed performance analysis of computational resource consumption is also presented.

Effects of Spectral Estimation Techniques against Deepfakes
Spectral estimation techniques have different parameters to control for a reliable estimation. In this work, we used MUSIC- and STFT-based spectral methods. The spectrum is computed with both techniques after applying a bandpass filter around the nominal frequency of interest. In the STFT method, the spectral harmonic bands are collected from the spectrogram, whereas the MUSIC method looks for K complex exponential components in the signal. For the ENF signal, the value of K is two. Once the spectrum is computed for both techniques, the maximum frequency bin is identified with the help of quadratic interpolation, and the required frequency fluctuations are estimated. Although each spectrum estimation method has its own advantages, for deepfake videos, it is important that the method be consistent and fast. MUSIC performs better with a shorter signal length, whereas STFT relies on the Fourier transform, which needs more data for its computation. In Figure 3, the performance of each method is measured with respect to the input signal duration. The ENF was estimated from a static background recording under artificial light along with the ground truth reference signal with a sampling rate of 1000 Hz. The input signal was extended by six seconds for each round, and the correlation coefficient was measured for similarity with the reference signal. Figure 3 clearly shows that MUSIC performed better with shorter recordings, but STFT outperformed the MUSIC method when sufficient input data were given. Following the performance analysis based on the duration of the input signal, the STFT and MUSIC algorithms were next tested with a fixed-length input signal for ENF estimation from video recordings. The video recording used includes a talking-head subject with movements recorded under artificial light. Figure 4 represents the correlation coefficient for the STFT and MUSIC algorithms.
Here, it is clear that STFT performed better in video-based ENF estimation compared to the MUSIC algorithm. For a reliable deepfake manipulation detection, it is vital that the ENF from unmodified recordings be estimated more reliably, so that any significant changes in the ENF can be marked as a potential manipulation. For some audio and video analysis, the harmonic frequencies in the recordings are targeted with external noise interfering with the embedded ENF. For this purpose, we used reliable estimation techniques such as the weighted combination of multiple harmonics [54] along with the robust filtering algorithm proposed in [58]. Table 1 represents the average SNR of the ENF fingerprints in the media recordings. Using the SNR matrix, the targeted frequency range was identified and the RFA algorithm was used to increase the SNR significantly. From Table 1, for power and audio recordings, the ENF signal is stronger in odd harmonics, and in video recordings, it is stronger in even harmonics since the nominal frequency in videos is 120 Hz.

ENF Enhancement Using the RFA
The efficiency of the RFA was tested on real-world audio recordings with the ENF embedded through background hum. We used the STFT algorithm to identify the harmonic frequency with the highest SNR and estimate the ENF signal. However, due to external noise, the ENF estimated from a single harmonic frequency had a significantly lower correlation coefficient when compared with its reference ENF. Using the RFA to enhance the ENF in the frequency of interest, the noise was suppressed, and then, the ENF was estimated from the new spectral bins. Figure 5 represents the correlation coefficient histogram of the audio ENF before and after applying the RFA to the recordings. Although it was unclear whether the recordings carried any ENF signature before the RFA, it can be clearly seen that the RFA enhanced the embedded ENF and made ENF estimation more reliable for recordings with a lower SNR. ENF estimation from video recordings depends on the presence of artificial light in the recording. With lower-intensity light in the background, the ENF is not reliable due to the interference of in-camera noise such as ISO sensor noise and subject-movement-related pixel disturbances. We tested the performance of enhancing the harmonics of the ENF in a video recording with lower illumination intensity and noise. In Figure 6, we used the STFT method to estimate the ENF from the video recording with and without the RFA to enhance the ENF harmonics. Given the improvement in the correlation coefficient of the RFA-enhanced ENF signal, the ENF can be reliably estimated from video recordings with a lower SNR, as long as they carry the embedded artificial light fluctuations. Attacks on audio and video recordings such as deepfakes alter the original samples of the recordings to create a false perception. Along with the samples, the embedded ENF frequency fluctuations, which are temporally sensitive, are also altered, which disrupts the ENF fingerprint.
Using a reference signal recorded at the same time instant, the manipulations to the multimedia recordings can be detected and also localized with a reliable ENF estimation method [35,56]. Figure 7 shows the drop in the correlation coefficient of the audio and video deepfake recordings where the ENF was estimated from the RFA-enhanced harmonics. For the video deepfake, the whole recording was swapped with an alternate trained face model, and this resulted in a drop in the overall correlation for the whole video. For the audio recording, a partial deepfake recording was generated and appended to the original recording. The correlation coefficient can also be used to localize the forgery. In order to deploy the proposed authentication scheme to reliably authenticate a continuous stream of media input such as surveillance system monitoring, the ENF should be reliably estimated for better correlation. However, sometimes, it is susceptible to outliers that occur due to a frame being skipped due to network delay or frame obfuscation. In order to address the outliers and reduce the false alarm rate, we integrated Singular Spectrum Analysis (SSA) to observe the correlation coefficient vector and suppress the outliers. The performance analysis and the computation overhead were studied from the perspective of edge-based computers.

SSA Performance Analysis
We evaluated SSA performance in terms of processing time and computational resource consumption on the host machine. During our test, we repeatedly executed only the SSA functions and then evaluated the processing time and resource usage. Thus, deepfake video preprocessing was not considered in this test. Table 2 describes the devices used for the experimental study for SSA detection. The prototype was deployed on a small-scale Local Area Network (LAN) that consisted of multiple desktops and IoT devices. We used the desktop to simulate a fog server and RPi devices to simulate edge servers. In deepfake attack scenarios, an adversary attempts to use forged or duplicate audio and video streams to fool video surveillance systems. Figure 8 shows how SSA detection identifies suspicious activities. We simulated attack scenarios that inject fake multimedia streams in attack ranges 200-240 and 400-430. Owing to the randomness and unpredictability of the ENF in streams, the injected audio or video streams demonstrate a very low ENF correlation coefficient score when compared with the ground truth ENF. As a result, multimedia streams in attack ranges have W_n higher than the threshold h, and they will be marked as suspicious. Instead of relying on the experimental threshold of 0.8 used to compare the correlation coefficient factors, here, we rely on change-point detection of the SSA decomposition, where a structural change is detected if W_n > h, as mentioned in Equation (3) [59].
Moreover, our SSA detection tolerates noise, which mitigates false alarms. Figure 8 shows that isolated detection points affected by noise do not significantly change W_n. The calculation of the detection statistics in Stage D took the longest time, as it needed more computational resources to perform singular-value decomposition on the Hankel matrix and compute the Euclidean distances between the base matrix and the test matrix. Stage S simply converts D into the normalized sum of squared distances S_n, and Stage W then calculates the CUSUM statistics; both therefore had lower processing latency than Stage D. As a result, the processing time of Stage D dominated the total latency of executing SSA on all three platforms. To evaluate the run-time overhead of SSA detection on the host machine, only one core was used to run the SSA detection thread. We used the top command to monitor the running status of the thread and obtained the CPU percentage distribution and average memory usage. Figure 9b shows the CPU usage percentage of SSA detection on the different devices. Owing to the differences in computing capability, SSA detection on a device with a more powerful CPU core had a lower mean and deviation of CPU usage percentage (desktop < RPi4 < RPi3).
The memory usage during SSA detection was 96 MB on the desktop, 99 MB on the RPi4, and 72 MB on the RPi3. Executing SSA on different platforms may show different memory usage owing to the heterogeneous CPU architectures (X86 vs. ARM), different OSs (Ubuntu vs. Raspbian), and even different system running statuses. However, the gap was marginal, and the desktop and the RPi devices showed almost the same memory cost when executing SSA detection. The measured memory usage also included the cost of OS tasks, such as managing the thread, but the majority of the contribution came from the execution of the SSA algorithm. Therefore, the measurement approximates the memory cost of SSA detection.
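The three stages discussed above (D, S, and W) can be sketched as follows, as a simplified illustration rather than the evaluated implementation. The base and test Hankel (trajectory) matrices are built from the correlation coefficient vector, D is the sum of squared Euclidean distances from the test vectors to the subspace spanned by the leading singular vectors of the base matrix, S normalizes D, and W accumulates a CUSUM-type statistic that is compared with the threshold h. The parameter names and values (N, M, Q, rank, kappa, n_ref, h) and the exact CUSUM recursion are assumptions made for this example and may differ from Equation (3); the attack ranges follow the simulated scenario in the text.

```python
import numpy as np

def lagged(x, start, count, M):
    """Hankel (trajectory) matrix with columns x[j:j+M], j = start..start+count-1."""
    return np.column_stack([x[j:j + M] for j in range(start, start + count)])

def ssa_cusum(x, N=40, M=20, Q=10, rank=1, kappa=0.05, n_ref=50):
    """Simplified SSA change-point statistics (all defaults are hypothetical).

    Stage D: squared distance of the test vectors to the base subspace.
    Stage S: D normalized by M*Q and by its level over the first n_ref
             positions, which are assumed to be attack-free.
    Stage W: one-sided CUSUM-type recursion on the increments of S.
    """
    x = np.asarray(x, dtype=float)
    K = N - M + 1                      # number of columns in the base matrix
    n_pos = len(x) - N - Q + 1         # number of window positions
    D = np.empty(n_pos)
    for n in range(n_pos):
        base = lagged(x, n, K, M)      # covers samples n .. n+N-1
        test = lagged(x, n + K, Q, M)  # ends at sample n+N+Q-1
        U = np.linalg.svd(base, full_matrices=False)[0][:, :rank]
        resid = test - U @ (U.T @ test)
        D[n] = np.sum(resid ** 2)
    S = D / (M * Q)
    S = S / S[:n_ref].mean()
    W = np.zeros(n_pos)
    for n in range(1, n_pos):
        W[n] = max(0.0, W[n - 1] + S[n] - S[n - 1] - kappa)
    return D, S, W

# Synthetic correlation-coefficient vector: high correlation with small noise,
# and low values in the simulated attack ranges 200-240 and 400-430.
rng = np.random.default_rng(1)
corr = 0.95 + 0.01 * rng.standard_normal(600)
for lo, hi in [(200, 240), (400, 430)]:
    corr[lo:hi] = rng.uniform(0.0, 0.4, hi - lo)

N, M, Q = 40, 20, 10
D, S, W = ssa_cusum(corr, N=N, M=M, Q=Q)
h = 10.0                               # hypothetical threshold
alarms = np.flatnonzero(W > h)
# Map the first alarm position to the newest sample inside its test window.
first_alarm_sample = int(alarms[0]) + N + Q - 1
print("first alarm raised at sample", first_alarm_sample)
```

In this sketch the singular-value decomposition inside the loop is the most expensive step, which is consistent with Stage D dominating the measured latency, while the S and W stages are simple element-wise operations.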

Discussion
Fake media generation using deepfake technology has raised significant concerns, and multiple "attacks" have already been witnessed. Although "nice" deepfaked AVS may bring benefits in multiple fields, measures to distinguish fake media from authentic counterparts are necessary to prevent negative impacts. This paper introduces an ENF-fingerprint-based approach to analyze deepfake-generated media recordings and detect manipulations. While the ENF is verified in indoor audio and video recordings owing to the presence of devices connected to the power grid, there are certain limitations. For example, the proposed ENF authentication is not applicable to recordings made in outdoor settings or in scenarios where the presence of the ENF is negligible. Meanwhile, in applications such as conference calls that occur in indoor settings under artificial light, the ENF-based detection scheme is an effective countermeasure [63].
The number of samples collected for the ENF is also a key factor for reliable estimation of the embedded signal. Video cameras based on CCD imaging sensors use a global shutter mechanism, so the number of ENF-carrying samples is limited to the number of frames collected. Because of the low sampling rate and the higher nominal frequency, the Nyquist criterion is not satisfied [33]. However, the aliased frequency can still be used to identify the signal fluctuations, at the cost of reduced accuracy.
The proposed integration of the RFA and a weighted combination of spectral harmonics for ENF enhancement adds complexity for time-sensitive applications such as surveillance systems. The RFA is used to enhance the SNR of the signal prior to the ENF estimation algorithm and is therefore suggested for scenarios where external noise interference degrades the SNR of the ENF signal to as low as −20 dB. However, as the signal duration increases, the time taken by the RFA increases exponentially compared with the spectral combination method [58]. Figure 10 presents the difference in execution time between the enhancement algorithms, where the input signal is lengthened by 12 s in each round. To integrate the RFA with the STFT-based ENF estimation, we used a sliding window approach with a window size of 45-60 s and a shift size of five to ten seconds. For an online detection system, manipulations made to the live feed are detected within ten seconds of their occurrence, provided the ENF enhancement and SSA algorithms are integrated.
ENF authentication is not restricted to specific media types, unlike trained models that depend on input compatibility. It is applicable to both audio and video authentication and thus provides a generalized solution against media manipulation attacks. In the presented work, media manipulations were detected using an external reference ENF signal, also referred to as the ground truth signal. Deploying external circuitry for this purpose could be redundant, and a central reference ENF database would not be effective since the ENF is different for each power grid. Instead, a distributed authentication scheme could be adopted in which the ENF estimated by each device is used to generate a ground truth signal without relying on an external reference signal [64]. Our previous work proposed a consensus mechanism for edge-based devices to estimate the ENF for continuous media input. The broadcast ENF was used to create a mutually agreed ground truth signal, which allows faulty nodes to be detected. We refer readers to our prior work on the ENF-based consensus algorithm for detecting forgery attacks for further discussion [56,63,64].
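Returning to the sampling limitation of global-shutter video noted earlier in this discussion, the frequency at which the ENF component appears after aliasing can be worked out with a short calculation. The sketch below assumes that the illumination flickers at twice the nominal grid frequency (100 Hz or 120 Hz) and that the frame rate acts as the sampling rate; the function name and the example frame rates are illustrative and are not taken from the experiments.

```python
def alias_frequency(f_signal, f_sample):
    """Apparent frequency of a tone at f_signal when sampled at f_sample.

    The tone folds to its distance from the nearest integer multiple of the
    sampling rate, which always lies in the range [0, f_sample / 2].
    """
    k = round(f_signal / f_sample)
    return abs(f_signal - k * f_sample)

# Illumination flicker at twice the nominal grid frequency (assumption).
for flicker_hz in (100.0, 120.0):
    for fps in (30.0, 29.97):
        alias = alias_frequency(flicker_hz, fps)
        print(f"{flicker_hz:.0f} Hz flicker at {fps} fps -> {alias:.2f} Hz")
```

For example, a 100 Hz flicker sampled at 30 fps folds to 10 Hz, while a 120 Hz flicker sampled at exactly 30 fps folds to 0 Hz, where the fluctuations are hard to separate from the mean brightness. A frame rate of 29.97 fps moves the 120 Hz component to about 0.12 Hz.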

Conclusions
Emerging technologies such as deepfakes have become a common source of misinformation that undermines trust in online media. Unlike existing work on deep-learning-based detection models trained to identify deepfakes, we tackled the problem of identifying frame manipulations such as deepfakes using an environmental fingerprint technique. Using the Electrical Network Frequency (ENF) signal embedded in media recordings through artificial power sources, the integrity of the recording can be verified in both the spatial and temporal domains. In this work, we presented a comprehensive analysis of effective spectral estimation techniques such as Short-Time Fourier Transform (STFT) and Multiple Signal Classification (MUSIC) for media recordings with a low Signal-to-Noise Ratio (SNR). Our experimental results showed that STFT is more reliable for ENF estimation. However, according to our findings, the MUSIC algorithm performs better for spectral estimation on media recordings with a short duration.
In addition to spectral analysis techniques, we tested signal enhancement algorithms such as the Robust Filtering Algorithm (RFA) and weighted harmonics combinations against deepfake audio and video recordings. From our experiments, the RFA technique significantly improved the SNR of the embedded ENF signal and resulted in reliable verification of signal authenticity. We also integrated the proposed method for online media verification, and based on the experimental results, we adopted STFT with the RFA algorithm considering the execution time complexity in our testbed. Furthermore, to minimize the false positive rate due to outliers, we deployed our ENF-based authentication scheme with the Singular Spectrum Analysis (SSA) method to improve the performance of detecting media manipulations. The results demonstrated a reliable and comprehensive tool against fake media distribution, adaptable to heterogeneous media recordings made under the influence of the power grid.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: