Article

Robust JND-Guided Video Watermarking via Adaptive Block Selection and Temporal Redundancy

by Antonio Cedillo-Hernandez 1, Lydia Velazquez-Garcia 2, Manuel Cedillo-Hernandez 3,*, Ismael Dominguez-Jimenez 4 and David Conchouso-Gonzalez 1

1 Tecnologico de Monterrey, Escuela de Ingenieria y Ciencias, Av. Eugenio Garza Sada 2501 Sur, Col. Tecnologico, Monterrey 64700, Nuevo León, Mexico
2 Instituto Politecnico Nacional, Centro de Investigaciones Economicas, Administrativas y Sociales, Lauro Aguirre 120, Agricultura, Ciudad de Mexico 11360, Mexico
3 Instituto Politecnico Nacional, Escuela Superior de Ingenieria Mecanica y Electrica, Unidad Culhuacan, Av. Santa Ana 1000, San Francisco Culhuacan, Coyoacan, Ciudad de Mexico 04440, Mexico
4 Universidad Autonoma del Estado de Hidalgo, Escuela Superior de Tlahuelilpan, Sergio Butron Casas 19, La Rancheria, Col. Centro, Tlahuelilpan 42780, Hidalgo, Mexico
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(15), 2493; https://doi.org/10.3390/math13152493
Submission received: 14 July 2025 / Revised: 26 July 2025 / Accepted: 31 July 2025 / Published: 3 August 2025
(This article belongs to the Section E: Applied Mathematics)

Abstract

This paper introduces a robust and imperceptible video watermarking framework designed for blind extraction in dynamic video environments. The proposed method operates in the spatial domain and combines multiscale perceptual analysis, adaptive Just Noticeable Difference (JND)-based quantization, and temporal redundancy via multiframe embedding. Watermark bits are embedded selectively in blocks with high perceptual masking using a Quantization Index Modulation (QIM) strategy, and the corresponding Discrete Cosine Transform (DCT) coefficients are estimated directly from the spatial domain to reduce complexity. To enhance resilience, each bit is redundantly inserted across multiple keyframes selected based on scene transitions. Extensive simulations over 21 benchmark videos (CIF, 4CIF, HD) validate that the method achieves superior performance in robustness and perceptual quality, with an average Bit Error Rate (BER) of 1.03%, PSNR of 50.1 dB, SSIM of 0.996, and VMAF of 97.3 under compression, noise, cropping, and temporal desynchronization. The system outperforms several recent state-of-the-art techniques in both quality and speed, requiring no access to the original video during extraction. These results confirm the method’s viability for practical applications such as copyright protection and secure video streaming.

1. Introduction

The digitalization of multimedia content and the prevalence of high-bandwidth communication networks have considerably changed the production, distribution, and consumption of information. Video content has emerged as a predominant medium across multiple sectors, including entertainment, education, social media, and surveillance. The simplicity of duplicating, modifying, and redistributing digital video has generated significant apprehensions about intellectual property safeguarding, content verification, and data integrity [1,2]. These concerns have been exacerbated by the widespread use of compression standards, transcoding across platforms, and malicious interference, which may distort or eliminate identifying information.
Digital watermarking has emerged as a suitable solution for these issues. It allows embedding copyright or authentication data directly into the host media in a way that is imperceptible to human observers while remaining robust to signal processing and malicious attacks. A wide range of watermarking schemes have been proposed for still images [3,4], audio signals [5], and digital videos [6,7]. While image watermarking has matured considerably, the domain of video watermarking still presents unique challenges due to the temporal dimension, larger data volumes, and frequent exposure to aggressive processing such as compression, frame rate conversion, and cropping [8].
Two dominant paradigms exist in video watermarking: transform-domain and spatial-domain approaches. Transform-domain techniques, such as those based on the Discrete Cosine Transform (DCT), Discrete Wavelet Transform (DWT), or Singular Value Decomposition (SVD), typically offer superior robustness but at the cost of high computational complexity [9,10,11]. Spatial-domain methods, in contrast, offer lower computational overhead and are easier to implement in resource-constrained environments, but they often lack sufficient robustness under aggressive signal modifications [12,13].
Recent advances have attempted to bridge this trade-off. One promising direction is the use of Just Noticeable Difference (JND) models to guide watermark embedding in the spatial domain, leveraging the human visual system (HVS) to determine perceptually insensitive regions [14,15]. By incorporating saliency information or texture masking, such methods can adaptively modulate the watermark energy, balancing imperceptibility and robustness. For instance, the proposal in [16] presented a saliency-modulated JND model for video frames that enhances the watermark strength in low-attention regions. Similarly, Wan et al. [14] demonstrated effective image watermarking by combining color sensitivity with JND thresholds.
Despite these improvements, key limitations persist. First, the majority of spatial-domain schemes rely on static, frame-level decisions, neglecting the temporal dynamics and redundancies present across frames. Second, the embedding locations are often selected uniformly or randomly, without regard to perceptual importance, spatial stability, or semantic content. This can lead to suboptimal robustness, especially under cropping, compression, or transcoding attacks [17]. Third, many schemes utilize single-frame insertion without redundancy, resulting in high Bit Error Rates (BER) during extraction in realistic conditions [18,19].
To overcome these limitations, researchers have increasingly explored the integration of perceptual modeling, adaptive modulation, and channel coding techniques. Among modulation strategies, Quantization Index Modulation (QIM) has gained attention due to its theoretical robustness and simplicity. QIM embeds watermark bits by quantizing selected host features to different intervals based on the message bit [20]. While originally designed for still images, it has been successfully adapted for videos when combined with temporal synchronization strategies [21,22]. Redundancy is another critical aspect of robust watermarking. Techniques such as block-wise repetition, majority voting, and forward error correction (e.g., BCH or Reed–Solomon codes) have been used to counteract localized data loss. Moreover, temporal redundancy, spreading the watermark across several frames, has been shown to increase robustness against frame dropping, re-encoding, and desynchronization [23,24]. However, without careful selection of embedding blocks and modulation levels, redundancy alone may not yield sufficient gains.
In this work, we propose a perceptually guided, fully blind watermarking method that integrates adaptive block selection, JND-based QIM modulation, and multiframe redundancy. By unifying these strategies in a lightweight spatial-domain pipeline, our method achieves strong robustness against noise, compression, and geometric attacks while maintaining high visual fidelity and low computational cost. A detailed summary of our novel contributions is presented in the following subsection.

Main Contributions

Although our proposed method leverages known building blocks such as QIM-based embedding and JND-guided perceptual modeling, the contribution of this work lies in the coherent integration, optimization, and evaluation of a full spatial-domain watermarking pipeline tailored for video content. Below we highlight the core technical novelties and distinctions from the prior literature:
  • Perceptual-driven block selection: Unlike methods that uniformly or randomly select embedding regions, we combine multiscale Laplacian-based texture analysis, spectral saliency contrast, and semantic masking (face/object detection) to create a composite perceptual map. This map drives an adaptive, content-aware selection of robust insertion blocks.
  • Spatial-domain DCT estimation: Instead of computing full DCT transforms, we estimate selected coefficients directly from the spatial domain using analytical approximations. This improves computational efficiency while preserving compatibility with frequency-domain robustness.
  • Adaptive QIM modulation using JND maps: The quantization step is locally modulated using a JND model that incorporates luminance, spatial masking, and contrast features. This improves imperceptibility while preserving watermark recoverability.
  • Redundancy via multiframe embedding: We implement a lightweight redundancy mechanism that spreads the watermark across multiple scene-representative frames, selected through histogram-based keyframe detection. Majority voting is used for recovery, improving resilience to temporal desynchronization, noise, and compression.
  • End-to-end blind architecture with low complexity: The entire method operates without access to the original video at extraction time, and is designed for low runtime overhead, enabling practical deployment in resource-constrained environments.
To the best of our knowledge, no prior work integrates these perceptual, modulation, and temporal strategies in a fully blind, spatial-domain video watermarking system with such a favorable balance of robustness, imperceptibility, and runtime performance.

2. Related Works

2.1. Overview of the Original Method

In our previous work [25], we proposed a computationally efficient and perceptually guided video watermarking method operating in the spatial domain. The method was designed to improve robustness while maintaining imperceptibility and low processing cost, which are three competing objectives in watermarking research.
Three core contributions were central to the method:
  • Spatial-domain DCT-based embedding: Rather than computing the full DCT/IDCT transform on every frame, we developed a method for estimating and modifying selected DCT coefficients directly through spatial operations, dramatically reducing computational burden.
  • Visual attention-modulated JND profile: Watermark energy was determined by combining JND thresholds with saliency analysis in a perceptual model that was computed on keyframes. This allowed stronger embedding in visually insensitive regions while preserving perceptual transparency [16].
  • Scene-based watermark modulation: The video was segmented into scenes, each represented by a keyframe [26]. Watermark parameters were calculated only on these keyframes and propagated through the rest of the scene using motion tracking and saliency consistency, reducing redundant computation.
The method was validated across video sequences of varying resolution and complexity, achieving high imperceptibility and robustness with a processing time reduction of over 80% compared to frame-by-frame transform-domain approaches. However, the original method relied on frame-specific, non-redundant insertion and lacked advanced block selection and encoding schemes to handle harsh or compound attack scenarios.
Although this proposal extends our previous work [25], it introduces several critical enhancements that restructure the original method. First, the current method employs a more sophisticated perceptual modeling framework by combining multiscale detail analysis, saliency detection, and semantic masking, as opposed to the earlier gradient-based texture selection. Second, rather than uniformly embedding watermark bits, the proposed scheme includes a block selection strategy based on perceptual scores to prioritize visually insensitive yet structurally robust regions. Third, while both methods operate in the spatial domain, the current technique incorporates an adaptive JND-based modulation that adjusts embedding strength based on local perceptual thresholds. Fourth, although DCT estimation was previously applied, the present method generalizes this approach with flexible coefficient targeting and a formalized QIM embedding mechanism. Finally, redundancy is no longer limited to static repetition; instead, it is systematically distributed across multiple frames using keyframe selection and majority decoding. These differences not only improve robustness under signal and geometric attacks, but also ensure greater imperceptibility and processing efficiency, as demonstrated in Section 4 and Section 5.

2.2. Redundant Encoding Strategies in Watermarking

Robust watermarking often necessitates encoding strategies that introduce redundancy to ensure correct retrieval under channel degradation or intentional attacks. Several techniques are well-documented in the literature.
QIM, as first introduced in [20], provides a mathematically grounded method for embedding bits into host features via quantization. QIM has been shown to achieve near-optimal performance in terms of robustness, imperceptibility, and payload, making it a widely used baseline in image and video watermarking [27]. Its adaptability and resilience have led to numerous variants, including combining QIM with transform-domain features (e.g., DCT, DWT) or motion vectors in video [24].
Repetition codes represent a simple, yet effective strategy wherein watermark bits are duplicated across multiple blocks or frames, with retrieval based on majority voting [28]. While this improves robustness against localized corruption, it severely limits payload and may be ineffective in uniformly distributed attack scenarios.
A more powerful class of encoding methods includes forward error correction (FEC) techniques such as BCH and Reed–Solomon codes. These allow structured redundancy and the correction of multiple errors without retransmission. BCH codes, in particular, have found success in both image and video watermarking systems subjected to compression or transmission distortion [11,13]. For example, Nayak et al. [29] applied BCH to protect mobile video watermarking against heavy JPEG quantization, showing a substantial decrease in BER.
In high-security scenarios, researchers have further explored hybrid coding schemes that combine QIM with FEC or chaotic modulation to boost tamper-resistance. These strategies, while often computationally heavier, have proven effective under strong desynchronization, geometric distortions, and heterogeneous transcoding [23,30].
Still, these schemes often assume a fixed or static embedding region, highlighting the need for perceptually guided block selection to complement the encoding scheme.

2.3. Block Selection for Robust Embedding in Video Watermarking

The effectiveness of a watermarking method strongly depends on where the watermark is embedded. Poor selection of blocks, such as those in uniform or highly visible regions, can lead to poor robustness and degraded visual quality. Early methods relied on uniform grid sampling or random block allocation to select embedding positions [31]. These approaches are simple but disregard the key perceptual and structural properties of the content. In modern systems, this has largely been replaced by perception-driven selection using models of the HVS.
A common strategy involves computing visual saliency maps to identify low-attention regions. Saliency-based methods such as the spectral residual approach or contrast-based filtering have been used to isolate background areas more suitable for embedding [16,32]. More recent approaches utilize deep saliency networks to capture context-aware attention fields [33]. Others incorporate frequency analysis, selecting blocks with strong mid- or high-frequency content (e.g., edges or textures), as these are less affected by quantization and compression. This is highly effective in DCT-based watermarking [10].
In addition, semantic masking has gained momentum. This approach explicitly avoids placing watermarks in regions containing high-level semantic information, such as faces, text, or logo areas. Luo et al. [23], for instance, developed a multiscale framework that combines saliency with object detection to guide watermark placement away from semantically meaningful zones. Some systems go further, using learning-based models to classify regions by expected robustness. Li et al. [34] proposed a CNN-based robustness predictor for image watermarking that could be extended to video.

3. Proposed Method

3.1. General Architecture

The proposed watermarking framework integrates perceptual modeling, adaptive modulation, and temporal redundancy in a unified, computationally efficient pipeline. This design addresses key challenges in video watermarking: robustness against attacks, imperceptibility to the human visual system, and scalability in processing. The pipeline is divided into six stages: perceptual map computation, robust block selection, computation of DCT coefficients directly from the spatial domain, JND-modulated embedding using Quantization Index Modulation (QIM), redundancy encoding with temporal distribution, and final extraction with majority decoding. Each stage is described below.

3.1.1. Step 1: Perceptual Map Computation

The perceptual score map used in this method is computed by combining multiscale texture information, saliency estimation, and semantic region masking. This combination was chosen to ensure that the watermark is embedded in regions that are structurally rich, perceptually tolerant, and semantically non-critical. Specifically, multiscale texture analysis using Laplacian of Gaussian filters captures local detail stability across resolutions and has been shown to improve watermark persistence in textured areas [35,36]. Contrast-based saliency models, such as spectral residual methods, identify low-attention regions where modifications are less likely to be noticed by the human visual system [37]. Finally, semantic masking ensures that perceptually sensitive or meaning-bearing regions (e.g., faces, objects) are excluded from embedding, aligning with findings from object-aware watermarking and attention-preserving methods [38].
This combination creates a perceptual score map that balances imperceptibility with structural robustness, guiding block selection accordingly. Each frame is divided into non-overlapping 8 × 8 blocks $B_{i,j}$. For each block we compute the following:
  • A multiscale detail map $D_{i,j}$ computed using Laplacian of Gaussian filters applied at $M$ different scales [36]. This map captures texture richness and edge strength across multiple spatial scales. The Laplacian of Gaussian (LoG) filter enhances local variations, such as corners and edges, which are less perceptually sensitive. The equation is as follows:
    $$ D_{i,j} = \frac{1}{M} \sum_{k=1}^{M} \left| \mathrm{LoG}(I, \sigma_k) \right| (i,j), $$
    which computes the average response across $M$ different Gaussian scales $\sigma_k$, typically chosen in the range between 1 and 3. High values of $D_{i,j}$ indicate structurally complex areas where watermark energy can be safely embedded.
  • The saliency map S ( i , j ) is computed using spectral residual contrast [37]. This map identifies low-attention regions in the frequency domain based on spectral irregularities. This approach has been widely used due to its simplicity and effectiveness in modeling human visual attention. Since this method is used without modification, full mathematical details are omitted here for brevity.
  • In this work, a semantic mask M i , j is constructed using the Haar cascade frontal face detector proposed by Viola and Jones [38], applied to exclude sensitive regions such as faces or foreground objects from the embedding process. This ensures that watermark insertion is avoided in semantically important areas, preserving visual fidelity. No changes were made to the underlying detector beyond default parameters. While more accurate methods based on deep learning (e.g., YOLO, RetinaFace) are available, Haar cascades provide a good balance between performance and computational cost, making them suitable for real-time and embedded watermarking applications.
The final perceptual score map is computed by combining the three components with empirically determined weights $\alpha$, $\beta$, and $\gamma$:
$$ P_{i,j} = \alpha \cdot D_{i,j} - \beta \cdot S_{i,j} - \gamma \cdot M_{i,j}, $$
where the weight values are $\alpha = 0.4$, $\beta = 0.3$, and $\gamma = 0.3$. Blocks with higher $P_{i,j}$ scores are prioritized for embedding. Figure 1 visually illustrates the four key components used to construct the perceptual score map applied to the video sequence “Akiyo”. The subfigures show the following: (a) a multiscale detail map capturing high-frequency edges via Laplacian filtering; (b) a saliency map derived from spectral residuals, highlighting perceptually sensitive regions to be avoided; (c) a semantic mask that detects and excludes face regions; and (d) the resulting perceptual score map that combines these cues. This visualization helps clarify how the method selects blocks that are structurally rich yet perceptually tolerant, supporting the adaptive block selection described in Section 3.1.2. Step 2.
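For concreteness, the sketch below shows one way the three perceptual cues could be computed and fused per 8 × 8 block using standard OpenCV/NumPy primitives. The function name, the normalization of the cue maps, and the sign convention used to combine them (detail rewards embedding; saliency and semantic regions penalize it) are our illustrative assumptions, not the authors' reference implementation.

```python
import cv2
import numpy as np

def perceptual_score_map(frame_gray, sigmas=(1.0, 2.0, 3.0),
                         alpha=0.4, beta=0.3, gamma=0.3):
    """Illustrative Step 1 sketch: multiscale LoG detail, spectral-residual
    saliency, and a Haar-cascade face mask, combined per 8x8 block.
    frame_gray is an 8-bit grayscale frame."""
    f = frame_gray.astype(np.float32)

    # Multiscale detail map: average |LoG| response over several scales.
    detail = np.zeros_like(f)
    for s in sigmas:
        blurred = cv2.GaussianBlur(f, (0, 0), s)
        detail += np.abs(cv2.Laplacian(blurred, cv2.CV_32F))
    detail /= len(sigmas)

    # Spectral-residual saliency (computed on a downsampled copy).
    spec = np.fft.fft2(cv2.resize(f, (64, 64)))
    log_amp = np.log1p(np.abs(spec))
    residual = log_amp - cv2.blur(log_amp, (3, 3))
    sal_small = np.abs(np.fft.ifft2(np.exp(residual + 1j * np.angle(spec)))) ** 2
    sal = cv2.resize(cv2.GaussianBlur(sal_small.astype(np.float32), (0, 0), 2.5),
                     (f.shape[1], f.shape[0]))

    # Semantic mask from the Viola-Jones frontal face detector.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    mask = np.zeros_like(f)
    for (x, y, w, h) in cascade.detectMultiScale(frame_gray):
        mask[y:y + h, x:x + w] = 1.0

    # Block-wise averages and weighted combination (signs assumed: detail
    # rewards embedding, saliency and semantic regions penalize it).
    def block_mean(img):
        H, W = img.shape[0] // 8, img.shape[1] // 8
        return img[:H * 8, :W * 8].reshape(H, 8, W, 8).mean(axis=(1, 3))

    D, S, M = block_mean(detail), block_mean(sal), block_mean(mask)
    D, S = D / (D.max() + 1e-9), S / (S.max() + 1e-9)
    return alpha * D - beta * S - gamma * M
```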

3.1.2. Step 2: Robust Block Selection

A critical aspect of any watermarking scheme is choosing regions where embedded data will remain intact after typical processing or attacks. This step ensures robustness by selecting blocks with favorable perceptual and structural properties, as determined by the perceptual score map from Section 3.1.1. Step 1.
The goal is to prioritize blocks with high tolerance to distortion and low perceptual visibility. Blocks with complex texture or low saliency tend to retain embedded information after compression or filtering. Embedding data in these blocks minimizes chances of detection by the human visual system.
This aligns with observations in recent surveys on watermarking robustness [3], which show that selecting textured or noisy regions improves survival under MPEG compression, resizing, and Gaussian noise. To perform selection, we apply a hard threshold $\tau$ on the perceptual score map $P_{i,j}$. Blocks whose score exceeds this threshold are considered robust:
$$ B_{sel} = \left\{ B_{i,j} \;\middle|\; P_{i,j} > \tau \right\}, $$
where $B_{sel}$ is the set of selected 8 × 8 blocks for watermark embedding, $P_{i,j}$ is the perceptual score of block $(i,j)$ computed in Section 3.1.1. Step 1, and $\tau$ is the adaptive threshold. To maintain flexibility across different video contents, we implement an adaptive version of the threshold $\tau$ based on a percentile criterion:
$$ \tau = \mu_P + \lambda \cdot \sigma_P, $$
where $\mu_P$ denotes the mean of all $P_{i,j}$, $\sigma_P$ is the standard deviation of all $P_{i,j}$, and $\lambda$ is a scalar hyperparameter. This strategy allows the number of selected blocks to dynamically adjust depending on the video’s perceptual complexity. To ensure that sufficient capacity is maintained for watermark embedding, the number of selected blocks must meet a minimum threshold $N_{min}$. If not enough blocks satisfy the condition, $\lambda$ is iteratively reduced until the following holds:
$$ \left| B_{sel} \right| \geq N_{min} $$
This dual-constraint system (quality and quantity) ensures that the embedding is both robust and feasible.
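A minimal sketch of this adaptive selection rule is shown below, assuming the perceptual scores are stored in a 2-D array with one entry per 8 × 8 block. The initial λ, the minimum capacity, and the relaxation step are illustrative values, not the paper's exact settings.

```python
import numpy as np

def select_robust_blocks(P, lam=1.0, n_min=128, lam_step=0.1):
    """Adaptive block selection sketch: threshold the perceptual score map P
    at tau = mu_P + lambda * sigma_P, relaxing lambda until at least n_min
    blocks are available for embedding."""
    while True:
        tau = P.mean() + lam * P.std()           # tau = mu_P + lambda * sigma_P
        selected = np.argwhere(P > tau)          # (i, j) indices of robust blocks
        if len(selected) >= n_min or lam <= 0:   # stop once capacity is sufficient
            return selected, tau
        lam -= lam_step                          # relax the quality constraint
```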

3.1.3. Step 3: Calculate DCT Coefficients Directly from Spatial Domain

To avoid the high cost of computing the full DCT of every selected block, we use a closed-form equation to compute the specific DCT coefficient directly in the spatial domain [39]. For a block f ( m , n ) of size M × M , the DCT coefficient at position ( u , v ) is as follows:
$$ F_{u,v} = \frac{2}{M}\, \alpha_u\, \alpha_v \sum_{m=0}^{M-1} \sum_{n=0}^{M-1} f(m,n) \cos\!\left[ \frac{\pi u (2m+1)}{2M} \right] \cos\!\left[ \frac{\pi v (2n+1)}{2M} \right], $$
where
$$ \alpha_k = \begin{cases} \dfrac{1}{\sqrt{2}}, & \text{if } k = 0, \\ 1, & \text{otherwise.} \end{cases} $$
This formulation makes it possible to extract or modify any single DCT coefficient with no need to compute the full transformation of the block.
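Because the kernel is separable, the single-coefficient evaluation reduces to a dot product with two cosine basis vectors, as sketched below; the helper name is ours. Under an orthonormal DCT-II convention, the result should agree with the (u, v) entry of a full 2-D transform such as scipy.fft.dctn(block, norm='ortho').

```python
import numpy as np

def dct_coefficient(block, u, v):
    """Evaluate a single 2-D DCT-II coefficient F(u, v) of an MxM block
    directly from the spatial samples, without computing the full transform."""
    M = block.shape[0]
    m = np.arange(M)
    au = 1.0 / np.sqrt(2.0) if u == 0 else 1.0
    av = 1.0 / np.sqrt(2.0) if v == 0 else 1.0
    cos_u = np.cos(np.pi * u * (2 * m + 1) / (2 * M))   # row basis vector (index m)
    cos_v = np.cos(np.pi * v * (2 * m + 1) / (2 * M))   # column basis vector (index n)
    # Separable form: F(u,v) = (2/M) * a_u * a_v * cos_u^T . block . cos_v
    return (2.0 / M) * au * av * cos_u @ block.astype(np.float64) @ cos_v
```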

3.1.4. Step 4: JND Modulation and Embedding with QIM

The core of the watermarking process is the actual embedding of data into selected video blocks. This step implements a perceptually adaptive embedding scheme based on QIM. While QIM provides robustness, it does not inherently account for the sensitivity of the human visual system. For that reason, we modulate QIM’s quantization step Δ using a JND model J ( i , j ) , ensuring that watermark energy remains invisible to the human eye, particularly in smooth or uniform regions [40].
Let $C_{u^*,v^*}$ be a mid-frequency DCT coefficient of block $B_{i,j}$. The embedded coefficient is as follows:
$$ C'_{u^*,v^*} = \Delta \cdot \left( \left\lfloor \frac{C_{u^*,v^*}}{\Delta} \right\rfloor + 0.25 + 0.5\,w \right), $$
where $w \in \{0,1\}$ is the watermark bit, and $\Delta$ is the JND-modulated quantization step:
$$ \Delta = Q \cdot J_{i,j}, $$
where $Q$ is a global scaling factor (experimentally set) and $J_{i,j}$ is the perceptual threshold for block $B_{i,j}$. The JND threshold $J_{i,j}$ combines three perceptual effects:
$$ J_{i,j} = L_{i,j} + Ms_{i,j} + Cm_{i,j}, $$
$$ L_{i,j} = \frac{1}{64} \sum_{(m,n) \in B_{i,j}} I(m,n), $$
$$ Ms_{i,j} = \mathrm{Var}\!\left( B_{i,j} \right) = \frac{1}{64} \sum_{(m,n) \in B_{i,j}} \left( I(m,n) - \mu_{B_{i,j}} \right)^2, $$
$$ Cm_{i,j} = \sqrt{ \left( \nabla_x I \right)^2 + \left( \nabla_y I \right)^2 }, $$
where $L_{i,j}$ is the luminance masking term, which locates perception changes based on local brightness; $Ms_{i,j}$ is the spatial masking term, enabling the detection of areas with increased variance where greater distortion can be concealed; and $Cm_{i,j}$ is the contrast masking term based on local gradient magnitude. This comprehensive JND model is consistent with perceptual theories and empirical models described in [40] and provides a content-adaptive embedding strength that preserves image quality.
Embedding is performed in mid-frequency DCT coefficients, specifically $(u^*, v^*) = (2, 1)$, which offer the best compromise between robustness and imperceptibility [25].
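The sketch below illustrates one way to implement the block-level JND threshold and the JND-modulated QIM rule. The per-block reduction of the contrast term (averaged here) and the relative weighting of the three masking terms are our assumptions; in practice the terms would likely need normalization so that no single component dominates the resulting quantization step.

```python
import numpy as np

def jnd_threshold(block):
    """Illustrative block-level JND: luminance, spatial (variance) and
    contrast (gradient magnitude) masking, summed as in the model above."""
    b = block.astype(np.float64)
    luminance = b.mean()                             # L(i,j): mean brightness
    spatial = b.var()                                # Ms(i,j): local variance
    gy, gx = np.gradient(b)                          # local gradients
    contrast = np.sqrt(gx ** 2 + gy ** 2).mean()     # Cm(i,j), averaged here (assumption)
    return luminance + spatial + contrast

def qim_embed(coeff, bit, Q, jnd):
    """JND-modulated QIM sketch: quantize the coefficient with a locally
    adapted step and shift it to the lattice point encoding the bit.
    The exact lattice offsets of the published method may differ."""
    delta = Q * jnd                                  # adaptive quantization step
    return delta * (np.floor(coeff / delta) + 0.25 + 0.5 * bit)
```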

3.1.5. Step 5: Redundancy Encoding and Temporal Distribution

In video watermarking, the channel is not memoryless. Frames may be lost, dropped, or corrupted due to compression, transmission errors, or editing operations. Therefore, to ensure robustness, watermark data must be distributed redundantly across the temporal domain. This step enhances resilience by employing error-correcting strategies and intelligent frame selection to protect the embedded watermark bits over time.
The watermark bitstream $w = (w_1, w_2, \ldots, w_n)$ is encoded using repetition or a low-rate error-correcting code (e.g., BCH, LDPC). For simplicity and high decoding reliability, we implement repetition:
$$ w_r = \left( \underbrace{w_1, \ldots, w_1}_{r\ \text{times}},\; \ldots,\; \underbrace{w_n, \ldots, w_n}_{r\ \text{times}} \right), $$
where $r$ is the redundancy factor (e.g., 3, 5, or 7). Each watermark bit is embedded across multiple independent frames.
To maximize the effectiveness of temporal redundancy, we distribute the watermark bits only across scene-representative keyframes, not arbitrarily across the timeline. Scene segmentation is based on inter-frame content changes [26]. For each frame $F_k$, we compute the DC map $D_k$:
$$ D_k(i,j) = \frac{1}{64} \sum_{m=0}^{7} \sum_{n=0}^{7} F_k(8i+m,\, 8j+n). $$
Then, the histogram difference $d_k$ between adjacent frames is computed as follows:
$$ d_k = \left\| \mathrm{hist}(D_k) - \mathrm{hist}(D_{k+1}) \right\|_2. $$
A scene boundary is declared if $d_k > \theta$. We then select a keyframe $K_s$ for each scene as its temporal midpoint:
$$ K_s = F_{(t_{start} + t_{end})/2}, $$
where $t_{start}$ and $t_{end}$ mark the beginning and end of a shot or scene. If there are M keyframes in a video, each bit in $w$ is assigned to $r$ of these keyframes, distributed uniformly or pseudo-randomly. This ensures error tolerance (multiple chances for recovery) and resilience to frame drops (if any copy survives, decoding remains possible). Please note that selecting parameter values involves a tradeoff. A higher redundancy factor $r$ implies increased robustness, although diminished capacity. The scene-change threshold $\theta$ dictates the number of video keyframes, while keyframe density reflects a balance between capacity and temporal distribution. This tradeoff is dynamically regulated according to video duration, watermark dimensions, and embedding intensity.
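A possible implementation of the histogram-based keyframe detection and the bit-to-keyframe assignment is sketched below. The histogram binning, and therefore the scale of the differences d_k, are our choices, so the threshold θ would need recalibration for this particular implementation; the pseudo-random assignment is likewise illustrative.

```python
import numpy as np

def detect_keyframes(frames, theta=30.0):
    """Scene segmentation sketch: block-DC maps, L2 histogram differences
    between adjacent frames, and the temporal midpoint of each scene as
    its keyframe. frames is a list of 2-D luma arrays."""
    def dc_map(frame):
        H, W = frame.shape[0] // 8, frame.shape[1] // 8
        return frame[:H * 8, :W * 8].reshape(H, 8, W, 8).mean(axis=(1, 3))

    hists = [np.histogram(dc_map(f.astype(np.float32)), bins=64, range=(0, 255))[0]
             for f in frames]
    boundaries = [0]
    for k in range(len(hists) - 1):
        d_k = np.linalg.norm(hists[k] - hists[k + 1])   # d_k = ||hist(D_k) - hist(D_{k+1})||_2
        if d_k > theta:
            boundaries.append(k + 1)
    boundaries.append(len(frames))

    # Keyframe = temporal midpoint of each detected scene.
    return [(start + end - 1) // 2 for start, end in zip(boundaries[:-1], boundaries[1:])]

def assign_bits_to_keyframes(n_bits, keyframes, r=5, seed=0):
    """Spread each watermark bit over r distinct keyframes (pseudo-random)."""
    rng = np.random.default_rng(seed)
    return {i: rng.choice(keyframes, size=min(r, len(keyframes)), replace=False)
            for i in range(n_bits)}
```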

3.1.6. Step 6: Watermark Extraction and Voting

The final stage of the proposed framework is responsible for accurately retrieving the embedded watermark from a potentially distorted or attacked video sequence. This step must be robust to quantization noise, frame loss or temporal reordering, transcoding, and post-processing distortions. To achieve this, we employ an inverse QIM decoding process coupled with majority voting across temporally distributed redundant embeddings.
Assuming that the watermark was embedded into the DCT coefficient $C'_{u^*,v^*}$, the decoder extracts each bit by checking the parity of its quantization label:
$$ \hat{w}_i = \begin{cases} 0, & \text{if } \bmod\!\left( \left\lfloor \dfrac{2\, C'_{u^*,v^*}}{\Delta} \right\rfloor,\; 2 \right) = 0, \\ 1, & \text{otherwise,} \end{cases} $$
where $\Delta$ is the same quantization step used during embedding; the modulo operation maps the coefficient to its associated binary label. The decoding assumes synchronization with the embedding locations, which is maintained through consistent block selection and frame alignment from earlier stages.
Given that each bit $w_i$ was embedded across $r$ different keyframes (see Step 5), we extract a set of candidate values $\{ \hat{w}_{i,1}, \hat{w}_{i,2}, \ldots, \hat{w}_{i,r} \}$. To eliminate the effect of distortions or loss, we apply majority voting:
$$ w_i = \begin{cases} 1, & \text{if } \displaystyle\sum_{j=1}^{r} \hat{w}_{i,j} > r/2, \\ 0, & \text{otherwise.} \end{cases} $$
This approach guarantees correct decoding even in the presence of up to $\lfloor (r-1)/2 \rfloor$ bit errors per symbol, akin to an $(r, 1)$ repetition code.
In environments with high noise levels or under heavy compression, the decoding performance can be enhanced by adjusting the QIM threshold in accordance with the statistics of the recovered coefficients:
$$ \Delta^{*} = \arg\max_{\Delta} \; \mathrm{Consistency}(\hat{w}), $$
where “Consistency” can be measured via entropy, correlation with known pilot bits, or Hamming stability over time [41]. This step consolidates the watermark recovery process, ensuring that individual bit errors do not propagate, bit-level synchronization is preserved across frames, and recovery is possible even under temporal and spatial distortions. It complements the perceptual embedding, adaptive quantization, and redundancy strategies by closing the loop with a robust and statistically sound extraction mechanism.
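Below is a minimal decoder sketch consistent with the embedding rule sketched earlier: the received coefficient is mapped back to its quantization label, and the r candidate bits are fused by majority vote. The function names are hypothetical, and the label test reflects our reconstruction of the QIM lattice rather than the authors' exact code; it assumes the same Q and a comparable JND estimate are available at the decoder.

```python
import numpy as np

def qim_decode(coeff, Q, jnd):
    """Inverse QIM sketch: recover the bit label from the quantization
    interval of the received coefficient (blind, no original video needed)."""
    delta = Q * jnd                                   # same JND-modulated step as embedding
    return int(np.floor(2.0 * coeff / delta)) % 2     # parity of the half-step index

def majority_vote(candidates):
    """Majority decision over the r redundant copies of one watermark bit."""
    return 1 if sum(candidates) > len(candidates) / 2 else 0

# Example: fuse the r candidate bits extracted for watermark bit i.
# bit_i = majority_vote([qim_decode(c, Q, jnd) for (c, jnd) in recovered_copies])
```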

3.2. Block-Based Watermark Embedding Scheme

This section details the watermark-embedding process carried out at the block level, integrating the perceptual, structural, and temporal considerations outlined in Section 3.1. The method operates on selected 8 × 8 blocks from keyframes, applying a perceptually adaptive quantization strategy to embed watermark bits into specific DCT coefficients in a robust yet imperceptible manner. The watermark-embedding process consists of the following steps:
  • Each keyframe is divided into non-overlapping 8 × 8 blocks.
  • Compute the perceptual score map $P_{i,j}$ and select the robust blocks $B_{sel}$.
  • For each selected block $B_{i,j}$, compute the DCT coefficient $C_{(2,1)}$ directly from the spatial domain using Equation (4).
  • Estimate the JND threshold $J_{i,j}$ for the block $B_{i,j}$.
  • Compute the quantization step as $\Delta = Q \cdot J_{i,j}$, where $Q$ is a fixed scaling factor.
  • Modify $C_{(2,1)}$ using QIM, i.e., applying Equation (6) with $u = 2$, $v = 1$.
  • Replace the original coefficient $C_{(2,1)}$ with its modified version $C'_{(2,1)}$.
This embedding process is applied to each bit across multiple keyframes as determined by the temporal redundancy scheme described in Section 3.1.5; a compact driver that ties these steps together is sketched below. In this way, the method achieves robustness not only at the spatial level but also across the temporal dimension, protecting against attacks such as frame loss, frame swapping, and transcoding.
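The illustrative driver below chains the hypothetical helpers sketched in Section 3.1 (perceptual_score_map, select_robust_blocks, dct_coefficient, jnd_threshold, qim_embed, detect_keyframes, assign_bits_to_keyframes). It assumes the input is a list of 8-bit luma (Y) planes, uses a simplified one-block-per-bit assignment, and writes the coefficient change back through the spatial basis image of C(2,1); it is a sketch of the pipeline, not the authors' implementation.

```python
import numpy as np

def dct_basis(u, v, M=8):
    """Spatial basis image of DCT coefficient (u, v): adding d * basis to a
    block raises its F(u, v) by d under the orthonormal DCT-II convention."""
    m = np.arange(M)
    au = 1 / np.sqrt(2) if u == 0 else 1.0
    av = 1 / np.sqrt(2) if v == 0 else 1.0
    cu = np.cos(np.pi * u * (2 * m + 1) / (2 * M))
    cv_ = np.cos(np.pi * v * (2 * m + 1) / (2 * M))
    return (2.0 / M) * au * av * np.outer(cu, cv_)

def embed_watermark(luma_frames, bits, Q=0.75, r=5):
    """Illustrative embedding driver over a list of uint8 luma frames."""
    keyframes = detect_keyframes(luma_frames)
    plan = assign_bits_to_keyframes(len(bits), keyframes, r=r)
    basis = dct_basis(2, 1)
    watermarked = [f.astype(np.float64) for f in luma_frames]
    for bit_idx, frame_indices in plan.items():
        for k in frame_indices:
            y = watermarked[k]
            blocks, _ = select_robust_blocks(perceptual_score_map(luma_frames[k]))
            i, j = blocks[bit_idx % len(blocks)]          # simplified block assignment
            blk = y[8 * i:8 * i + 8, 8 * j:8 * j + 8]
            c = dct_coefficient(blk, 2, 1)
            d = qim_embed(c, bits[bit_idx], Q, jnd_threshold(blk)) - c
            y[8 * i:8 * i + 8, 8 * j:8 * j + 8] = blk + d * basis
    return [np.clip(y, 0, 255).astype(np.uint8) for y in watermarked]
```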

3.3. Watermark Extraction Procedure

The watermark extraction process reconstructs the embedded message by replicating the core stages of the embedding pipeline, using only the watermarked video as input. The extraction relies on the spatial-domain computation of the DCT coefficient C ( 2,1 ) , and on perceptual coherence to locate suitable blocks. Temporal redundancy is resolved by aggregating multiple extractions through majority voting. The complete watermark extraction involves the following steps:
  • Divide each keyframe into non-overlapping 8 × 8 blocks.
  • Compute the perceptual score map $P_{i,j}$ for each block using the same method as in Section 3.1.1, and select the top-ranked blocks as candidates for decoding.
  • For each selected block $B_{i,j}$, compute the DCT coefficient $C_{(2,1)}$ directly from the spatial domain using Equation (4).
  • Recompute the quantization step using $\Delta = Q \cdot J_{i,j}$.
  • Decode the embedded bit from the quantized coefficient using the inverse QIM rule with Equation (13).
  • Repeat the decoding for the same bit across the $r$ redundant keyframes.
  • Apply majority voting to recover the final bit using Equation (15).
This block-wise and coefficient-specific recovery scheme ensures reliable extraction even under heavy distortion. The majority voting mechanism provides robustness to isolated bit errors caused by compression, noise, or frame manipulation, making the scheme suitable for practical blind watermarking applications.
Figure 2 shows a schematic summary of the complete embedding and extraction process. The upper section corresponds to watermark embedding, detailing the flow from perceptual map generation, block selection, coefficient computation, adaptive quantization, and temporal distribution. The lower section outlines the extraction stage, highlighting how watermark bits are retrieved from selected frames and decoded using majority voting. This flowchart reinforces the modular design of the method and clarifies the interaction between perceptual and temporal components.

4. Experimental Results

4.1. Configuration

To validate the robustness, imperceptibility, and computational efficiency of the proposed watermarking scheme, we conducted a comprehensive set of experiments using a diverse set of publicly available video sequences [42], covering three commonly used resolutions and visual contexts. The CIF set (352 × 288) includes Akiyo, Carphone, Foreman, Grandma, Hall Monitor, Suzie, and Tempete; the 4CIF set (704 × 576) includes Australia, City, Crew, Harbour, Ice, Football, and Soccer; and the HD set (1280 × 720) includes Snow Mountain, Ducks Take Off, Kristen and Sara, Speed Bag, Tractor, and Vidyo1. These sequences span indoor and outdoor settings, low and high motion dynamics, and varying textures such as faces, objects, and natural backgrounds. All videos were between 8 and 10 s in duration, encoded at either 25 or 30 frames per second, and processed in their uncompressed YUV 4:2:0 format.
To test robustness under realistic degradation scenarios, we simulated a range of common signal and geometric attacks. Compression was applied using H.264 (CRF = 28) and VP8 (Q = 30) encoding. Gaussian noise was introduced with μ = 0 and σ² = 0.05. Geometric transformations included small-angle rotations (±5°), rescaling (90%), and cropping of 10% from each border (top, bottom, left, and right). Temporal desynchronization was performed by shifting watermark extraction indices randomly by ±3 frames. These conditions were chosen to reflect the most common distortions encountered in multimedia transmission, compression pipelines, and adversarial tampering.
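The noise and geometric attacks can be reproduced with a few OpenCV/NumPy operations, as in the sketch below; the parameter values follow the paper, while the implementation details (e.g., filling the cropped border with black) are our assumptions. Codec attacks are applied externally with FFmpeg.

```python
import cv2
import numpy as np

def apply_attacks(frame):
    """Illustrative reproduction of the signal/geometric attacks used in the
    evaluation. H.264/VP8 compression is applied externally, e.g.:
    ffmpeg -i in.y4m -c:v libx264 -crf 28 out.mp4"""
    h, w = frame.shape[:2]
    attacked = {}

    # Gaussian noise, mu = 0, sigma^2 = 0.05 (intensities scaled to [0, 1]).
    noise = np.random.normal(0.0, np.sqrt(0.05), frame.shape)
    attacked["noise"] = (np.clip(frame / 255.0 + noise, 0, 1) * 255).astype(np.uint8)

    # Rotation by +/-5 degrees around the frame center.
    R = cv2.getRotationMatrix2D((w / 2, h / 2), 5.0, 1.0)
    attacked["rotation"] = cv2.warpAffine(frame, R, (w, h))

    # Rescaling to 90% and back to the original resolution.
    small = cv2.resize(frame, (int(w * 0.9), int(h * 0.9)))
    attacked["rescale"] = cv2.resize(small, (w, h))

    # Cropping 10% from each border (removed area filled with black here).
    cropped = np.zeros_like(frame)
    cropped[h // 10: h - h // 10, w // 10: w - w // 10] = \
        frame[h // 10: h - h // 10, w // 10: w - w // 10]
    attacked["crop"] = cropped
    return attacked
```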
The watermark used in all experiments consisted of 128-bit pseudo-random binary sequences. Each bit $w_i$ was embedded across r = 5 keyframes identified through histogram-based scene segmentation. Embedding was performed on the $C_{(2,1)}$ DCT coefficient of 8 × 8 blocks, computed directly in the spatial domain as described in Section 3.1.3. The quantization step was modulated using the local JND threshold and a global scaling factor. The JND values incorporated luminance, spatial, and contrast masking terms. Table 1 presents the parameters together with their corresponding values utilized in all experiments.
The selection of hyperparameters for the proposed method was guided by empirical testing and insights from the prior literature. The weights α , β , and γ in the perceptual score map were set to 0.4, 0.3, and 0.3, respectively, following a grid search on a subset of CIF and 4CIF videos to balance between saliency suppression and detail enhancement. The quantization step scaling factor Q was selected to match the perceptual thresholding behavior of prior JND-based video watermarking studies [18], offering a trade-off between imperceptibility and bit recoverability. The redundancy factor r was chosen based on results in Section 4.2.2., which showed that performance gains saturated at this level. The scene-change threshold θ = 30 (used in histogram-based keyframe detection) was selected empirically to reliably identify transitions across diverse content types. All parameters were kept fixed across all test sequences to ensure consistency and comparability.
To assess the performance of the proposed watermarking scheme, we employed a set of widely used evaluation metrics that capture different aspects of quality and robustness. Peak Signal-to-Noise Ratio (PSNR) was used to quantify signal-level distortion, while Structural Similarity Index (SSIM) [43] evaluated the preservation of structural fidelity in the watermarked video. To estimate perceptual visual quality, we applied Video Multimethod Assessment Fusion (VMAF) [44], which uses machine learning models trained on subjective quality assessments. The BER was computed to measure the robustness of watermark extraction under various conditions. Finally, runtime-in-seconds was recorded to analyze the computational efficiency of both the embedding and extraction processes.
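BER and PSNR can be computed directly with NumPy, as sketched below; SSIM and VMAF were obtained with standard tools (FFmpeg's libvmaf for VMAF, per the configuration above). The helper functions here are illustrative.

```python
import numpy as np

def bit_error_rate(original_bits, extracted_bits):
    """BER: fraction of mismatched watermark bits."""
    o, e = np.asarray(original_bits), np.asarray(extracted_bits)
    return float(np.mean(o != e))

def psnr(reference, test, peak=255.0):
    """PSNR in dB between a reference frame and its watermarked/attacked version."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```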
All experiments were implemented in Python 3.9.13 using OpenCV 4.5.5, NumPy 1.21.5, SciPy 1.7.3, and FFmpeg 4.4.1 (for VMAF), and executed on a workstation with an AMD Ryzen 9 5950X CPU (Advanced Micro Devices, Inc., Santa Clara, CA, USA), 64 GB of RAM, and no GPU acceleration.

4.2. Baseline Performance Evaluation

To rigorously assess the effectiveness of the proposed watermarking method, we conducted a sequence of experiments designed to isolate and quantify the contribution of each of its key components: adaptive quantization modulation, temporal redundancy, and structured multiframe insertion. All tests were conducted in the absence of attacks to capture baseline performance in ideal but realistic conditions. We report results in terms of BER, PSNR, SSIM, VMAF, and recovery accuracy.

4.2.1. Impact of Adaptive Modulation

We compared two scenarios: a fixed quantization step (Δ = 6) and our proposed adaptive JND-based quantization. In both settings, a 128-bit watermark was embedded over 5 keyframes per bit. The results in Table 2 show that adaptive modulation significantly enhanced visual quality while also reducing BER.
These results confirm that JND-modulated quantization better preserves both structural and perceptual quality, especially in HD videos.

4.2.2. Impact of Redundant Coding

One of the main innovations in the proposed watermarking framework is the use of temporal redundancy to increase robustness against accidental bit loss and reduce the impact of localized frame degradation. In practical scenarios—such as low-bitrate compression, transmission jitter, or dropped frames—it is highly likely that a portion of the embedded watermark bits may be corrupted or lost. By repeating each watermark bit across multiple temporally separated keyframes and applying a majority voting scheme during extraction, the system can effectively filter out isolated errors and improve the overall reliability of recovery.
To evaluate this mechanism, we conducted an experiment where a 128-bit watermark was embedded using five different levels of redundancy, $r \in \{1, 2, 3, 4, 5\}$, where $r$ denotes the number of keyframes per bit. The watermarking process was carried out on all videos from the test set using adaptive quantization, and no external attacks were applied. The results were measured in terms of average BER, number of frames used, and recovery accuracy.
As shown in Table 3, the BER decreases as the redundancy factor increases. Even a moderate level of redundancy ( r = 3 ) brings the BER below 2.0%, while full redundancy ( r = 5 ) pushes recovery accuracy to nearly 99.7%. These results demonstrate that temporal redundancy is a powerful mechanism for ensuring reliable watermark extraction, especially when used in conjunction with perceptually optimized embedding.
To justify the selection of the default redundancy factor r = 5 in subsequent experiments, we carefully analyzed the trade-off between robustness and resource usage. As Table 3 indicates, increasing the redundancy factor consistently reduces BER and enhances recovery accuracy. The gain from r = 4 (1.1% BER, 98.9% accuracy) to r = 5 (0.2% BER, 99.8% accuracy) is still significant and yields nearly error-free extraction, making it a natural choice. Although using r = 5 implies 640 frames (compared to 128 when r = 1), the cost is acceptable given the method’s computational efficiency and the overall improvement in decoding reliability. We, therefore, adopted r = 5 as the standard configuration for all experiments involving distortion and benchmarking.
While redundancy plays a key role in the robustness of the proposed method, it is important to clarify that the gains reported in Table 3 are primarily the result of temporal redundancy. In our setup, redundancy is implemented by distributing the same watermark bit across multiple keyframes in different temporal positions. Since all other components—such as block selection, embedding strength, and QIM modulation—remain fixed across these tests, the observed improvements in BER (from 5.2% to 0.2%) and recovery accuracy (from 94.8% to 99.8%) can be attributed directly to the redundancy mechanism. Moreover, because the watermark is embedded in different perceptually strong regions across frames, the impact of localized attacks (e.g., cropping or frame loss) is significantly mitigated. This confirms that temporal distribution is not only a complementary feature, but a critical contributor to the method’s overall resilience.

4.2.3. Impact of Multiframe Insertion Strategy

This experiment evaluated how the temporal distribution of watermark bits affects both robustness and perceptual quality. We compared the proposed scene-based insertion strategy—where each bit is embedded at the temporal midpoint of five detected scenes—with a random frame selection baseline. In both cases, a 128-bit watermark was embedded using adaptive quantization and r = 5 redundancy.
Scene segmentation was performed using histogram-based shot boundary detection, ensuring that bits were embedded in visually stable regions. In contrast, the random strategy placed bits in arbitrary frames, regardless of scene content.
As shown in Table 4, the structured insertion method yielded significantly lower BER and synchronization error, along with higher VMAF scores. The benefits were especially notable in dynamic content, where random insertion is more likely to coincide with scene cuts or motion peaks.
These results confirm that temporal awareness during embedding improves both robustness and perceptual stability, particularly in videos with complex motion or frequent scene changes.

4.3. Evaluation Under Attacks

In order to assess the real-world applicability of the proposed watermarking scheme, we conducted a systematic evaluation of its robustness under a diverse set of common video impairments. This type of analysis is critical in watermarking research, as any practical solution must not only be imperceptible and efficient, but also resilient against typical transformations that video content undergoes during distribution, editing, compression, or playback. While the previous experiments demonstrated the method’s imperceptibility and robustness in ideal conditions, the current section focuses on how the proposal performs under actual distortion, without access to the original unmodified content.
Four categories of attacks were selected for this evaluation, each simulating a distinct class of degradation. First, Gaussian noise (μ = 0, σ² = 0.05) was added to entire frames, emulating sensor noise, re-encoding errors, or transmission artifacts. Second, video compression was applied using two widely used lossy codecs: H.264 (configured at CRF 28) and VP8 (quality level 30). These attacks simulate bandwidth-constrained streaming or export operations. Third, a temporal desynchronization attack was performed by shifting watermark extraction indices randomly by ±3 frames, simulating situations where scenes are edited, frames are dropped, or timing variations occur during decoding or rendering. Finally, geometric distortions (rotation, rescaling, and cropping) are evaluated separately later in this section.
The results, summarized in Table 5, reveal consistent robustness of the proposed watermarking system across all attack scenarios and video resolutions. Under Gaussian noise, the average BER remained low (0.6% for CIF, 1.1% for 4CIF, and 1.4% for HD), with SSIM degradation below 0.086 and PSNR loss under 7 dB. These results confirm that the adaptive quantization mechanism distributes watermark energy in structurally rich regions where random noise has limited perceptual impact.
In the case of video compression, H.264 was more aggressive than VP8, particularly for HD content where the BER reached 1.8% and SSIM dropped by 0.174. However, even in this setting, recovery accuracy remained above 98.6% and VMAF degradation stayed within tolerable bounds. Please note that in typical H.264 compression using FFmpeg and the libx264 encoder, a CRF value of 28 corresponds to a substantial compression level, yielding an average bitrate between 800 and 1500 kbps for CIF and 4CIF video content, and 1.5 to 3 Mbps for HD content, depending on motion complexity and scene variability. In VP8 compression, the parameter Q refers to the target constant quantizer value, with a range between 4 (high quality) and 63 (maximum compression). A setting of Q = 30 corresponds to moderate-to-high compression, typically resulting in bitrates between 600 and 1200 kbps for CIF and 4CIF videos, and 1–2 Mbps for HD content. Actual bitrate depends on frame complexity and motion, as VP8 is a variable bitrate codec.
The watermark also exhibited strong resilience to temporal desynchronization, with BER below 1.2% across all formats, largely due to the structured scene-based insertion and majority voting.
To further assess the robustness of the proposed method, we extended our evaluation to include geometric attacks, which are known to challenge spatial-domain watermarking systems due to their pixel-coordinate transformations. Specifically, we considered three common distortions: rotation of ±5°, which introduces misalignment at the block level; rescaling to 90% of the original frame size (followed by interpolation back to the original dimensions); and cropping of 10% from each border (top, bottom, left, and right), which removes watermark-bearing regions.
These attacks were applied to keyframes from seven representative videos across the three resolution classes (CIF, 4CIF, HD). The extracted watermark was evaluated using BER, and visual degradation via SSIM. The results are summarized in Table 6.
Despite operating in the spatial domain, the method exhibits moderate resilience to geometric transformations, largely due to its use of block-based embedding and multiframe redundancy. The watermark remained at least 95% recoverable in all cases, and visual quality degradation remained within acceptable limits (SSIM > 0.85 for most attacks). This suggests that the proposed method can tolerate mild spatial misalignments, particularly when combined with voting-based recovery mechanisms.
The results validate that the proposed method effectively generalizes across various video formats. While HD content is more sensitive to compression and rotation (due to greater spatial detail), the watermark remains detectable and visually unobtrusive. In general, the scheme’s suitability for blind watermarking in multimedia streaming and distribution platforms is validated by its robustness against typical video attacks, which is achieved through the combination of local perceptual embedding, spatially optimized block selection, and temporally distributed redundancy. For a better appreciation of the visual quality distortion caused by the attacks, Figure 3 presents the Suzie video sequence alongside its watermarked version. The visual quality degradation caused by Gaussian noise, H.264 and VP8 video coding, and cropping attacks is also shown in Figure 3.

4.4. Sensitivity Analysis

To evaluate the stability of the proposed method under different parameter configurations, we conducted a sensitivity analysis by individually varying key parameters while keeping others fixed. This helps validate the robustness of the design choices and facilitates reproducibility for future implementations.
Table 7 shows the results of modifying three parameters: the semantic masking weight γ in Equation (2), the quantization scaling factor Q, and the block selection threshold τ. Average BER and PSNR are reported across all CIF videos. The analysis confirms that the method tolerates small deviations in these parameters without major performance degradation. The best trade-offs between BER and visual quality are obtained near the default values γ   = 0.3, Q = 0.75, and τ = 30, which were thus retained in all experiments.

5. Analysis and Discussion

This section provides a detailed analysis of the performance and internal behavior of the proposed watermarking method, based on the experimental results presented earlier. Beyond raw performance metrics, it explores how various design components interact to influence robustness, imperceptibility, and computational efficiency. The discussion is structured into four subsections: analysis of error patterns and recovery behavior, the synergy between redundancy and adaptive modulation, computational cost breakdown of the proposed method, and a comparative evaluation against other state-of-the-art watermarking techniques.

5.1. Error and Recovery Analysis

The experimental evaluation presented in Section 4 demonstrates that the proposed watermarking method consistently achieves low BER and high recovery accuracy across various test conditions. Under non-attacked conditions, the average BER remained below 0.3% when using a redundancy factor of r = 5 , and recovery accuracy reached 99.8%. These results are indicative of the method’s effectiveness in ensuring reliable watermark retrieval through majority voting, even when embedded information is distributed across multiple frames.
From a signal-level perspective, the error distribution was not uniform across frames or spatial regions. Visual inspection of the decoding process revealed that most residual bit errors under mild distortion occurred in blocks with intermediate perceptual scores—those on the threshold between inclusion and exclusion during the embedding process. In particular, frames with rapid scene changes or low-texture regions exhibited slightly higher error rates, highlighting the sensitivity of the perceptual block selection mechanism to local variability. However, the redundancy mechanism proved effective in mitigating these issues by averaging detection outcomes across spatial and temporal instances.
The proposed method maintained high recovery accuracy when analyzing robustness against different attacks, with BER values ranging from 1.03% (Gaussian noise) to 2.3% (cropping). The corresponding recovery accuracies—always above 97%—confirm that even under moderate distortion, the watermark can be reconstructed with high fidelity. Importantly, no complete bit loss was observed under any tested conditions, validating the sufficiency of the distributed redundancy strategy.
Another notable observation relates to resolution-dependent behavior. The method showed slightly better BER performance for CIF videos compared to HD videos under compression and cropping attacks. This trend is expected, as higher-resolution frames contain more embedding candidates, which increases both the potential for distortion during encoding and the number of possible mismatches during extraction. Despite this, the recovery accuracy remained within a narrow and acceptable range across all formats, underscoring the scalability of the method.
In summary, the combined use of perceptual embedding, redundancy coding, and temporal majority voting provides a robust error-tolerant mechanism capable of withstanding realistic distortions. The observed errors are minor, localized, and statistically compensated by the system’s redundancy-aware design. These findings affirm the scheme’s practical viability for applications requiring blind and robust watermark extraction without access to the original content.

5.2. Interaction Between Redundancy and Modulation

The interaction between adaptive perceptual modulation and temporal redundancy plays a central role in the effectiveness of the proposed watermarking method. While each technique contributes independently (modulation enhancing imperceptibility, redundancy improving robustness), their combination leads to a complementary and mutually reinforcing effect.
Adaptive modulation, guided by local JND thresholds, ensures that watermark energy is embedded in visually insensitive regions. This allows the system to preserve high perceptual quality, but may leave certain areas under-embedded or unmarked, particularly in low-texture or flat regions. Without additional protection, this could lead to isolated bit loss under distortion.
Temporal redundancy addresses this limitation by replicating each bit across multiple keyframes. The majority voting scheme during extraction benefits from this dispersion, especially because instances with stronger modulation are more likely to survive compression or noise and dominate the voting process. In essence, redundancy increases robustness not just by repetition, but by leveraging the perceptual strength of individual embeddings.
Experimental results confirm this synergy: as the redundancy factor increases, the BER decreases consistently—even when adaptive modulation is active—demonstrating that robustness and imperceptibility need not be traded off but can be optimized jointly. This interplay enables the flexible tuning of the system, balancing visual quality with decoding reliability based on the needs of the application.

5.3. Computational Cost Evaluation

In addition to robustness and imperceptibility, practical watermarking systems must also meet acceptable levels of computational efficiency, especially when applied to large-scale multimedia processing. To evaluate the feasibility of the proposed method, we analyzed the execution time and computational cost of each major component in the embedding and extraction pipeline, with results summarized in Table 8.
The perceptual map computation—which includes multiscale Laplacian filtering, saliency estimation, and semantic masking—is the most computationally expensive component, accounting for approximately 45% of total embedding time. On CIF videos, this stage required around 1.6 s, while on HD content it reached 4.3 s. Despite its cost, this step is highly parallelizable, making it suitable for GPU acceleration.
The keyframe detection and redundancy encoding stage contributed 15% of the overall cost, with average processing times of 0.5 s for CIF and 1.2 s for HD videos. This step is mostly lightweight and relies on histogram-based scene segmentation, which can be partially parallelized.
The DCT coefficient estimation and QIM embedding stage, benefiting from spatial-domain simplifications, showed low computational overhead—0.4 s for CIF and 0.9 s for HD—representing only 10% of total cost. Similarly, the watermark extraction and voting stage can be completed in 0.7 to 1.6 s, depending on resolution, and is well-suited to parallel execution.
Overall, the entire pipeline can be executed for low-resolution content and scales reasonably for HD videos. As detailed in Section 4.1., all experiments were conducted on a workstation equipped with an AMD Ryzen 9 5950X CPU and 64 GB RAM, with no GPU acceleration. The implementation used Python with OpenCV and NumPy, and was executed in a single-threaded configuration. Under these conditions, the average processing time for HD-resolution videos was ~7.8 s per frame, which corresponds to approximately 0.13 frames per second (fps). Although this is not sufficient for real-time streaming, the method was designed for offline applications such as copyright protection, tamper detection, and forensic validation—where robustness and precision are prioritized over execution speed.
We also emphasize that this version is a research prototype and not hardware-optimized. With further engineering improvements—such as GPU-based semantic masking, frame-level parallelization, or approximation of perceptual maps—the processing rate can be significantly improved. Preliminary tests with parallel decoding and lightweight masking models suggest that throughput above 5 fps may be achievable on modern hardware.

5.4. Comparison with State-of-the-Art Methods

To contextualize the performance of the proposed video watermarking method within the current state-of-the-art, we conducted a comparative analysis against five recent and representative watermarking schemes reported in the literature. These include both classical signal processing-based techniques and deep learning-based (DL) approaches, enabling a balanced assessment of robustness, quality, and computational cost.
In particular, we selected two DL-based methods that have gained attention in recent years: DvMark [23], which uses a multiscale frequency-domain neural network trained to embed and extract watermarks robustly under compression; and Hide-and-Track [24], a blind deep-watermarking system trained with adversarial desynchronization objectives. Both of these methods require dedicated training on video datasets and rely on GPU-accelerated inference for watermark extraction.
These were compared alongside traditional transform-domain and hybrid schemes, including Huan et al. [19] (DT-CWT-based blind embedding), Yang et al. [17] (hybrid watermarking with transcoding resistance), and Cedillo-Hernandez et al. [25] (a spatial-domain method with DCT estimation). The proposed method distinguishes itself by not relying on any training phase, operating entirely in the spatial domain, and requiring no access to the original video at extraction time.
As shown in Table 9, the proposed method offers superior imperceptibility and robustness while maintaining a lower computational cost than deep learning-based and transform-domain schemes. Unlike frequency-domain neural approaches, which often require GPU acceleration and long training times, our system achieves comparable or better BER and quality metrics using a lightweight perceptual framework and simple QIM encoding. Compared to our previous method [25], this new version introduces temporal redundancy and enhanced JND modulation, substantially improving BER performance while preserving visual quality.
These findings highlight that integrating perceptual saliency, spatial-domain coefficient estimation, and scene-aware multiframe embedding can outperform more complex models without sacrificing execution time or generalizability.
It is important to note that while our evaluation included a diverse set of video sequences with varying resolutions and motion dynamics (as detailed in Section 4.1), the datasets did not target mobile or real-time streaming scenarios. Future work will address such use cases.
Furthermore, the comparative results shown in Table 9 were either reproduced using publicly available code from the original authors or extracted from their published experimental sections. However, we acknowledge that not all methods were executed under identical hardware or software environments. To ensure fairness, our comparison is focused solely on functional metrics such as robustness (BER), imperceptibility (PSNR, SSIM), and watermark size.

5.5. Statistical Robustness

To assess the consistency and reproducibility of the proposed method, we performed multiple independent runs of the embedding and extraction process using different pseudo-random watermark sequences. Each test was repeated 10 times on the CIF dataset, using the same embedding parameters and video sequences. The average and standard deviation (σ) were computed for key performance metrics, including BER, PSNR, SSIM, and VMAF.
Table 10 shows the results obtained for three representative CIF videos. The low standard deviation values across the runs confirm the robustness and stability of the method, indicating that performance is not significantly influenced by the choice of watermark sequence. These findings strengthen the statistical validity of the proposed scheme.
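For completeness, the statistics in Table 10 are the sample mean and standard deviation of the per-run metrics. The short sketch below shows the BER definition used throughout the paper and the mean/σ computation; the numeric values are placeholders for illustration, not the reported measurements.

```python
import numpy as np

def bit_error_rate(embedded: np.ndarray, recovered: np.ndarray) -> float:
    """BER in percent between the embedded and the recovered payload bits."""
    return 100.0 * float(np.mean(embedded != recovered))

# Example: a 128-bit payload with 2 flipped bits -> BER = 1.5625 %.
payload = np.random.default_rng(0).integers(0, 2, size=128)
recovered = payload.copy()
recovered[:2] ^= 1
print(bit_error_rate(payload, recovered))

# Summary statistics over repeated runs (placeholder BER values, one per run).
ber_runs = np.array([0.10, 0.12, 0.08, 0.11, 0.15, 0.09, 0.13, 0.10, 0.12, 0.11])
print(f"mean BER = {ber_runs.mean():.3f}%  sigma = {ber_runs.std(ddof=1):.3f}%")
```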

6. Conclusions

This paper presented a novel video watermarking framework that combines perceptual modeling, spatial-domain coefficient estimation, adaptive quantization, and multiframe redundancy to achieve robust and imperceptible watermark embedding. The method introduces a computationally efficient pipeline that leverages multiscale saliency, semantic masking, and JND-based modulation to localize watermark insertion in visually tolerant regions, while distributing watermark bits temporally to improve resilience against common video impairments.
Through extensive simulations across 21 benchmark videos at CIF, 4CIF, and HD resolutions, the proposed method demonstrated strong performance across all evaluation metrics. It achieved an average BER of 1.03%, a PSNR of 50.1 dB, SSIM of 0.996, and VMAF of 97.3 under realistic conditions, significantly outperforming several state-of-the-art techniques in both robustness and perceptual quality. The method also proved to be computationally efficient, with embedding and extraction times compatible with low-complexity constraints and adaptable to parallel hardware execution.
A key strength of the framework lies in the interaction between its components. Adaptive modulation allows the system to preserve visual quality, while redundancy and multiframe embedding ensure that the watermark remains recoverable even under compression, noise, cropping, and temporal desynchronization. This balance between imperceptibility, robustness, and efficiency makes the method suitable for use in streaming platforms, secure content distribution, and digital rights management.
In future work, the method could be extended by incorporating learning-based perceptual models or dynamic redundancy allocation based on scene complexity. Additionally, optimizing the pipeline for embedded GPUs or mobile devices could facilitate deployment in edge computing scenarios. Finally, exploring hybrid frequency-spatial domain modulation may further enhance robustness against advanced video tampering and re-encoding strategies.

Author Contributions

Conceptualization, A.C.-H. and L.V.-G.; methodology, I.D.-J. and D.C.-G.; software, I.D.-J.; validation, A.C.-H. and M.C.-H.; formal analysis, D.C.-G.; investigation, L.V.-G.; writing—original draft preparation, A.C.-H. and L.V.-G.; writing—review and editing, A.C.-H. and M.C.-H.; funding acquisition, A.C.-H. and M.C.-H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Secretaria de Ciencia, Humanidades, Tecnologia e Innovacion (SECIHTI) under scholarship numbers 161591 and 205834, and by the Secretaria de Investigacion y Posgrado del Instituto Politécnico Nacional (SIP-IPN) through project SIP No. 20250064.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The authors wish to thank the Tecnologico de Monterrey, the Instituto Politécnico Nacional, and the Secretaria de Ciencia, Humanidades, Tecnologia e Innovacion (SECIHTI) for their support during this research.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

1. Liu, G.; Xiang, R.; Liu, J.; Pan, R.; Zhang, Z. An invisible and robust watermarking scheme using convolutional neural networks. Expert Syst. Appl. 2022, 210, 118529.
2. Agarwal, N.; Singh, A.K.; Singh, P.K. Survey of robust and imperceptible watermarking. Multimed. Tools Appl. 2019, 78, 8603–8633.
3. Wan, W.; Wang, J.; Zhang, Y.; Li, J.; Yu, H.; Sun, J. A comprehensive survey on robust image watermarking. Neurocomputing 2022, 488, 226–247.
4. Zhong, X.; Das, A.; Alrasheedi, F.; Tanvir, A. A Brief, In-Depth Survey of Deep Learning-Based Image Watermarking. Appl. Sci. 2023, 13, 11852.
5. Charfeddine, M.; Mezghani, E.; Masmoudi, S.; Amar, C.B.; Alhumyani, H. Audio watermarking for security and non-security applications. IEEE Access 2022, 10, 12654–12677.
6. Aberna, P.; Agilandeeswari, L. Digital image and video watermarking: Methodologies, attacks, applications, and future directions. Multimed. Tools Appl. 2023, 82, 5531–5591.
7. Yu, X.; Wang, C.; Zhou, X. A survey on robust video watermarking algorithms for copyright protection. Appl. Sci. 2018, 8, 1891.
8. Asikuzzaman, M.; Pickering, M.R. An overview of digital video watermarking. IEEE Trans. Circuits Syst. Video Technol. 2017, 28, 2131–2153.
9. Abdelwahab, K.M.; Abd El-Atty, S.M.; El-Shafai, W.; El-Rabaie, S.; Abd El-Samie, F.E. Efficient SVD-based audio watermarking technique in FRT domain. Multimed. Tools Appl. 2020, 79, 5617–5648.
10. Zainol, Z.; Teh, J.S.; Alawida, M.; Alabdulatif, A. Hybrid SVD-based image watermarking schemes: A review. IEEE Access 2021, 9, 32931–32968.
11. Masmoudi, S.; Charfeddine, M.; Ben Amar, C. A semi-fragile digital audio watermarking scheme for MP3-encoded signals. Circ. Syst. Signal Process. 2020, 39, 3019–3034.
12. Chen, L.; Wang, C.; Zhou, X.; Qin, Z. Robust and Compatible Video Watermarking via Spatio-Temporal Enhancement and Multiscale Pyramid Attention. IEEE Trans. Circuits Syst. Video Technol. 2024, 35, 1548–1561.
13. Farri, E.; Ayubi, P. A robust digital video watermarking based on CT-SVD domain and chaotic DNA sequences for copyright protection. J. Ambient Intell. Humaniz. Comput. 2023, 14, 13113–13137.
14. Wan, W.; Zhou, K.; Zhang, K.; Zhan, Y.; Li, J. JND-guided perceptually color image watermarking in spatial domain. IEEE Access 2020, 8, 164504–164520.
15. Lin, W.; Ghinea, G. Progress and opportunities in modelling just-noticeable difference (JND) for multimedia. IEEE Trans. Multimed. 2021, 24, 3706–3721.
16. Cedillo-Hernandez, A.; Cedillo-Hernandez, M.; Miyatake, M.N.; Meana, H.P. A spatiotemporal saliency-modulated JND profile applied to video watermarking. J. Vis. Commun. Image Represent. 2018, 52, 106–117.
17. Yang, L.; Wang, H.; Zhang, Y.; He, M.; Li, J. An adaptive video watermarking robust to social platform transcoding and hybrid attacks. Signal Process. 2024, 224, 109588.
18. Cedillo-Hernandez, A.; Cedillo-Hernandez, M.; García-Vázquez, M.; Nakano-Miyatake, M.; Perez-Meana, H.; Ramirez-Acosta, A. Transcoding resilient video watermarking scheme based on spatio-temporal HVS and DCT. Signal Process. 2014, 97, 40–54.
19. Huan, W.; Li, S.; Qian, Z.; Zhang, X. Exploring stable coefficients on joint sub-bands for robust video watermarking in DT CWT domain. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 1955–1965.
20. Chen, B.; Wornell, G.W. Quantization index modulation: A class of provably good methods for digital watermarking and information embedding. IEEE Trans. Inform. Theory 2001, 47, 1423–1443.
21. Lin, E.T.; Delp, E.J. Temporal synchronization in video watermarking. IEEE Trans. Signal Process. 1999, 51, 1053–1069.
22. Koz, A.; Alatan, A.A. Oblivious spatio-temporal watermarking of digital video by exploiting the human visual system. IEEE Trans. Circuits Syst. Video Technol. 2008, 18, 326–337.
23. Luo, X.; Li, Y.; Chang, H.; Liu, C.; Milanfar, P.; Yang, F. DvMark: A deep multiscale framework for video watermarking. IEEE Trans. Image Process. 2023, 32, 4769–4782.
24. Zhang, Z.; Wang, H.; Wang, G.; Wu, X. Hide and track: Towards blind video watermarking network in frequency domain. Neurocomputing 2024, 579, 127435.
25. Cedillo-Hernandez, A.; Velazquez-Garcia, L.; Cedillo-Hernandez, M.; Conchouso-Gonzalez, D. Fast and robust JND-guided video watermarking scheme in spatial domain. J. King Saud Univ.–Comput. Inf. Sci. 2024, 36, 102199.
26. Hernandez, A.C.; Hernandez, M.C.; Ugalde, F.G.; Miyatake, M.N.; Meana, H.P. A fast and effective method for static video summarization on compressed domain. IEEE Lat. Am. Trans. 2016, 14, 4554–4559.
27. Yamamoto, T.; Kawamura, M. Method of spread spectrum watermarking using quantization index modulation for cropped images. IEICE Trans. Inf. Syst. 2015, 98, 1306–1315.
28. He, M.; Wang, H.; Zhang, F.; Abdullahi, S.M.; Yang, L. Robust blind video watermarking against geometric deformations and online video sharing platform processing. IEEE Trans. Dependable Secur. Comput. 2022, 20, 4702–4718.
29. Nayak, A.A.; Venugopala, P.S.; Sarojadevi, H.; Ashwini, B.; Chiplunkar, N.N. A novel watermarking technique for video on android mobile devices based on JPG quantization value and DCT. Multimed. Tools Appl. 2024, 83, 47889–47917.
30. Prasetyo, H.; Hsia, C.H.; Liu, C.H. Vulnerability attacks of SVD-based video watermarking scheme in an IoT environment. IEEE Access 2020, 8, 69919–69936.
31. Pinson, M.H.; Wolf, S. A new standardized method for objectively measuring video quality. IEEE Trans. Broadcast. 2004, 50, 312–322.
32. Ma, Y.F.; Zhang, H.J. Contrast-based image attention analysis by using fuzzy growing. In Proceedings of the Eleventh ACM International Conference on Multimedia, Berkeley, CA, USA, 2–8 November 2003; pp. 374–381.
33. Wang, C.; Wang, Y.; Lian, J. A Super Pixel-Wise Just Noticeable Distortion Model. IEEE Access 2020, 8, 204816–204824.
34. Li, D.; Deng, L.; Gupta, B.B.; Wang, H.; Choi, C. A novel CNN based security guaranteed image watermarking generation scenario for smart city applications. Inform. Sci. 2019, 479, 432–447.
35. Lin, W.; Kuo, C.C.J. Perceptual visual quality metrics: A survey. J. Vis. Commun. Image Represent. 2011, 22, 297–312.
36. Wang, H.; Yu, L.; Yin, H.; Li, T.; Wang, S. An improved DCT-based JND estimation model considering multiple masking effects. J. Vis. Commun. Image Represent. 2020, 71, 102850.
37. Hou, X.; Zhang, L. Saliency detection: A spectral residual approach. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–8.
38. Viola, P.; Jones, M.J. Robust real-time face detection. Int. J. Comput. Vision 2004, 57, 137–154.
39. Byun, S.W.; Son, H.S.; Lee, S.P. Fast and robust watermarking method based on DCT specific location. IEEE Access 2019, 7, 100706–100718.
40. Wu, J.; Li, L.; Dong, W.; Shi, G.; Lin, W.; Kuo, C.C.J. Enhanced just noticeable difference model for images with pattern complexity. IEEE Trans. Image Process. 2017, 26, 2682–2693.
41. Hernandez, J.R.; Perez-Gonzalez, F. Statistical analysis of watermarking schemes for copyright protection of images. Proc. IEEE 2000, 87, 1142–1166.
42. Video Sequences for Testing. Available online: https://sites.google.com/site/researchvideosequences (accessed on 9 July 2025).
43. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
44. Li, X.; Guo, Q.; Lu, X. Spatiotemporal statistics for video quality assessment. IEEE Trans. Image Process. 2016, 25, 3329–3342.
Figure 1. An example of the perceptual map computation for the “Akiyo” video sequence: (a) multiscale detail map, (b) saliency map, (c) semantic mask, and (d) perceptual score map.
Figure 2. Summary of the complete embedding and extraction watermarking process.
Figure 3. (a) Keyframe of the “Suzie” video sequence, (b) its watermarked version, (c) Gaussian noise, (d) H.264 compression (CRF 28), (e) VP8 compression (Q = 30), and (f) cropping (10%).
Table 1. Experimental parameters used in the watermark-embedding process.
Parameter | Value
Block size | 8 × 8
Embedding DCT coefficient | C(2,1)
Quantization scale Q | 0.75
Redundancy factor r | 5 keyframes per bit
Watermark payload | 128 bits
Scene-change threshold θ | 30
Perceptual map weights | α = 0.4, β = 0.3, γ = 0.3
Table 2. Comparison of objective metrics with and without adaptive modulation 1.
Metric | Fixed | Adaptive | Improvement
PSNR (dB) | 39.42 | 50.12 | +10.70
SSIM | 0.918 | 0.996 | +0.076
VMAF | 83.1 | 97.3 | +14.2
BER (%) | 2.3 | 1.4 | −0.9
1 Average across all videos.
Table 3. Effect of redundancy factor on BER and watermark recovery rate 1.
Redundancy | Avg. BER (%) | Frames Used | Recovery Accuracy
1 | 5.2 | 128 | 94.8%
2 | 3.1 | 256 | 96.9%
3 | 1.8 | 384 | 98.2%
4 | 1.1 | 512 | 98.9%
5 | 0.2 | 640 | 99.8%
1 Average across all videos.
Table 4. Comparison of BER and VMAF for multiframe insertion strategies.
Strategy | BER (%) | VMAF | Avg. Sync Error
Random Frame Insertion | 2.7 | 89.7 | 1.8
Scene-Based Insertion | 0.3 | 95.2 | 0.5
Table 5. Watermark robustness under common video attacks.
Attack Type | BER (%) CIF | BER (%) 4CIF | BER (%) HD | BER (%) Avg. | SSIM Drop | PSNR Drop (dB) | VMAF Drop
Gaussian noise | 0.6 | 1.1 | 1.4 | 1.03 | 0.085 | 6.8 | 10.5
H.264 Compression (CRF 28) | 0.9 | 1.5 | 1.8 | 1.40 | 0.120 | 10.3 | 16.2
VP8 Compression (Q = 30) | 0.7 | 1.4 | 1.7 | 1.26 | 0.098 | 9.2 | 14.9
Temporal Desync (±3 frames) | 0.7 | 1.2 | 1.5 | 1.13 | 0.094 | 7.9 | 11.5
Table 6. Watermark robustness against geometric attacks.
Attack | CIF (BER %/SSIM) | 4CIF (BER %/SSIM) | HD (BER %/SSIM) | Overall (BER %/SSIM)
Rotation ±5° | 3.9/0.87 | 4.5/0.85 | 4.8/0.83 | 4.4/0.85
Rescaling to 90% | 2.5/0.90 | 3.2/0.88 | 3.5/0.86 | 3.1/0.88
Cropping 10% | 1.9/0.92 | 2.1/0.91 | 2.4/0.89 | 2.1/0.91
Table 7. Sensitivity analysis of key parameters used in the watermarking method.
Parameter | Values Tested | BER (%) | PSNR (dB)
Semantic weight (γ) | 0.2 / 0.3 / 0.4 | 0.21 / 0.09 / 0.07 | 52.7 / 51.2 / 49.4
Quantization scaling factor (Q) | 0.5 / 0.75 / 1.0 | 0.27 / 0.10 / 0.09 | 53.1 / 52.1 / 49.3
Block threshold (τ) | 25 / 30 / 35 | 0.10 / 0.08 / 0.08 | 52.3 / 50.9 / 48.2
Table 8. Computational cost breakdown.
Stage | Avg. Time (CIF) | Avg. Time (HD) | % Total Time | Parallelizable
Perceptual Map | 1.6 s | 4.3 s | 45% | Yes
Keyframe Detection + Redundancy | 0.5 s | 1.2 s | 15% | Partial
QIM Embedding (Spatial-DCT) | 0.4 s | 0.9 s | 10% | Yes
Extraction (sync + voting) | 0.7 s | 1.6 s | 30% | Yes
Table 9. Comparison with existing video watermarking methods under common attacks 1.
Method | Domain | BER (%) | PSNR (dB) | SSIM | VMAF | Runtime (HD) | Comments
Proposed method | Spatial + JND | 1.03 | 50.1 | 0.996 | 97.3 | ~7.8 s/frame | Low complexity, perceptual model, blind
DvMark (2023) [23] | Frequency + DL | 1.6 | 46.4 | 0.982 | 93.5 | ~2.5 s/frame (GPU) | DL-based, high GPU demand
Hide-and-Track (2024) [24] | Frequency + DL | 1.4 | 44.9 | 0.974 | 91.1 | ~2.3 s/frame (GPU) | DL-based, blind, less perceptually tuned
Yang et al. (2024) [17] | Hybrid (DCT) | 2.1 | 43.5 | 0.967 | 89.7 | ~12 s/frame | Resilient to transcoding, slow
Huan et al. (2022) [19] | DT-CWT | 1.7 | 45.1 | 0.972 | 90.6 | ~9.6 s/frame | Strong under cropping, complex
Cedillo-H. et al. (2024) [25] | Spatial + JND | 2.5 | 48.0 | 0.990 | 95.5 | ~5.6 s/frame | Fast, but no temporal redundancy
1 CIF, 4CIF, and HD video averages under Gaussian noise, H.264 compression, and cropping.
Table 10. Statistical analysis across 10 independent runs using different watermark sequences.
Video | Metric | Average | Standard Deviation (σ)
Akiyo | BER (%) | 0.11 | 0.04
Akiyo | PSNR (dB) | 51.2 | 0.18
Akiyo | SSIM | 0.991 | 0.006
Akiyo | VMAF | 96.7 | 0.90
Foreman | BER (%) | 0.15 | 0.05
Foreman | PSNR (dB) | 50.8 | 0.22
Foreman | SSIM | 0.992 | 0.009
Foreman | VMAF | 97.8 | 1.20
Suzie | BER (%) | 0.09 | 0.03
Suzie | PSNR (dB) | 49.6 | 0.16
Suzie | SSIM | 0.989 | 0.004
Suzie | VMAF | 97.1 | 0.8