Generating Hard-Label Black-Box Adversarial Examples for Video Recognition Models

Jing, Yulin; Wu, Lijun; Su, Kaile; Wu, Wei; Li, Zhiyuan; Deng, Qi

doi:10.3390/math14061016

by Yulin Jing ¹,*, Lijun Wu ¹, Kaile Su ², Wei Wu ³, Zhiyuan Li ⁴ and Qi Deng ⁵

¹ School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610052, China

² School of Information and Communication Technology, Griffith University, Brisbane, QLD 4222, Australia

³ School of Computer Science and Engineering, Central South University, Changsha 410083, China

⁴ Department of Electrical Engineering and Automation, Aalto University, P.O. Box 15400, 00076 Helsinki, Finland

⁵ School of Engineering and Computer Science, Victoria University of Wellington, P.O. Box 600, Wellington 6140, New Zealand

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(6), 1016; https://doi.org/10.3390/math14061016

Submission received: 1 January 2026 / Revised: 6 March 2026 / Accepted: 6 March 2026 / Published: 17 March 2026

(This article belongs to the Special Issue AI Security and Edge Computing in Distributed Edge Systems)

Abstract

In recent years, video recognition models have advanced rapidly alongside the development of Deep Neural Networks (DNNs). However, these models remain vulnerable to adversarial examples, which are created by adding imperceptible perturbations to clean samples. Recent studies indicate that generating adversarial examples in the hard-label black-box setting is particularly challenging yet highly practical. Compared to image recognition models, there are few hard-label black-box adversarial example generation algorithms for video recognition models. To this end, we propose a hard-label black-box video adversarial example generation algorithm, referred to as Dynamic Black-box Algorithm (DBA). First, DBA uses the binary search algorithm to find the boundary video between two original videos; then, a sampling-based algorithm is used to estimate the gradient on the boundary video; finally, with a dynamic step size adjustment strategy, DBA moves the boundary video in the direction of the estimated gradient to generate the adversarial video. Additionally, we design a strategy to skip invalid samples generated during the adversarial example generation process. Experiments demonstrate that DBA attains a superior trade-off between the magnitude of perturbations and query efficiency. Specifically, DBA outperforms state-of-the-art algorithms, achieving an average reduction in Mean Squared Error (MSE) of over 50%.

Keywords:

adversarial examples; perturbations; neural networks; gradient estimation

MSC:

68T07

1. Introduction

As shown in Figure 1, Deep Neural Networks (DNNs) consist of the input layer, the hidden layer, and the output layer [1]. The input layer collects the input data, which is then processed in the hidden layer where most of the calculations occur. The output layer generates the final outputs of DNNs, which can be predictions, classifications, or probabilities. To prevent overfitting, an effective technique called dropout is used in DNNs [2], which randomly drops some nodes of DNNs during training. DNNs have advanced rapidly in recent years [3], yet they remain challenged by adversarial examples, which were first discovered in image recognition and are crafted by injecting imperceptible perturbations into clean samples [3,4]. When processed by DNNs, these adversarial examples can trigger erroneous classification results that diverge significantly from the original predictions. As depicted in Figure 1, the classification label of the clean sample is “smile”; after imperceptible perturbations are introduced, the clean sample becomes the adversarial example. The adversarial example is indistinguishable from the clean sample, yet its classification label is “smoke”, which is completely different from the clean sample’s classification label. Essentially, adversarial examples expose potential security vulnerabilities of DNNs [5], which restricts the application prospects of DNNs in security-critical fields [6]. Therefore, an in-depth study of adversarial example generation algorithms can help researchers find the vulnerabilities of DNNs proactively so that they can take measures (such as adversarial training) to enhance the security of DNNs [7], which is important for improving the robustness of DNNs [8]. Hence, an increasing number of researchers are devoted to the field of adversarial example generation algorithms [9,10,11].

To generate high-quality adversarial examples, the process of adding perturbations to clean samples usually needs to test the DNN-based model (also called the victim model or the target model) iteratively so that the perturbation sampling strategy can be adjusted dynamically according to the target model’s output [12,13]. As illustrated in Figure 2, depending on how much information can be obtained from the target model, adversarial example generation can be categorized into three types: the white-box setting, the score-based black-box setting, and the hard-label black-box setting [14,15]. In the white-box setting, the researcher has full access to the target model’s information, including the network structure, parameters, probabilities, and labels [16]. In the score-based black-box setting, the researcher can only query the target model for probabilities and labels [17]. In the hard-label black-box setting, the researcher can only query the target model for labels (also called hard labels) [18]. Among these categories, the hard-label black-box setting is widely regarded as the most practical yet challenging [18]. In real-world scenarios, commercial models (e.g., MEGVII Face++ and Microsoft Azure) typically only return labels, withholding internal information such as the network architecture, parameters, and probabilities.

Since adversarial examples originate from image recognition, there are more studies in this field [19,20,21]. However, with the rapid development of short-video applications like TikTok and Kuaishou, videos have become the primary information transmission medium in the internet era, and more and more DNN-based video recognition models have been proposed [22,23,24]. Yet, compared with image recognition models, there are few adversarial example generation algorithms for video recognition models [25,26]. Most current video adversarial example (also called adversarial video) generation algorithms focus on the white-box setting and the score-based black-box setting [27,28], and the hard-label black-box setting is rarely explored. In the white-box setting, SVASTIN [28] is a novel sparse video adversarial example generation algorithm, which consists of the GTVL [28] module and the STIN [28] module. SVASTIN generates adversarial videos by exchanging the spatio-temporal feature space information. In the score-based black-box setting, VBAD [29] is the first score-based black-box video adversarial example generation algorithm. VBAD takes advantage of the transferability of perturbations from image models to video models and initializes the video perturbations with the image perturbations. To correct the perturbation deviation between image models and video models, VBAD divides the frames into many small parts and then uses the NES [29] algorithm to estimate the gradient of these small parts. EARL [30] uses a reinforcement learning agent to select the keyframes of videos and only adds perturbations to these keyframes, which reduces the magnitude of perturbations and improves the efficiency of adversarial example generation. In the hard-label black-box setting, STDE [31] is the state-of-the-art hard-label black-box video adversarial example generation algorithm. STDE introduces target videos as initial perturbations and only adds perturbations on keyframes that are adaptively selected by the temporal difference algorithm. However, one drawback of STDE is that its adversarial examples contain perturbations large enough to be easily recognized by human vision.

In summary, current adversarial example generation algorithms for video recognition models still face the challenge of perturbation optimization. Therefore, we introduce Dynamic Black-box Algorithm (DBA), a novel hard-label black-box video adversarial example generation algorithm designed to address this gap. In some security-critical video recognition scenarios, video adversarial examples may pose significant security risks. For example, surveillance models could be illegally bypassed, or autonomous driving models might cause traffic accidents due to misclassifications [29]. The purpose of our study is to generate high-quality adversarial examples for video recognition models in the hard-label black-box setting, thereby helping researchers find the vulnerabilities of these models proactively so that they can take some measures (such as adversarial training with adversarial videos generated by DBA) to enhance the security of these models. This is crucial for improving the robustness of video recognition models. DBA focuses on the hard-label black-box setting, which is widely regarded as the most practical yet challenging [18]. In this setting, commercial models typically only return labels, withholding internal information such as the network architecture, parameters, and probabilities. Our main challenges are as follows:

(1) Efficient gradient estimation in the hard-label black-box setting. While gradient estimation is effective for optimizing perturbations, the high dimensionality of videos leads to a severe “curse of dimensionality” [30]. Existing algorithms primarily target the white-box setting or the score-based black-box setting [29,30]; efficiently estimating gradients using only hard labels remains an open challenge.

(2) Efficient step size adjustment for video data. When generating adversarial examples, it is necessary to move boundary samples along the estimated gradient direction with specific step sizes. In the hard-label black-box setting, several step size adjustment strategies have been developed for image models [32,33]. But these strategies are only applicable to low-dimensional image data and perform poorly when handling high-dimensional video data.

(3) Robustness against invalid samples. The vast search space in videos frequently leads to “invalid samples” during the iterative process, which can stall or terminate gradient estimation. Designing a robust strategy to bypass these invalid samples is essential for stable optimization.

To address the above three challenges, we designed the following algorithms and strategies for DBA, which are also our main contributions:

(1) We used a sampling-based algorithm to estimate the gradient on the boundary video in the hard-label black-box setting. By introducing two lemmas and one theorem, we ultimately proved the effectiveness of this gradient estimation algorithm.

(2) We designed a new step size adjustment strategy to generate adversarial videos, which can dynamically adjust the step size as the model queries change during the adversarial example generation process. Experiments show that our strategy can help to generate adversarial videos with fewer model queries.

(3) We designed a strategy to skip invalid samples generated during the adversarial example generation process. By this strategy, DBA can find valid samples quickly with very few model queries. Extensive experiments demonstrate that when generating adversarial videos of the same quality, the model queries of DBA are 40% lower than those of other comparison algorithms on average.

Table 1 provides an explanation of the main symbols used in our paper.

In the field of image recognition, several hard-label black-box adversarial example generation algorithms have been proposed [32,33]. DBA differs from these algorithms. First, these algorithms are only applicable to low-dimensional image data and perform poorly when handling high-dimensional video data. Related research [34,35] shows that an efficient step size adjustment strategy can improve algorithm performance. Inspired by these studies, we designed a new step size adjustment strategy. This strategy enables the generation algorithm to dynamically adjust the step size based on the stage of the adversarial example generation process, thereby improving the efficiency of generating adversarial videos. Additionally, the low dimensions of images eliminate the problem of invalid samples. But due to the high dimensionality of video data, it is inevitable to generate invalid samples, and these algorithms designed for image models cannot address this problem. To this end, we designed a strategy to skip invalid samples, which can find valid samples quickly with very few model queries.

2. Proposed Framework

We denote the video recognition model as $f(x): X \to Y$, with $x \in X \subset \mathbb{R}^{N \times W \times H \times C}$ and $y \in Y = \{1, 2, \dots, K\}$, where $x$ and $y$ represent the video and the model’s output, respectively. $N$, $W$, $H$, and $C$ denote the number of frames, frame width, frame height, and the number of channels, respectively; $K$ represents the number of classes. The goal of generating the adversarial example for the video recognition model is to produce an adversarial video $x_{adv}$ that can fool $f$ into making misclassifications, that is, $f(x_{adv}) = y_{adv}$, where $y_{adv} \neq y$. In order to ensure that $x$ and $x_{adv}$ are indistinguishable to human vision, the researcher will keep $x_{adv}$ within a ball centered at the clean video $x$ ($\|x_{adv} - x\|_{p} \leq \epsilon_{adv}$).

In the hard-label black-box setting, the researcher can only get the hard label from $f$. Other information about $f$ is not accessible, such as parameters, the network architecture, probabilities, etc. First, the researcher chooses a source video $x_{src}$ and a target video $x_{tgt}$, whose labels are $y_{adv}$ and $y$, respectively. The researcher’s purpose is to get an adversarial video that is “visually close” to $x_{tgt}$ but misclassified as $y_{adv}$ (also called the adversarial label). So the researcher will move $x_{src}$ towards $x_{tgt}$ so that they are indistinguishable to human vision while keeping $x_{src}$ classified as $y_{adv}$ by $f$. During this process, the videos that are classified as $y_{adv}$ and located at the decision boundary between the two classes (e.g., $y_{adv}$ and $y$) are called boundary videos.
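The hard-label access model described above can be sketched as a thin wrapper that hides everything except the predicted label and counts queries. This is only an illustrative sketch: `HardLabelOracle`, `toy_scores`, and the flattened list representation of a video are hypothetical stand-ins, not part of the paper.

```python
from typing import Callable, Sequence

Video = Sequence[float]  # assumption: a video flattened to N*W*H*C pixel values


class HardLabelOracle:
    """Wraps a classifier so that callers only ever see the predicted label."""

    def __init__(self, scores: Callable[[Video], Sequence[float]]):
        self._scores = scores  # probabilities stay hidden from the caller
        self.queries = 0       # number of model queries spent so far

    def __call__(self, x: Video) -> int:
        self.queries += 1
        s = self._scores(x)
        # Return only the index of the highest-scoring class (the hard label).
        return max(range(len(s)), key=lambda k: s[k])


def toy_scores(x: Video) -> Sequence[float]:
    # Hypothetical 2-class model: class 1 wins when the mean pixel exceeds 0.5.
    m = sum(x) / len(x)
    return [1.0 - m, m]


f = HardLabelOracle(toy_scores)
x_tgt = [0.2] * 8  # target video, label y = 0
x_src = [0.9] * 8  # source video, adversarial label y_adv = 1
labels = (f(x_tgt), f(x_src))
```

Counting queries inside the wrapper makes it straightforward to enforce a query budget later on.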

The general framework of the DBA is shown in Figure 3. The videos shown in Figure 3 are collected from the HMDB-51 [36] dataset. We sampled some frames from the original videos at even frame intervals and preprocessed these sampled frames using the resizing function and the cropping function. After preprocessing the original videos according to the above steps, we finally get the video samples that can be directly input into the target model. Our purpose is to get an adversarial video that is “visually close” with “punch” (target video) but misclassified as “smile” (source video). First of all, we begin with two videos (the source video and the target video) together with a sampling subspace that obeys the Gaussian distribution, and then iteratively conduct the following three steps:

(1) Finding a boundary video at the decision boundary between the two classes (e.g., “smile” and “punch”), which is used to provide a boundary video to estimate the gradient.

(2) Calculating the next movement direction by estimating the gradient, which is used to estimate the gradient on the boundary video.

(3) Moving along the estimated gradient direction, which is used to move the boundary video along the estimated gradient to generate the adversarial video.

Next, we will describe the three steps in detail.

2.1. Finding a Boundary Video at the Decision Boundary

For DNNs, there is a decision boundary between any two samples belonging to different labels [32]. As shown in Figure 4, $x_{src}$ and $x_{tgt}$ are two videos whose labels are $y_{adv}$ and $y$, respectively. There is a decision boundary between $x_{src}$ and $x_{tgt}$. The boundary video $x_{bod}$ lying on the decision boundary has the features of both $x_{src}$ and $x_{tgt}$. If we add some small random perturbations to $x_{bod}$, its classification label will randomly oscillate between $y_{adv}$ and $y$, which indicates that $x_{bod}$ is highly sensitive to the random perturbations. In contrast, if we add the same perturbations to $x_{src}$ or $x_{tgt}$, their classification labels are likely to remain unchanged because they are farther from the decision boundary. When estimating the gradient, the more sensitive the samples are to the perturbations, the more accurate the estimated gradient is [33]. Therefore, if $x_{bod}$ is used to estimate the gradient, we will get more accurate gradients. We use the binary search algorithm [32] to find the new boundary video between the target video and the previous adversarial video:

$$x_{bod}^{t} = \alpha \cdot x_{adv}^{t-1} + (1 - \alpha) \cdot x_{tgt}, \tag{1}$$

where $x_{bod}^{t}$ is the new boundary video; $x_{adv}^{t-1}$ is the previous adversarial video generated in the $(t-1)$-th iteration; $x_{tgt}$ is the target video; and $\alpha$ is the parameter of the binary search algorithm. At the beginning, we set $x_{adv}^{0} = x_{src}$. With Equation (1), we can adjust $\alpha$ so that $x_{bod}^{t}$ simultaneously has the features of both $x_{adv}^{t-1}$ and $x_{tgt}$, which is the essence of the binary search algorithm.
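A minimal sketch of the binary search over $\alpha$ in Equation (1) might look as follows. It assumes that the previous adversarial video is classified as $y_{adv}$ and the target video is not. The function name `find_bod_video` mirrors FindBodVideo from Algorithm 1, while `toy_model`, the list-based video representation, and the stopping rule on the interval width `h` are illustrative assumptions.

```python
def blend(x_adv, x_tgt, alpha):
    """Eq. (1): convex combination of the adversarial and target videos."""
    return [alpha * a + (1.0 - alpha) * t for a, t in zip(x_adv, x_tgt)]


def find_bod_video(x_adv, x_tgt, h, f, y_adv):
    """Binary search for the smallest alpha whose blend is still adversarial.

    Invariant: the blend at `hi` is classified as y_adv, the blend at `lo`
    is not. The search stops once the interval is narrower than h.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > h:
        mid = (lo + hi) / 2.0
        if f(blend(x_adv, x_tgt, mid)) == y_adv:
            hi = mid  # still adversarial: move closer to the target
        else:
            lo = mid  # crossed the boundary: back off
    return blend(x_adv, x_tgt, hi), hi


def toy_model(x):
    # Hypothetical hard-label model: label 1 when the mean pixel exceeds 0.5.
    return 1 if sum(x) / len(x) > 0.5 else 0


x_bod, alpha = find_bod_video([0.9] * 8, [0.2] * 8, h=1e-3, f=toy_model, y_adv=1)
```

Returning the blend at `hi` rather than at `lo` keeps the boundary video on the adversarial side, matching the definition of boundary videos given earlier.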

2.2. Calculating the Next Movement Direction by Estimating the Gradient

First, we define the probability difference function $CS$ and the sign function $\phi$ as follows:

$$CS(x) = H(x)_{y_{adv}} - \max_{y \neq y_{adv}} H(x)_{y}, \tag{2}$$

$$\phi(x) = \mathrm{sign}(CS(x)) = \begin{cases} -1, & \text{if } CS(x) < 0, \\ 1, & \text{otherwise}, \end{cases} \tag{3, 4}$$

where $H(x)_{y}$ represents the probability of the label $y$. In the hard-label black-box setting, the researcher can only obtain the value of $\phi$ and cannot get the value of $CS$.

We next estimate the gradient on the boundary video. Inspired by relevant research [32,33], we employ the sampling-based algorithm [33] to estimate the gradient:

$$\widetilde{\nabla CS}(x_{bod}^{t}) = \frac{1}{N} \sum_{k=1}^{N} \phi(x_{bod}^{t} + \delta u)\, u, \tag{5}$$

where $x_{bod}^{t}$ is the boundary video produced in the $t$-th iteration; $N$ is the number of the sampled perturbations; $\delta$ is a small constant; $u$ is the perturbation sampled from the sampling subspace that obeys the Gaussian distribution, drawn anew for each of the $N$ terms; and $\widetilde{\nabla CS}$ is the estimated gradient.

The function $CS$ represents the difference between two probability values, and the value of $CS$ lies in the range [−1, 1]. Related research [32,33] indicates that $CS$ can be considered Lipschitz continuous in the field of hard-label black-box adversarial example generation. If the value of $CS$ can be obtained, its gradient can be estimated by calculating the sensitivity of $CS$ to perturbations. However, the value of $CS$ cannot be directly obtained in the hard-label black-box setting; therefore, as shown in Equation (5), we estimate the gradient of $CS$ by evaluating the sensitivity of $\phi$ to perturbations.

Inspired by related research [18,32,33], we next prove the validity of gradient estimation. We first give Lemmas 1 and 2, which are used to prove Theorem 1:

Lemma 1.

Suppose that $\nabla CS$ is the true gradient and $CS(x_{bod}^{t}) = 0$, then $CS(x_{bod}^{t} + \delta u)$ can be rewritten as

$$CS(x_{bod}^{t} + \delta u) = \delta \nabla CS(x_{bod}^{t})^{\top} u + \frac{1}{2} \delta^{2} u^{\top} \nabla^{2} CS(x') u, \tag{6}$$

where $x'$ is a video between $x_{bod}^{t}$ and $x_{bod}^{t} + \delta u$.

Lemma 2.

Suppose that we have a function as follows:

$$f(r) = \frac{1 - r}{\sqrt{1 - 2r + 2r^{2}}}. \tag{7}$$

Then the following inequality holds:

$$f(r) \geq 1 - \frac{1}{2} r^{2}. \tag{8}$$

Based on Lemmas 1 and 2, we formally present Theorem 1.

Theorem 1.

Suppose that $\nabla CS$ is the true gradient and $CS(x_{bod}^{t}) = 0$, we have

$$\cos \angle \left( E[\phi(\delta u + x_{bod}^{t})\, u],\ \nabla CS(x_{bod}^{t}) \right) \geq 1 - \frac{9 L^{2} \delta^{2} (d - 1)^{2}}{8 \| \nabla CS(x_{bod}^{t}) \|_{2}^{2}}. \tag{9}$$

When $\delta \to 0$, we also have

$$\lim_{\delta \to 0} \cos \angle \left( E[\widetilde{\nabla CS}(x_{bod}^{t})],\ \nabla CS(x_{bod}^{t}) \right) = 1, \tag{10}$$

where $\widetilde{\nabla CS}$ is the estimated gradient; $d$ is the number of pixels of videos; and $L$ is the Lipschitz continuity coefficient of $CS$.

Theorem 1 indicates that, as $\delta \to 0$, the direction of the estimated gradient aligns in expectation with the direction of the true gradient.

In summary, the gradient estimation method described by Equation (5) is essentially a mean-based algorithm that evaluates the significance of perturbations. First, sampled perturbations are assigned weights based on their impact on the target model: perturbations that cause the target model to produce the adversarial label are assigned a weight of 1, while others receive a weight of −1. Next, each perturbation is multiplied by its weight to form a temporary result. Finally, the mean of all these temporary results is taken as the estimated gradient. Because perturbations are introduced during the gradient estimation process, this method inevitably yields errors. Theorem 1 indicates that these errors can be controlled within a certain range, and that the smaller the perturbation magnitude $\delta$, the more accurate the direction of the estimated gradient.

See Appendix A, Appendix B and Appendix C for the proof of Lemmas 1 and 2, and Theorem 1.
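The estimator in Equation (5) can be illustrated with a short sketch. It uses a hypothetical toy model whose decision function is linear, so its normal vector `W` plays the role of the true gradient direction. The names `toy_model`, `W`, and the chosen values of `delta` and `n_samples` are assumptions made for the illustration only.

```python
import math
import random


def estimate_gradient(f, x_bod, y_adv, delta, n_samples, rng):
    """Eq. (5): mean of Gaussian perturbations weighted by the hard-label sign."""
    d = len(x_bod)
    grad = [0.0] * d
    for _ in range(n_samples):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]       # fresh perturbation
        probe = [xi + delta * ui for xi, ui in zip(x_bod, u)]
        phi = 1.0 if f(probe) == y_adv else -1.0          # only the label is used
        for i in range(d):
            grad[i] += phi * u[i]
    return [g / n_samples for g in grad]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical linear toy model: label 1 when W . x > 0, so the origin lies
# exactly on the decision boundary and W is the true gradient direction.
W = [(-1.0) ** i * (1 + i % 3) for i in range(16)]


def toy_model(x):
    return 1 if sum(w * xi for w, xi in zip(W, x)) > 0 else 0


x_bod = [0.0] * 16
grad = estimate_gradient(toy_model, x_bod, y_adv=1, delta=0.01,
                         n_samples=2000, rng=random.Random(0))
alignment = cosine(grad, W)  # close to 1 when the estimate is well aligned
```

In this toy setting the alignment improves as `n_samples` grows, because the components orthogonal to the true gradient average out.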

2.3. Moving Along the Estimated Gradient Direction

The estimated gradient provides the moving direction for the boundary video. Inspired by research on image recognition models [34,37,38,39], we update the boundary video towards the direction of the estimated gradient to generate the new adversarial video:

$$x_{adv}^{t+1} = x_{bod}^{t} + \eta \frac{\widetilde{\nabla CS}(x_{bod}^{t})}{\| \widetilde{\nabla CS}(x_{bod}^{t}) \|_{2}}, \tag{11}$$

where $x_{adv}^{t+1}$ is the newly generated adversarial video, and $\eta$ is the step size. In practice, we observe that the original step size adjustment strategy for image recognition models is inefficient in video recognition scenarios, because this strategy fails to take into consideration the characteristics at different stages of the adversarial example generation process. Generally speaking, at the initial stages, the distance between the target video and the adversarial video is large, and the gradient estimation algorithms can easily yield relatively accurate gradients, so we should increase the step size. As the adversarial example generation process goes on, the distance between the target video and the adversarial video becomes small, and the estimated gradient also becomes inaccurate; therefore, the step size should be reduced. The original step size adjustment strategy fails to take into account this variation in gradient accuracy. To address this limitation, we propose a novel step size adjustment strategy based on the function $\psi$:

$$\psi(q) = \max\left(v_{min},\ v_{ini} - \frac{q}{d}\right), \tag{12}$$

$$\eta_{new} = \psi(q)\, \eta, \tag{13}$$

$$x_{adv}^{t+1} = x_{bod}^{t} + \eta_{new} \frac{\widetilde{\nabla CS}(x_{bod}^{t})}{\| \widetilde{\nabla CS}(x_{bod}^{t}) \|_{2}}, \tag{14}$$

where $q$ denotes the current number of model queries; $v_{ini}$ is the initial parameter; $v_{min}$ is the lower bound that $v_{ini}$ can decay to; and $d$ is the decay factor of $v_{ini}$. With $q$ increasing, $\psi(q)$ gradually becomes small, so if we set $v_{min} > 1$, our improved step size adjustment strategy can dynamically adjust the step size as the model queries change during the adversarial example generation process.
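Equations (12) and (13) amount to a clipped linear decay of the step size multiplier. The sketch below shows the schedule with purely hypothetical parameter values (`v_ini=3.0`, `v_min=1.5`, `decay=1000`), which are not the settings used in the paper; `decay` stands for the decay factor $d$.

```python
def psi(q, v_ini, v_min, decay):
    """Eq. (12): multiplier that decays linearly with the query count q
    and is clipped from below at v_min."""
    return max(v_min, v_ini - q / decay)


def dynamic_step(eta, q, v_ini, v_min, decay):
    """Eq. (13): scale the base step size eta by psi(q)."""
    return psi(q, v_ini, v_min, decay) * eta


# Hypothetical settings, chosen only to show the shape of the schedule.
schedule = [psi(q, v_ini=3.0, v_min=1.5, decay=1000.0) for q in (0, 1000, 5000)]
step_at_start = dynamic_step(eta=0.1, q=0, v_ini=3.0, v_min=1.5, decay=1000.0)
```

With these values the multiplier starts at 3.0, drops to 2.0 after 1000 queries, and stays at the floor of 1.5 once the linear term falls below it.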

Relevant research [32,33] indicates that in the field of hard-label black-box adversarial example generation, as the number of iterations increases, adversarial examples become increasingly similar to target samples and the estimated gradients are also more and more inaccurate. Therefore, the step size should be reduced. However, previous step size adjustment strategies [32,33] did not take into account the stage of the adversarial example generation process, so they are relatively inefficient. To address this limitation, we introduce three parameters, $v_{ini}$, $v_{min}$, and $d$, into our step size adjustment strategy. $v_{ini}$ represents the maximum step size. $d$ is used to adjust the decay rate of the step size using $v_{min}$ and $q$, and $v_{min}$ ensures that the step size does not decrease to an excessively small value (since the dimensions of videos are higher than those of images, too small step sizes are inefficient in video scenarios).

Because of the high dimensionalities of the videos, it is inevitable to generate invalid samples during the adversarial example generation process, which will terminate the process of the gradient estimation. After analyzing, we find that the reason for this phenomenon is that a large $\eta_{new}$ is used when generating adversarial examples. But if we explore the new $\eta_{new}$ in the vast search space, it will consume a large number of model queries. So we design a strategy that explores the new $\eta_{new}$ within a specific small range randomly, that is, when we get invalid samples, $\eta_{new}$ will be recalculated:

$$\eta_{new} = \eta_{new} \cdot b \cdot v, \tag{15}$$

where $v$ is a random number sampled from the Gaussian distribution, and $b$ is a small number. This strategy confines the exploration space of the new $\eta_{new}$ within a small ball of radius $\eta_{new} \cdot b$, and $b$ is used to control the radius of the ball. Our strategy aims to help DBA find valid samples quickly; when the algorithm starts its next iteration, $\eta_{new}$ should be set to its original value according to Equations (12) and (13). $b$ is highly sensitive to the number of iterations. Fewer iterations imply a larger difference between the adversarial video and the target video, making it easier to escape the invalid sample space quickly. Experiments show that this strategy has relatively stable performance, which can find valid samples within 20 query attempts. After detailing the three steps of DBA, we formally present the complete description of DBA in Algorithm 1.
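The skipping strategy of Equation (15) can be sketched as a retry loop around the move step. The three-class `toy_model`, the unit `direction`, and the `max_attempts` cap are hypothetical: the cap is an added safeguard motivated by the reported bound of 20 query attempts, not something the equation itself prescribes.

```python
import math
import random


def move(x_bod, direction, step):
    """Eq. (14): shift the boundary video by `step` along a unit direction."""
    return [p + step * g for p, g in zip(x_bod, direction)]


def skip_invalid(f, x_bod, direction, eta_new, y_adv, b, rng, max_attempts=20):
    """Eq. (15): while the moved video is invalid (not classified as y_adv),
    rescale the step by b * v with v drawn from a standard Gaussian."""
    x_adv = move(x_bod, direction, eta_new)
    attempts = 0
    while f(x_adv) != y_adv and attempts < max_attempts:
        v = rng.gauss(0.0, 1.0)
        eta_new = eta_new * b * v          # small, randomly signed step
        x_adv = move(x_bod, direction, eta_new)
        attempts += 1
    return x_adv, eta_new, attempts


def toy_model(x):
    # Hypothetical 3-class model in which the adversarial class 1 is a thin band.
    m = sum(x) / len(x)
    if m < 0.5:
        return 0
    return 1 if m < 0.6 else 2


x_bod = [0.51] * 8                          # just inside the adversarial band
direction = [1.0 / math.sqrt(8)] * 8        # unit vector that raises the mean
x_adv, eta_used, attempts = skip_invalid(
    toy_model, x_bod, direction, eta_new=1.0, y_adv=1, b=0.1,
    rng=random.Random(0))
```

Here the initial step of 1.0 overshoots into class 2, so at least one retry is needed. Because each retry shrinks the step by roughly a factor of `b`, the candidate quickly falls back into the thin adversarial band around the boundary video.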

Algorithm 1 DBA

Input: The source video $x_{src}$, the target video $x_{tgt}$, the target model $f$, and the adversarial label $y_{adv}$.
Output: The adversarial video $x_{adv}^{t}$.

1: Initialize the maximum iteration number $T$, the iteration variable $t$, the binary search threshold $h$, the step size $\eta$, the constant $\delta$, the constant $b$, the variable $v$, and the number of perturbations $N$
2: $t \leftarrow 1$
3: $x_{adv}^{t-1} \leftarrow x_{src}$
4: $m \leftarrow 0$
5: while $t \leq T$ do
6:     $x_{bod}^{t} \leftarrow \mathrm{FindBodVideo}(x_{adv}^{t-1}, x_{tgt}, h, f, y_{adv})$
7:     $u \leftarrow \mathrm{Gaussian}()$
8:     $\widetilde{\nabla CS}(x_{bod}^{t}) \leftarrow \frac{1}{N} \sum_{k=1}^{N} \phi(x_{bod}^{t} + \delta u)\, u$
9:     $\eta_{new} \leftarrow \psi(q)\, \eta$
10:     $x_{adv}^{t} \leftarrow x_{bod}^{t} + \eta_{new} \frac{\widetilde{\nabla CS}(x_{bod}^{t})}{\| \widetilde{\nabla CS}(x_{bod}^{t}) \|_{2}}$
11:     while $f(x_{adv}^{t}) \neq y_{adv}$ do
12:         $v \leftarrow \mathrm{Gaussian}()$
13:         $\eta_{new} \leftarrow \eta_{new} \cdot b \cdot v$
14:         $x_{adv}^{t} \leftarrow x_{bod}^{t} + \eta_{new} \frac{\widetilde{\nabla CS}(x_{bod}^{t})}{\| \widetilde{\nabla CS}(x_{bod}^{t}) \|_{2}}$
15:     end while
16:     $t \leftarrow t + 1$
17: end while
18: return $x_{adv}^{t}$

In Algorithm 1, line 5 to line 17 represent the main loop of DBA; T denotes the maximum number of iterations. Specifically, the function FindBodVideo on line 6 is used to compute the boundary video according to Equation (1); the function Gaussian on line 7 is used to sample random perturbations; line 8 performs gradient estimation according to Equation (5); line 9 to line 10 represent our improved dynamic step size adjustment strategy; line 11 to line 15 denote the invalid sample skipping strategy.
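To make the control flow of Algorithm 1 concrete, the following is a compact sketch under simplifying assumptions, not the authors' implementation. Videos are flat Python lists, the target model is a hypothetical linear `toy_model`, all default parameter values are illustrative, and two safeguards are added that Algorithm 1 does not contain: a cap `max_attempts` on the inner loop and a fallback to the boundary video if no valid step is found.

```python
import math
import random


def mse(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def dba(f, x_src, x_tgt, y_adv, T=10, h=1e-3, eta=0.1, delta=0.01, b=0.1,
        n_samples=200, v_ini=2.0, v_min=1.0, decay=1000.0,
        max_attempts=20, seed=0):
    rng = random.Random(seed)
    queries = 0

    def query(x):
        # Every call to the target model costs one query.
        nonlocal queries
        queries += 1
        return f(x)

    def blend(x_adv, alpha):
        # Eq. (1): interpolate between the adversarial and target videos.
        return [alpha * a + (1.0 - alpha) * t for a, t in zip(x_adv, x_tgt)]

    x_adv = list(x_src)
    for _ in range(T):
        # Step 1: binary search on alpha; the blend at `hi` stays adversarial.
        lo, hi = 0.0, 1.0
        while hi - lo > h:
            mid = (lo + hi) / 2.0
            if query(blend(x_adv, mid)) == y_adv:
                hi = mid
            else:
                lo = mid
        x_bod = blend(x_adv, hi)

        # Step 2 (Eq. 5): average sign-weighted Gaussian perturbations.
        grad = [0.0] * len(x_bod)
        for _ in range(n_samples):
            u = [rng.gauss(0.0, 1.0) for _ in x_bod]
            probe = [p + delta * ui for p, ui in zip(x_bod, u)]
            phi = 1.0 if query(probe) == y_adv else -1.0
            grad = [g + phi * ui / n_samples for g, ui in zip(grad, u)]
        norm = math.sqrt(sum(g * g for g in grad)) or 1.0
        direction = [g / norm for g in grad]

        # Step 3 (Eqs. 12-14): dynamic step size, then move.
        eta_new = max(v_min, v_ini - queries / decay) * eta
        candidate = [p + eta_new * g for p, g in zip(x_bod, direction)]
        valid = query(candidate) == y_adv

        # Eq. (15): rescale the step randomly while the sample is invalid.
        attempts = 0
        while not valid and attempts < max_attempts:
            eta_new = eta_new * b * rng.gauss(0.0, 1.0)
            candidate = [p + eta_new * g for p, g in zip(x_bod, direction)]
            valid = query(candidate) == y_adv
            attempts += 1
        # Assumption: fall back to the boundary video if no valid step exists.
        x_adv = candidate if valid else x_bod
    return x_adv, queries


# Hypothetical linear toy model: label 1 when the weighted sum exceeds C.
W = [1.0 + (i % 4) for i in range(16)]
C = 0.5 * sum(W)


def toy_model(x):
    return 1 if sum(w * xi for w, xi in zip(W, x)) > C else 0


x_src, x_tgt = [0.9] * 16, [0.2] * 16
x_adv, used = dba(toy_model, x_src, x_tgt, y_adv=1)
```

In this toy setting the returned video keeps the adversarial label while lying much closer to the target video than the source video does.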

3. Experiments

3.1. Algorithm Competitors

We design the algorithm comparison as follows:

(1) Hard-label black-box algorithm comparison. In order to evaluate the performance of our algorithm, we compare DBA with STDE [31], which is the state-of-the-art hard-label black-box video adversarial example generation algorithm.

(2) Score-based black-box algorithm comparison. Currently, there exist some state-of-the-art score-based black-box adversarial example generation algorithms for video recognition models, namely VBAD [29] and EARL [30]. We also compare DBA with these two algorithms. This comparison is particularly challenging, as score-based algorithms leverage significantly more information (i.e., probabilities) than hard-label algorithms. Consequently, evaluating DBA against these stronger algorithms underscores the robustness and efficiency of DBA under the most restrictive information setting.

Currently, there are some state-of-the-art image-to-video transfer adversarial example generation algorithms [40,41]. We do not compare DBA with them, because these algorithms focus on the white-box setting, while DBA targets the hard-label black-box setting. Therefore, DBA and image-to-video transfer adversarial example generation algorithms represent two entirely different research fields. However, the transferability of adversarial examples provides the improvement direction for us. In the future, we will explore the transferability of hard-label black-box adversarial examples to improve the performance of DBA.

3.2. Experimental Settings

In our experiments, the datasets HMDB-51 [36] and UCF-101 [42] are used. HMDB-51 is a human motion recognition dataset which includes 7000 videos belonging to 51 categories. Of these, 70% of the videos are assigned to the training set, and 30% to the testing set. HMDB-51 is one of the most commonly used datasets in the field of video recognition. UCF-101 is an action recognition dataset of realistic action videos, consisting of 13,320 trimmed YouTube videos belonging to 101 action categories. Of these, 80% of the videos are assigned to the training set, and 20% to the testing set. UCF-101 is also one of the benchmark datasets most widely used by state-of-the-art video recognition models.

As for video recognition models, C3D [43] and TSN [44] are used in our experiments. C3D studies spatio-temporal features of the videos by 3D convolution layers, and it is also recognized as a popular benchmark in the field of video recognition. TSN is a novel framework for the video-based action recognition, which is based on the idea of long-range temporal structure modeling. TSN combines the sparse temporal sampling strategy and the video-level supervision strategy to improve its performance.

We use the following metrics to evaluate algorithm performance:

(1) Mean Squared Error (MSE), which is the mean squared distance between the target video and the adversarial video in every iteration. The MSE indicates the magnitude of perturbations. The lower the MSE is, the more similar the adversarial video is to the target video.

(2) Success Rate (SRT), which is the ratio of successful adversarial videos to all test videos. The MSE and model queries of these successful adversarial videos are below specific thresholds, respectively. The higher the SRT is, the more efficient the algorithm is.

(3) Mean Query Numbers (MQNs), which are the average model queries of all successful adversarial videos. The lower the MQN is, the better the algorithm is.

(4) Peak Signal to Noise Ratio (PSNR), which quantifies the ratio of the maximum pixel intensity to the global perturbations. A higher PSNR indicates better performance.

(5) Average Occluded Area (AOA), which is the ratio of the added perturbations to the distance between the source video and the target video. A lower AOA indicates better performance.

(6) Query Attempts (QA). When invalid samples are found, DBA will consume some model queries to escape the invalid sample space. We refer to these model queries as query attempts. A lower QA indicates better performance.

PSNR focuses on analyzing the impact of perturbations on samples from the perspective of perturbation intensity [45]. A higher value indicates fewer perturbations and better visual quality. AOA emphasizes analyzing the influence of perturbations from the perspective of their impact range. A higher value means a larger affected area and poorer visual quality [31].
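The MSE, PSNR, SRT, and MQN definitions above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes videos are NumPy arrays with pixel values in [0, 255], that the MSE is averaged over all frames, pixels, and channels, and that "below the threshold" is inclusive. The function names are hypothetical and are not taken from the DBA implementation.

```python
import numpy as np

def mse(target, adversarial):
    """Mean squared error, averaged over every frame, pixel, and channel."""
    diff = np.asarray(adversarial, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    return float(np.mean(diff ** 2))

def psnr(target, adversarial, max_value=255.0):
    """Peak signal-to-noise ratio in dB (infinite for identical videos)."""
    err = mse(target, adversarial)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / err))

def _successful(mse_values, query_counts, mse_threshold, query_budget):
    """Boolean mask: both the MSE and the query count are within the thresholds."""
    mse_values = np.asarray(mse_values, dtype=np.float64)
    query_counts = np.asarray(query_counts, dtype=np.float64)
    return (mse_values <= mse_threshold) & (query_counts <= query_budget)

def success_rate(mse_values, query_counts, mse_threshold, query_budget):
    """SRT: fraction of all test videos that count as successful."""
    return float(np.mean(_successful(mse_values, query_counts, mse_threshold, query_budget)))

def mean_query_number(mse_values, query_counts, mse_threshold, query_budget):
    """MQN: average query count over the successful videos only (NaN if none)."""
    ok = _successful(mse_values, query_counts, mse_threshold, query_budget)
    if not ok.any():
        return float("nan")
    return float(np.asarray(query_counts, dtype=np.float64)[ok].mean())

# Toy usage: a uniform perturbation of 5 intensity levels on a tiny "video".
target = np.zeros((2, 4, 4, 3), dtype=np.uint8)
adversarial = np.full((2, 4, 4, 3), 5, dtype=np.uint8)
print("MSE:", mse(target, adversarial))
print("PSNR (dB):", psnr(target, adversarial))
print("SRT:", success_rate([10, 30, 20], [1000, 2000, 40000], 25, 30000))
print("MQN:", mean_query_number([10, 30, 20], [1000, 2000, 40000], 25, 30000))
```

Casting to float64 before subtracting avoids the wrap-around that unsigned 8-bit pixel arrays would otherwise produce.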

To better distinguish the performance of the compared algorithms, we set the maximum number of model queries to 30,000. This is a demanding budget, because most black-box video adversarial example generation algorithms use far larger ones. For example, EARL and VBAD set this value to 300,000 and 600,000, respectively, which is 10 times and 20 times as much as ours. Relevant studies [32,33] show that with a large query budget all compared algorithms can succeed, which makes it hard to distinguish their performance. Therefore, the maximum number of model queries should be set as small as possible. This way of setting parameters is widely used in hard-label black-box adversarial example generation [31,32,33]. Generating adversarial examples with fewer model queries also indicates that the algorithm can rapidly identify the vulnerabilities of the target model. Additionally, in real-world scenarios, the target model may incorporate an attack detection framework, and a smaller query budget helps an algorithm avoid detection. Both points are of practical significance. Researchers can also set higher query budgets for a more in-depth and detailed assessment of the target model’s security. Moreover, we set v_{ini} = 20, v_{min} = 5, b = 1, and d = 200. When setting these experimental parameters, we drew upon previous relevant studies [29,30,31], and the parameters remained unchanged throughout the experiments.

3.3. Results and Analysis

3.3.1. Magnitude of Perturbations

In this section, we will evaluate the magnitude of perturbations contained in adversarial videos generated by different algorithms using three metrics: MSE, PSNR, and AOA.

First, we show the average MSE of all successful adversarial videos of two target models during the adversarial example generation process with different model queries in Figure 5. As illustrated in Figure 5, DBA achieves a reduction in the magnitude of perturbations by 50–70% compared to STDE.

Figure 6 shows the PSNR of comparison algorithms at different model queries. It can be observed that the PSNR of DBA is 20% higher than that of STDE on average. Figure 7 shows the AOA of comparison algorithms at different model queries. It can be concluded that the AOA of DBA is 50–70% lower than that of STDE on average.

We do not draw the curves of EARL and VBAD in Figure 5, Figure 6 and Figure 7 because they are score-based algorithms and produce only one successful adversarial video if they succeed. In other words, EARL and VBAD do not generate other successful adversarial videos during the adversarial example generation process. To compare with them fairly, we analyze all common successful adversarial videos of the four algorithms, and the results are shown in Table 2 and Table 3 (“-” means that the corresponding algorithm cannot generate adversarial videos successfully within 30,000 model queries). It can be seen that DBA achieves the best performance with the fewest model queries. From Figure 5, Figure 6 and Figure 7, we can also conclude that STDE can generate low-quality adversarial videos rapidly but cannot produce high-quality adversarial videos.

DBA outperforms STDE because of differences in algorithm design. Specifically, STDE is a patch-based algorithm, whereas DBA is a pixel-based algorithm, so DBA can place perturbations on pixels more precisely. STDE uses the target video as the initial perturbation, adds it to the source video, and then gradually optimizes and reduces the magnitude of the perturbations. However, one flaw of STDE is that it can only optimize perturbations within a specific threshold and cannot optimize them continuously.

In practical applications, adversarial videos with fewer perturbations tend to be of higher quality and less detectable by human vision. The experimental results of this section clearly show that the adversarial videos generated by DBA contain significantly fewer perturbations than those produced by other comparison algorithms. In other words, DBA can generate higher-quality adversarial videos, which is particularly important for evaluating the security of target models.

3.3.2. Success Rate

In this section, we will evaluate the success rate of all algorithms under different MSE and model queries.

When the MSE threshold is 25, comparison results for the success rate of two target models with different model queries are reported in Figure 8. We do not draw the curve of EARL in two target models experiments and the curve of VBAD in C3D experiments because the two algorithms cannot generate adversarial videos successfully on the corresponding model within 30,000 model queries. As shown in Figure 8, STDE cannot optimize adversarial videos during the entire adversarial example generation process. By contrast, DBA can continuously optimize adversarial videos with the model queries increasing, and its final success rate is 1.6–4.3 times higher than that of STDE. Although VBAD can also continuously optimize adversarial videos during the adversarial example generation process, its success rate is significantly lower than that of DBA.

To further evaluate the performance of all algorithms under different model queries and MSE thresholds, we report additional results in Figure 9 and Figure 10. As illustrated in Figure 9 and Figure 10, DBA achieves the best performance in all test cases. Note also that there are no experimental results for EARL or VBAD in Figure 9, whereas such results appear in Figure 10. This phenomenon is caused by the following two reasons:

(1) EARL and VBAD perform poorly under a small query budget. The two algorithms are based on logits (probabilities). Compared with hard labels, logits have a wider value range, and to obtain more accurate gradients these algorithms must query the target model frequently [29,30]. Algorithms based on logits therefore usually set a higher maximum number of model queries. For example, VBAD and EARL set their maximum model queries to 300,000 and 600,000, respectively. In order to highlight the performance of DBA, we set the maximum model queries to 30,000 in our experiments, which means that VBAD and EARL sometimes cannot generate adversarial videos successfully. Similar experimental results are also reported in the STDE study [31].

(2) VBAD behaves differently on different models. The above experimental results show that VBAD performs better on TSN than on C3D. This is caused by the difference between TSN and C3D. Specifically, TSN uses sparse frames for prediction, so the perturbations of each frame have a greater impact [44]. By contrast, C3D uses more frames for prediction, and the impact of each frame’s perturbations is relatively small [43]. VBAD uses image perturbations to initialize frame perturbations, and relevant research has shown that initializing perturbations in this way can improve efficiency [46,47,48]. When testing TSN, VBAD only needs to add image perturbations to sparse frames, which reduces the interference caused by adding image perturbations to a great many frames, so VBAD performs better in the TSN experiments than in the C3D experiments.

In practical applications, a higher success rate indicates that the algorithm can generate adversarial videos rapidly. The experimental results of this section clearly show that the success rate of DBA is significantly higher than that of the other compared algorithms. This means that DBA can generate adversarial videos with smaller query budgets, which is critically important for quickly discovering the target model’s security vulnerabilities.

3.3.3. Ablation Studies

DBA aligns N and δ with the standard hard-label black-box image adversarial example generation algorithms [32,33]. Next, we perform ablation studies on the dynamic step size adjustment strategy and the invalid-sample skipping strategy. We set the maximum model queries to 400,000, and the MSE threshold is 25.

Effect of dynamic step size adjustment strategy. As shown in Figure 11a, when the dynamic adjustment strategy is removed, the success rate of DBA drops from 96% to 6%. The primary reason is that traditional step size strategies are only applicable to low-dimensional image data and perform poorly when handling high-dimensional video data. Our designed strategy enables the generation algorithm to dynamically adjust the step size based on the stage of the adversarial example generation process, thereby improving the efficiency of generating adversarial videos. Figure 11b illustrates the success rate of DBA when v_{ini} is set to different values. It can be observed that as v_{ini} gradually increases, the success rate first increases and then decreases. v_{ini} primarily determines the magnitude of the step size. Relevant research [34] indicates that excessively large or small step sizes can degrade the efficiency of adversarial example generation. Therefore, we set v_{ini} = 20.

Effect of invalid-sample skipping strategy. As shown in Figure 12a, after removing the invalid-sample skipping strategy, DBA’s query attempts increase from 12 to 125. This implies that our designed strategy can reduce query attempts by over 90%, which helps DBA find valid samples quickly. Figure 12b illustrates how query attempts vary when b is set to different values. b primarily determines the radius of the exploration sphere. It can be observed that as the radius increases, DBA tends to require more query attempts to escape the invalid sample space. Therefore, in real-world applications, b should be set to a small value.

The experimental results of this section clearly show the effect of the invalid-sample skipping strategy and the dynamic step size adjustment strategy. These two strategies can help improve the efficiency of adversarial video generation in practical applications.

3.4. Failure Cases and Future Directions

DBA is a pixel-based algorithm that does not select keyframes. It regards all video frames as a whole and dynamically optimizes the perturbations based on the estimated gradient. As shown in Figure 5, while DBA significantly outperforms existing algorithms in the long run, it exhibits a relatively slower optimization speed during the initial phase compared to keyframe-based algorithms like STDE. This phenomenon is primarily attributed to the fundamental difference in search space dimensionality: DBA operates on a pixel-level dense representation, whereas STDE constrains the perturbation to sparse keyframes. In the future, we will study how to apply keyframe selection methods in adversarial example generation algorithms and attempt to calculate keyframes of videos for DBA in the initial process. Additionally, we will also explore the transferability of hard-label black-box image-to-video adversarial examples to improve the efficiency of DBA.

4. Conclusions

This paper presents DBA, a novel hard-label black-box algorithm designed for generating adversarial examples on video recognition models. First, DBA uses a sampling-based algorithm to estimate the gradient on the boundary video. Second, by updating the boundary video with an adaptive dynamic adjustment strategy, DBA effectively addresses the challenges of high dimensionality and query inefficiency inherent in video-based adversarial example generation. Furthermore, a specialized invalid-sample skipping strategy is introduced to ensure optimization stability under limited query budgets. Extensive experiments on benchmark datasets (HMDB-51 and UCF-101) demonstrate that DBA significantly outperforms state-of-the-art algorithms, achieving an average reduction in MSE of over 50% while maintaining superior visual imperceptibility. A limitation is that DBA’s pixel-level optimization exhibits a slower initial optimization speed than heuristic keyframe-based algorithms. Future research will focus on integrating spatio-temporal keyframe selection algorithms and transferable perturbations into the DBA framework to further enhance query efficiency in the early optimization stages.

Author Contributions

Methodology, Y.J.; formal analysis, Y.J. and L.W.; writing—original draft, Y.J.; writing—review and editing, K.S., W.W., Z.L. and Q.D.; supervision, L.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Proof of Lemma 1

First, we introduce the function g(t):

\begin{matrix} g(t) = CS(x_{bod}^{t} + t u), \quad t \in [0, δ]. \end{matrix}

(A1)

From Equation (A1), we have

\begin{matrix} g(0) = CS(x_{bod}^{t}) = 0, \end{matrix}

(A2)

\begin{matrix} g(δ) = CS(x_{bod}^{t} + δ u). \end{matrix}

(A3)

Since CS is twice continuously differentiable, g is also twice continuously differentiable. Expanding g(δ) with a second-order Taylor expansion with Lagrange remainder term, we have

\begin{matrix} g(δ) = g(0) + g^{'}(0) \cdot δ + \frac{1}{2} g^{″}(θ) \cdot δ^{2}, \quad θ \in (0, δ), \end{matrix}

(A4)

where \frac{1}{2} g^{″}(θ) \cdot δ^{2} is the Lagrange remainder term. Next, compute the first and second derivatives of Equation (A1):

\begin{matrix} g^{'}(t) = \nabla CS(x_{bod}^{t} + t u)^{⊤} u, \end{matrix}

(A5)

\begin{matrix} g^{'}(0) = \nabla CS(x_{bod}^{t})^{⊤} u, \end{matrix}

(A6)

\begin{matrix} g^{″}(t) = u^{⊤} \nabla^{2} CS(x_{bod}^{t} + t u) u, \end{matrix}

(A7)

\begin{matrix} g^{″}(θ) = u^{⊤} \nabla^{2} CS(x_{bod}^{t} + θ u) u. \end{matrix}

(A8)

Substituting Equations (A5)–(A8) into Equation (A4), together with g(0) = 0 from Equation (A2), yields

\begin{matrix} CS(x_{bod}^{t} + δ u) = δ \nabla CS(x_{bod}^{t})^{⊤} u + \frac{1}{2} δ^{2} u^{⊤} \nabla^{2} CS(x_{bod}^{t} + θ u) u. \end{matrix}

(A9)

Let x^{'} = x_{bod}^{t} + θ u. Since θ \in (0, δ), x^{'} lies between x_{bod}^{t} and x_{bod}^{t} + δ u. Thus,

\begin{matrix} CS(x_{bod}^{t} + δ u) = δ \nabla CS(x_{bod}^{t})^{⊤} u + \frac{1}{2} δ^{2} u^{⊤} \nabla^{2} CS(x^{'}) u. \end{matrix}

(A10)
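As a sanity check on the expansion in Lemma 1, the identity can be verified numerically on a toy function. The sketch below is not part of the proof. It assumes a hypothetical quadratic stand-in for CS whose Hessian is constant, so the second-order term is the same for every intermediate point x' and the expansion holds exactly. All names in the code are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6

# Hypothetical boundary point, gradient, and symmetric Hessian of a toy CS.
x_bod = rng.normal(size=d)
grad = rng.normal(size=d)
m = rng.normal(size=(d, d))
hess = (m + m.T) / 2.0

def cs(x):
    """Quadratic toy CS with cs(x_bod) = 0, gradient `grad` at x_bod, Hessian `hess`."""
    z = x - x_bod
    return float(grad @ z + 0.5 * z @ hess @ z)

# Random unit direction u and a small probe radius delta.
u = rng.normal(size=d)
u /= np.linalg.norm(u)
delta = 0.1

lhs = cs(x_bod + delta * u)                                      # CS(x_bod + delta*u)
rhs = delta * float(grad @ u) + 0.5 * delta**2 * float(u @ hess @ u)
print("CS(x_bod + delta*u)   :", lhs)
print("first + second order  :", rhs)
print("absolute difference   :", abs(lhs - rhs))
```

For a general twice continuously differentiable CS the Hessian varies with x', so the equality holds only for the particular intermediate point guaranteed by the Lagrange remainder.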

Appendix B. Proof of Lemma 2

First, perform a second-order Taylor expansion of f(r) at r = 0:

\begin{matrix} f(r) = f(0) + f^{'}(0) r + \frac{1}{2} f^{″}(0) r^{2} + o(r^{2}). \end{matrix}

(A11)

For the terms in Equation (A11), we have

\begin{matrix} f(0) = \frac{1}{\sqrt{1}} = 1, \end{matrix}

(A12)

\begin{matrix} f^{'}(r) = \frac{-r}{(1 - 2r + 2r^{2})^{3/2}}, \end{matrix}

(A13)

\begin{matrix} f^{'}(0) = 0, \end{matrix}

(A14)

\begin{matrix} f^{″}(r) = -\frac{1}{(1 - 2r + 2r^{2})^{3/2}} - \frac{3r(1 - 2r)}{(1 - 2r + 2r^{2})^{5/2}}, \end{matrix}

(A15)

\begin{matrix} f^{″}(0) = -1. \end{matrix}

(A16)

Substituting Equations (A12)–(A16) into Equation (A11), we have

\begin{matrix} f(r) = f(0) + f^{'}(0) r + \frac{1}{2} f^{″}(0) r^{2} + o(r^{2}) \\ = 1 + 0 \cdot r + \frac{1}{2} \cdot (-1) \cdot r^{2} + o(r^{2}) = 1 - \frac{1}{2} r^{2} + o(r^{2}). \end{matrix}

(A17)

Omitting the o(r^{2}) term from Equation (A17), that is, keeping only the terms up to second order in r, yields

\begin{matrix} f(r) = \frac{1 - r}{\sqrt{1 - 2r + 2r^{2}}} \geq 1 - \frac{1}{2} r^{2}. \end{matrix}

(A18)
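The expansion in Equation (A17) can be checked numerically. The short sketch below evaluates f(r) and the quadratic 1 - r^2/2 for a few small values of r and prints the remainder divided by r^2; if the expansion is correct, this ratio should tend to 0 as r shrinks, which is what an o(r^2) remainder means. The chosen values of r are arbitrary and only illustrative.

```python
import math

def f(r):
    """f(r) = (1 - r) / sqrt(1 - 2r + 2r^2), as in Lemma 2."""
    return (1.0 - r) / math.sqrt(1.0 - 2.0 * r + 2.0 * r * r)

def quadratic(r):
    """Second-order Taylor polynomial of f at r = 0."""
    return 1.0 - 0.5 * r * r

# Remainder f(r) - (1 - r^2/2) for a few illustrative radii.
gaps = {r: f(r) - quadratic(r) for r in (0.2, 0.1, 0.05, 0.01)}

for r, gap in gaps.items():
    print(f"r={r:<5} f(r)={f(r):.8f}  1-r^2/2={quadratic(r):.8f}  remainder/r^2={gap / r**2:.5f}")
```

The printed ratio makes the size and sign of the neglected higher-order term visible, which is useful when judging how tight the second-order expression is for a given r.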

Appendix C. Proof of Theorem 1

First, by Lemma 1, we have

\begin{matrix} CS(x_{bod}^{t} + δ u) = δ \nabla CS(x_{bod}^{t})^{⊤} u + \frac{1}{2} δ^{2} u^{⊤} \nabla^{2} CS(x^{'}) u, \end{matrix}

(A19)

where x^{'} is a video between x_{bod}^{t} and x_{bod}^{t} + δ u, and CS(x_{bod}^{t}) = 0. Since the gradient is Lipschitz continuous with constant L and u is a unit vector, we have

\begin{matrix} \left| \frac{1}{2} δ^{2} u^{⊤} \nabla^{2} CS(x^{'}) u \right| \leq \frac{1}{2} L δ^{2}. \end{matrix}

(A20)

If we set w = \frac{1}{2} L δ, then when \nabla CS(x_{bod}^{t})^{⊤} u > w, we have

\begin{matrix} CS(x_{bod}^{t} + δ u) = δ \nabla CS(x_{bod}^{t})^{⊤} u + \frac{1}{2} δ^{2} u^{⊤} \nabla^{2} CS(x^{'}) u \\ \geq δ \left( \nabla CS(x_{bod}^{t})^{⊤} u - \frac{1}{2} L δ \right) > 0. \end{matrix}

(A21)

Similarly, when \nabla CS(x_{bod}^{t})^{⊤} u < -w, we have CS(x_{bod}^{t} + δ u) < 0. Thus,

\begin{matrix} ϕ(x_{bod}^{t} + δ u) = 1, \quad \text{if } \nabla CS(x_{bod}^{t})^{⊤} u > w, \end{matrix}

(A22)

\begin{matrix} ϕ(x_{bod}^{t} + δ u) = -1, \quad \text{if } \nabla CS(x_{bod}^{t})^{⊤} u < -w. \end{matrix}

(A23)

Normalize \nabla CS(x_{bod}^{t}) as v_{1}, then supplement it with v_{2}, \dots, v_{d} to form an orthonormal basis, where d is the number of pixels of the video. Writing u in this basis, the coefficient vector β = (β_{1}, \dots, β_{d}) is uniformly distributed on the sphere:

\begin{matrix} v_{1} = \frac{\nabla CS(x_{bod}^{t})}{∥\nabla CS(x_{bod}^{t})∥_{2}}, \quad u = \sum_{i=1}^{d} β_{i} v_{i}. \end{matrix}

(A24)

Next, we divide the spherical region into three subregions:

\begin{matrix} E_{1} = \{\nabla CS(x_{bod}^{t})^{⊤} u > w\}, \\ E_{2} = \{|\nabla CS(x_{bod}^{t})^{⊤} u| < w\}, \\ E_{3} = \{\nabla CS(x_{bod}^{t})^{⊤} u < -w\}, \end{matrix}

(A25)

where E_{1} and E_{3} represent the “upper cap” and “lower cap” of the sphere, respectively, and E_{2} is the middle region of the sphere. If p is the probability of the E_{2} region, by symmetry, we have

\begin{matrix} p = P(E_{2}), \\ P(E_{1}) = P(E_{3}) = \frac{1 - p}{2}. \end{matrix}

(A26)

Since only v_{1} aligns with the gradient direction, v_{1} is the principal axis of the sphere. Thus,

\begin{matrix} E[β_{i} ∣ E_{1}] = E[β_{i} ∣ E_{3}] = 0, \quad i \neq 1. \end{matrix}

(A27)

The expectation value for the entire sphere is

\begin{matrix} E[ϕ(x_{bod}^{t} + δ u) u] = E[ϕ(x_{bod}^{t} + δ u) u ∣ E_{1}] P(E_{1}) + E[ϕ(x_{bod}^{t} + δ u) u ∣ E_{2}] P(E_{2}) \\ + E[ϕ(x_{bod}^{t} + δ u) u ∣ E_{3}] P(E_{3}). \end{matrix}

(A28)

From Equations (A22) and (A23), we have

\begin{matrix} ϕ(x_{bod}^{t} + δ u) = 1 \text{ when } \nabla CS(x_{bod}^{t})^{⊤} u > w \ (E_{1}), \\ ϕ(x_{bod}^{t} + δ u) = -1 \text{ when } \nabla CS(x_{bod}^{t})^{⊤} u < -w \ (E_{3}). \end{matrix}

(A29)

Substituting Equation (A29) and the probabilities in Equation (A26) into Equation (A28) yields

\begin{matrix} E[ϕ(x_{bod}^{t} + δ u) u] = E[u ∣ E_{1}] \cdot \frac{1 - p}{2} + E[ϕ(x_{bod}^{t} + δ u) u ∣ E_{2}] \cdot p \\ - E[u ∣ E_{3}] \cdot \frac{1 - p}{2}. \end{matrix}

(A30)

Next, expand u in the orthonormal basis:

\begin{matrix} u = \sum_{i=1}^{d} β_{i} v_{i}. \end{matrix}

(A31)

By symmetry, when i \neq 1, E[β_{i} ∣ E_{1}] = E[β_{i} ∣ E_{3}] = 0. Thus,

\begin{matrix} E[u ∣ E_{1}] = E[β_{1} v_{1} ∣ E_{1}] = E[β_{1} ∣ E_{1}] v_{1} = μ v_{1}, \\ E[u ∣ E_{3}] = E[β_{1} v_{1} ∣ E_{3}] = E[β_{1} ∣ E_{3}] v_{1} = -μ v_{1}, \end{matrix}

(A32)

where μ = E[β_{1} ∣ E_{1}] > 0 and E[β_{1} ∣ E_{3}] = -μ < 0.

Substituting Equation (A32) into Equation (A30), we have

\begin{matrix} E[ϕ(x_{bod}^{t} + δ u) u] = μ v_{1} \cdot \frac{1 - p}{2} + p \cdot E[ϕ(x_{bod}^{t} + δ u) u ∣ E_{2}] + μ v_{1} \cdot \frac{1 - p}{2} \\ = μ v_{1} (1 - p) + p \cdot E[ϕ(x_{bod}^{t} + δ u) u ∣ E_{2}]. \end{matrix}

(A33)

We notice that

\begin{matrix} μ v_{1} = E[β_{1} v_{1} ∣ E_{1}] = E[-β_{1} v_{1} ∣ E_{3}]. \end{matrix}

(A34)

Therefore, μ v_{1} can be expressed as

\begin{matrix} μ v_{1} = \frac{1}{2} (E[β_{1} v_{1} ∣ E_{1}] + E[-β_{1} v_{1} ∣ E_{3}]). \end{matrix}

(A35)

Similarly, we can decompose μ v_{1} (1 - p) as

\begin{matrix} μ v_{1} (1 - p) = μ v_{1} - μ v_{1} p = \frac{1}{2} (E[β_{1} v_{1} ∣ E_{1}] + E[-β_{1} v_{1} ∣ E_{3}]) \\ - p \cdot \frac{1}{2} (E[β_{1} v_{1} ∣ E_{1}] + E[-β_{1} v_{1} ∣ E_{3}]). \end{matrix}

(A36)

Substituting Equations (A35) and (A36) into Equation (A33), we have

\begin{matrix} E[ϕ(x_{bod}^{t} + δ u) u] = p \cdot E[ϕ(x_{bod}^{t} + δ u) u ∣ E_{2}] + \frac{1}{2} (E[β_{1} v_{1} ∣ E_{1}] + E[-β_{1} v_{1} ∣ E_{3}]) \\ - p \cdot \frac{1}{2} (E[β_{1} v_{1} ∣ E_{1}] + E[-β_{1} v_{1} ∣ E_{3}]). \end{matrix}

(A37)

Rewrite Equation (A37) as

\begin{matrix} E[ϕ(x_{bod}^{t} + δ u) u] = p \cdot \left( E[ϕ(x_{bod}^{t} + δ u) u ∣ E_{2}] - \frac{1}{2} E[β_{1} v_{1} ∣ E_{1}] - \frac{1}{2} E[-β_{1} v_{1} ∣ E_{3}] \right) \\ + \frac{1}{2} E[β_{1} v_{1} ∣ E_{1}] + \frac{1}{2} E[-β_{1} v_{1} ∣ E_{3}]. \end{matrix}

(A38)

Next, we bound the bias as

\begin{matrix} ∥E[ϕ(x_{bod}^{t} + δ u) u] - E[|β_{1}| v_{1}]∥_{2} \leq 3p. \end{matrix}

(A39)

Suppose that a = E[ϕ(x_{bod}^{t} + δ u) u] and b = E[|β_{1}| v_{1}] = E[|β_{1}|] v_{1}. Since b is a positive multiple of v_{1} and v_{1} represents the gradient direction, we have

\begin{matrix} \cos ∠(a, \nabla CS(x_{bod}^{t})) = \cos ∠(a, v_{1}) = \frac{a \cdot v_{1}}{∥a∥_{2}}. \end{matrix}

(A40)

Next, decompose a into the vector a_{‖} parallel to v_{1} and the vector a_{⊥} perpendicular to v_{1}:

\begin{matrix} a = a_{‖} + a_{⊥}, \quad a_{‖} = (a \cdot v_{1}) v_{1}, \quad a_{⊥} = a - a_{‖}. \end{matrix}

(A41)

Given ∥a - b∥_{2} \leq 3p and b = E[|β_{1}|] v_{1}, if we set a \cdot v_{1} = α, then a_{‖} = α v_{1}. Thus,

\begin{matrix} ∥a - b∥_{2}^{2} = ∥(α - E[|β_{1}|]) v_{1} + a_{⊥}∥_{2}^{2} = (α - E[|β_{1}|])^{2} + ∥a_{⊥}∥_{2}^{2} \leq 9p^{2}. \end{matrix}

(A42)

Therefore,

\begin{matrix} |α - E[|β_{1}|]| \leq 3p, \quad ∥a_{⊥}∥_{2} \leq 3p. \end{matrix}

(A43)

Since a \cdot v_{1} = α, we have

\begin{matrix} ∥a∥_{2}^{2} = α^{2} + ∥a_{⊥}∥_{2}^{2}. \end{matrix}

(A44)

Thus,

\begin{matrix} \cos ∠(a, v_{1}) = \frac{α}{\sqrt{α^{2} + ∥a_{⊥}∥_{2}^{2}}}. \end{matrix}

(A45)

From |α - E[|β_{1}|]| \leq 3p, we have α \geq E[|β_{1}|] - 3p. Combined with ∥a_{⊥}∥_{2} \leq 3p, we further have

\begin{matrix} \cos ∠(a, v_{1}) \geq \frac{E[|β_{1}|] - 3p}{\sqrt{(E[|β_{1}|] - 3p)^{2} + (3p)^{2}}}. \end{matrix}

(A46)

If we set r = \frac{3p}{E[|β_{1}|]}, we have

\begin{matrix} \cos ∠(a, v_{1}) \geq \frac{1 - r}{\sqrt{(1 - r)^{2} + r^{2}}} = \frac{1 - r}{\sqrt{1 - 2r + 2r^{2}}}. \end{matrix}

(A47)

Combining with Lemma 2, we have

\begin{matrix} \cos ∠(a, v_{1}) \geq 1 - \frac{1}{2} r^{2}. \end{matrix}

(A48)

Thus,

\begin{matrix} \cos ∠(a, v_{1}) \geq 1 - \frac{1}{2} \left( \frac{3p}{E[|β_{1}|]} \right)^{2}. \end{matrix}

(A49)

As a result,

\begin{matrix} \cos ∠(E[ϕ(x_{bod}^{t} + δ u) u], \nabla CS(x_{bod}^{t})) \geq 1 - \frac{1}{2} \left( \frac{3p}{E[|β_{1}|]} \right)^{2}. \end{matrix}

(A50)

Note that \left\langle \frac{\nabla CS(x_{bod}^{t})}{∥\nabla CS(x_{bod}^{t})∥_{2}}, u \right\rangle^{2} follows the Beta distribution B(\frac{1}{2}, \frac{d - 1}{2}). Therefore, with B(\cdot, \cdot) in the bound denoting the Beta function,

\begin{matrix} p = P\left( \left\langle \frac{\nabla CS(x_{bod}^{t})}{∥\nabla CS(x_{bod}^{t})∥_{2}}, u \right\rangle^{2} \leq \frac{w^{2}}{∥\nabla CS(x_{bod}^{t})∥_{2}^{2}} \right) \leq \frac{2w}{B(\frac{1}{2}, \frac{d - 1}{2}) ∥\nabla CS(x_{bod}^{t})∥_{2}}. \end{matrix}

(A51)

Combining with Equation (A50), and using w = \frac{1}{2} L δ together with E[|β_{1}|] \cdot B(\frac{1}{2}, \frac{d - 1}{2}) = \frac{2}{d - 1}, we have

\begin{matrix} \cos ∠(E[ϕ(x_{bod}^{t} + δ u) u], \nabla CS(x_{bod}^{t})) \geq 1 - \frac{18 w^{2}}{(E[|β_{1}|])^{2} B(\frac{1}{2}, \frac{d - 1}{2})^{2} ∥\nabla CS(x_{bod}^{t})∥_{2}^{2}} \\ = 1 - \frac{9 L^{2} δ^{2} (d - 1)^{2}}{8 ∥\nabla CS(x_{bod}^{t})∥_{2}^{2}}. \end{matrix}

(A52)

Furthermore, we have

\begin{matrix} E[ϕ(x_{bod}^{t} + δ u) u] = E[\widetilde{\nabla CS}(x_{bod}^{t})]. \end{matrix}

(A53)

Therefore,

\begin{matrix} \cos ∠(E[\widetilde{\nabla CS}(x_{bod}^{t})], \nabla CS(x_{bod}^{t})) \geq 1 - \frac{9 L^{2} δ^{2} (d - 1)^{2}}{8 ∥\nabla CS(x_{bod}^{t})∥_{2}^{2}}. \end{matrix}

(A54)

When δ \to 0, we have

\begin{matrix} \lim_{δ \to 0} \cos ∠(E[\widetilde{\nabla CS}(x_{bod}^{t})], \nabla CS(x_{bod}^{t})) = 1. \end{matrix}

(A55)

From Equation (A54), it can be seen that as d increases, the lower bound on the accuracy of the estimated gradient decreases. Therefore, the high pixel count of videos poses a significant challenge for DBA. We acknowledge this limitation and plan to introduce keyframe techniques in the future to reduce the pixel count and improve the algorithm’s performance.
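The statement of Theorem 1 can be illustrated with a small Monte Carlo experiment. The sketch below is a toy illustration under stated assumptions, not the DBA implementation. It assumes a hypothetical function toy_cs with a known gradient at the boundary point and a 1-Lipschitz gradient, draws directions u uniformly from the unit sphere, uses only the sign of CS as feedback (the hard-label setting), and averages ϕ(x_{bod}^{t} + δ u) u. The sample count, dimension, and values of δ are arbitrary choices.

```python
import numpy as np

def sample_sphere(num_samples, d, rng):
    """Draw directions uniformly from the unit sphere in R^d."""
    u = rng.normal(size=(num_samples, d))
    return u / np.linalg.norm(u, axis=1, keepdims=True)

def estimate_gradient_direction(cs, x_bod, delta, num_samples, rng):
    """Monte Carlo average of phi(x_bod + delta*u) * u, where phi is the sign of CS."""
    u = sample_sphere(num_samples, x_bod.size, rng)
    values = np.array([cs(x_bod + delta * ui) for ui in u])
    phi = np.where(values > 0, 1.0, -1.0)  # only the hard label is used
    return (phi[:, None] * u).mean(axis=0)

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-in for CS (an assumption): zero at x_bod, gradient `grad` there,
# Hessian equal to `curvature` times the identity.
d = 10
grad = np.zeros(d)
grad[0] = 1.0
curvature = 1.0
x_bod = np.zeros(d)

def toy_cs(x):
    return float(grad @ x + 0.5 * curvature * x @ x)

cos_by_delta = {}
for delta in (0.01, 0.5, 1.5):
    rng = np.random.default_rng(0)  # reuse the same directions for every delta
    estimate = estimate_gradient_direction(toy_cs, x_bod, delta, 2000, rng)
    cos_by_delta[delta] = cosine(estimate, grad)
    print(f"delta={delta:<5} cosine with true gradient = {cos_by_delta[delta]:.4f}")
```

With a small δ the curvature term barely affects the sign, so the averaged direction aligns closely with the true gradient, in line with Equation (A55). With a large δ most probes land on the same side of the boundary and the estimate degrades. The finite number of samples adds sampling noise on top of the bias bounded in Equation (A54), so the printed values should not be read as the bound itself.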

References

Zhang, Y.; Chen, Z. A New Architecture of Neural Network. In Proceedings of the 1991 IEEE International Joint Conference on Neural Networks, Singapore, 18–21 November 1991; Volume 1, pp. 833–838. [Google Scholar]
Özgür, A.; Nar, F. Effect of Dropout Layer on Classical Regression Problems. In Proceedings of the 2020 28th Signal Processing and Communications Applications Conference (SIU), Gaziantep, Turkey, 5–7 October 2020; pp. 1–4. [Google Scholar]
Khedr, Y.M.; Liu, X.; Lu, H.; He, K. Transferable Adversarial Attacks against Face Recognition Using Surrogate Model Fine-Tuning. Appl. Soft Comput. 2025, 174, 112983. [Google Scholar] [CrossRef]
Zhang, C.; Zhou, L.; Xu, X.; Wu, J.; Liu, Z. Adversarial Attacks of Vision Tasks in the Past 10 Years: A Survey. ACM Comput. Surv. 2025, 58, 52. [Google Scholar] [CrossRef]
Zheng, M.; Yan, X.; Zhu, Z.; Chen, H.; Wu, B. BlackboxBench: A Comprehensive Benchmark of Black-Box Adversarial Attacks. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 7867–7885. [Google Scholar] [CrossRef]
Ran, Y.; Zhang, A.-X.; Li, M.; Tang, W.; Wang, Y.-G. Black-box adversarial attacks against image quality assessment models. Expert Syst. Appl. 2025, 260, 125415. [Google Scholar] [CrossRef]
BenSaid, E.; Neji, M.; Jabberi, M.; Alimi, A.M. Deep keypoints adversarial attack on face recognition systems. Neurocomputing 2025, 621, 129295. [Google Scholar] [CrossRef]
Cui, J.; Gao, S.; Lv, T.; Ji, J.; Yao, S.; Zhou, W. Dual-label guided unrestricted target attack with diffusion model. Neurocomputing 2025, 665, 132185. [Google Scholar] [CrossRef]
Dong, Y.; Wang, L.; Li, Z.; Li, H.; Tang, P.; Hu, C.; Guo, S. Safe Driving Adversarial Trajectory Can Mislead: Toward More Stealthy Adversarial Attack Against Autonomous Driving Prediction Module. ACM Trans. Priv. Secur. 2025, 28, 19. [Google Scholar] [CrossRef]
Wang, J.; Li, F.; He, L. A Unified Framework for Adversarial Patch Attacks Against Visual 3D Object Detection in Autonomous Driving. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 4949–4962. [Google Scholar] [CrossRef]
Chen, G.; Qian, Z.; Zhang, D.; Qiu, S.; Zhou, R. Enhancing Robustness Against Adversarial Attacks in Multimodal Emotion Recognition With Spiking Transformers. IEEE Access 2025, 13, 34584–34597. [Google Scholar] [CrossRef]
Ma, J.; Li, Y.; Xiao, Z.; Cao, A.; Zhang, J.; Ye, C.; Zhao, J. Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models. In Findings of the Association for Computational Linguistics: NAACL 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 3141–3157. [Google Scholar] [CrossRef]
Asimopoulos, D.C.; Radoglou–Grammatikis, P.; Lagkas, T.; Argyriou, A.; Moscholios, I.; Cani, J. AAG: Adversarial Attack Generator for Evaluating the Robustness of Machine Learning Models against Adversarial Attacks. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 15–18 December 2024; pp. 2682–2689. [Google Scholar]
Duan, M.; Qin, Y.; Deng, J.; Li, K.; Xiao, B. Dual Attention Adversarial Attacks with Limited Perturbations. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 13990–14004. [Google Scholar] [CrossRef]
Jain, S.; Dutta, T. Towards Understanding and Improving Adversarial Robustness of Vision Transformers. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–18 June 2024; pp. 24736–24745. [Google Scholar]
Chen, Z.; Li, B.; Wu, S.; Jiang, K.; Ding, S.; Zhang, W. Content-Based Unrestricted Adversarial Attack. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA, 2023; Curran Associates Inc.: Red Hook, NY, USA, 2023. [Google Scholar]
Ilyas, A.; Engstrom, L.; Athalye, A.; Lin, J. Black-Box Adversarial Attacks with Limited Queries and Information. arXiv 2018, arXiv:1804.08598. [Google Scholar] [CrossRef]
Chen, J.; Jordan, M.I. HopSkipJumpAttack: A Query-Efficient Decision-Based Attack. In Proceedings of the 2020 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 19–23 May 2019; pp. 1277–1294. [Google Scholar]
Wang, J.; Li, F.; Lv, S.; He, L.; Shen, C. Physically Realizable Adversarial Creating Attack Against Vision-Based BEV Space 3D Object Detection. IEEE Trans. Image Process. 2025, 34, 538–551. [Google Scholar] [CrossRef]
Rahman, M.; Roy, P.; Frizell, S.S.; Qian, L. Evaluating Pretrained Deep Learning Models for Image Classification Against Individual and Ensemble Adversarial Attacks. IEEE Access 2025, 13, 35230–35242. [Google Scholar] [CrossRef]
Song, Y.; Zhou, Z.; Li, M.; Wang, X.; Zhang, H.; Deng, M.; Wan, W.; Hu, S.; Zhang, L.Y. PB-UAP: Hybride Universal Adversarial Attack for Image Segmentation. In Proceedings of the ICASSP 2025-IEEE International Conference on Acoustics, Speech and Signal Processing, Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar] [CrossRef]
Liu, Z.; Wu, X.; Wang, S.; Shang, Y. Violent Video Recognition Based on Global-Local Visual and Audio Contrastive Learning. IEEE Signal Process. Lett. 2024, 31, 476–480.
Qasim, I.; Horsch, A.; Prasad, D. Dense Video Captioning: A Survey of Techniques, Datasets and Evaluation Protocols. ACM Comput. Surv. 2025, 57, 154.
Yang, Z.; Miao, J.; Wei, Y.; Wang, W.; Wang, X.; Yang, Y. Scalable Video Object Segmentation with Identification Mechanism. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 6247–6262.
Della Torca, S.; Casola, V.; Izzo, S. N-Pixels: A Novel Grey-Box Adversarial Attack for Fooling Convolutional Neural Networks. In Proceedings of the 40th ACM/SIGAPP Symposium on Applied Computing, SAC ’25, Catania, Italy, 31 March–4 April 2025; pp. 1539–1547.
Zeng, Q.; Wang, Z.; Cheung, Y.-m.; Jiang, M. Ask, Attend, Attack: An Effective Decision-Based Black-Box Targeted Attack for Image-to-Text Models. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook, NY, USA, 2025; Curran Associates Inc.: Red Hook, NY, USA, 2025.
Song, W.; Cong, C.; Zhong, H.; Xue, J. Correction-Based Defense against Adversarial Video Attacks via Discretization-Enhanced Video Compressive Sensing. In Proceedings of the 33rd USENIX Security Symposium (USENIX Security 24), Philadelphia, PA, USA, 14–16 August 2024; pp. 3603–3620.
Pan, Y.; Huang, J.-J.; Chen, Z.; Zhao, W.; Wang, Z. SVASTIN: Sparse Video Adversarial Attack via Spatiotemporal Invertible Neural Networks. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6.
Jiang, L.; Ma, X.; Chen, S.; Bailey, J.; Jiang, Y.-G. Black-Box Adversarial Attacks on Video Recognition Models. In Proceedings of the 27th ACM International Conference on Multimedia, MM ’19, New York, NY, USA, 21–25 October 2019; pp. 864–872.
Yan, H.; Wei, X. Efficient Sparse Attacks on Videos Using Reinforcement Learning. In Proceedings of the 29th ACM International Conference on Multimedia, MM ’21, New York, NY, USA, 20–24 October 2021; pp. 2326–2334.
Jiang, K.; Chen, Z.; Huang, H.; Wang, J.; Yang, D.; Li, B.; Wang, Y.; Zhang, W. Efficient Decision-Based Black-Box Patch Attacks on Video Recognition. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Los Alamitos, CA, USA, 2–6 October 2023; pp. 4356–4366.
Li, H.; Xu, X.; Zhang, X.; Yang, S.; Li, B. QEBA: Query-efficient boundary-based blackbox attack. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1218–1227.
Zhang, J.; Li, L.; Li, H.; Zhang, X.; Yang, S.; Li, B. Progressive-scale boundary blackbox attack via projective gradient estimation. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 12479–12490.
Hu, J.; Li, X.; Liu, C.; Zhang, R.; Tang, J.; Sun, Y.; Wang, Y. APDL: An adaptive step size method for white-box adversarial attacks. Complex Intell. Syst. 2025, 11, 116.
Croce, F.; Hein, M. Reliable Evaluation of Adversarial Robustness with an Ensemble of Diverse Parameter-Free Attacks. In Proceedings of the International Conference on Machine Learning, Virtual Conference, 13–18 July 2020; pp. 2206–2216.
Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; Serre, T. HMDB: A Large Video Database for Human Motion Recognition. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2556–2563.
Liu, J.; Zhang, C.; Lyu, X. Boosting the Transferability of Adversarial Examples via Local Mixup and Adaptive Step Size. In Proceedings of the ICASSP 2025-IEEE International Conference on Acoustics, Speech and Signal Processing, Hyderabad, India, 6–11 April 2025; pp. 1–5.
Li, J.; Yu, Z.; He, Z.; Wang, Z.J.; Kang, X. PGD-Imp: Rethinking and Unleashing Potential of Classic PGD with Dual Strategies for Imperceptible Adversarial Attacks. In Proceedings of the ICASSP 2025-IEEE International Conference on Acoustics, Speech and Signal Processing, Hyderabad, India, 6–11 April 2025; pp. 1–5.
Zhao, J.-C.; Ding, J.; Sun, Y.-Z.; Tan, P.; Ma, J.-E.; Fang, Y.-T. Avoiding catastrophic overfitting in fast adversarial training with adaptive similarity step size. PLoS ONE 2025, 20, e0317023.
Gotin, G.; Shumitskaya, E.; Antsiferova, A.; Vatolin, D. Cross-Modal Transferable Image-to-Video Attack on Video Quality Metrics. arXiv 2025, arXiv:2501.08415.
Wei, Z.; Chen, J.; Wu, Z.; Jiang, Y.-G. Adaptive Cross-Modal Transferable Adversarial Attacks from Images to Videos. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 3772–3783.
Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A Dataset of 101 Human Actions Classes from Videos in the Wild. arXiv 2012, arXiv:1212.0402.
Hara, K.; Kataoka, H.; Satoh, Y. Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6546–6555.
Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. In Computer Vision–ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 20–36.
Vranjes, M.; Rimac-Drlje, S.; Grgic, K. Locally Averaged PSNR as a Simple Objective Video Quality Metric. In Proceedings of the 2008 50th International Symposium ELMAR, Zadar, Croatia, 10–12 September 2008; pp. 17–20.
Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; Fergus, R. Intriguing Properties of Neural Networks. In Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, 14–16 April 2014; Bengio, Y., Le Cun, Y., Eds.; ICLR: Banff, AB, Canada, 2014.
Wang, R.; Guo, Y.; Wang, Y. Global-Local Characteristic Excited Cross-Modal Attacks from Images to Videos. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI’23, Washington, DC, USA, 7–14 February 2023; AAAI Press: Washington, DC, USA, 2023.
Chen, K.; Wei, Z.; Chen, J.; Wu, Z.; Jiang, Y.-G. GCMA: Generative Cross-Modal Transferable Adversarial Attacks from Images to Videos. In Proceedings of the 31st ACM International Conference on Multimedia, MM ’23, New York, NY, USA, 29 October–3 November 2023; pp. 698–708.

Figure 1. The framework of DNNs and the workflow of adversarial examples (the plus symbol indicates adding perturbations to the clean sample; the first arrow denotes the generation of the adversarial example; the second arrow represents inputting the adversarial example into the target model; the third arrow shows the target model’s classification result for the adversarial example; the red font highlights the incorrect classification result).

Figure 2. Different accessible components of the target model for the white-box setting, the score-based black-box setting, and the hard-label black-box setting (the first arrow represents inputting the video sample into the target model; the second arrow shows the target model’s classification result for the video sample).

Figure 3. Framework of DBA (arrows indicate the algorithm’s execution process; the pink rectangle denotes the gradient estimation process; the red text is used to indicate misclassification results).

Figure 4. Overview of the decision boundary.

Figure 5. MSE on different target models within 30,000 model queries. (a) MSE on C3D within 30,000 model queries; (b) MSE on TSN within 30,000 model queries.

Figure 6. PSNR on different target models within 30,000 model queries. (a) PSNR on C3D within 30,000 model queries; (b) PSNR on TSN within 30,000 model queries.

Figure 7. AOA on different target models within 30,000 model queries. (a) AOA on C3D within 30,000 model queries; (b) AOA on TSN within 30,000 model queries.

Figure 8. SRT on different target models within 30,000 model queries when MSE ≤ 25. (a) SRT on C3D within 30,000 model queries when MSE ≤ 25; (b) SRT on TSN within 30,000 model queries when MSE ≤ 25.

Figure 9. SRT under different model queries and MSE thresholds on C3D. (a) SRT on C3D when MSE ≤ 5; (b) SRT on C3D when MSE ≤ 10; (c) SRT on C3D when MSE ≤ 15; (d) SRT on C3D when MSE ≤ 20; (e) SRT on C3D when MSE ≤ 25.

Figure 10. SRT under different model queries and MSE thresholds on TSN. (a) SRT on TSN when MSE ≤ 5; (b) SRT on TSN when MSE ≤ 10; (c) SRT on TSN when MSE ≤ 15; (d) SRT on TSN when MSE ≤ 20; (e) SRT on TSN when MSE ≤ 25.

Figure 11. Results for the dynamic step size adjustment strategy. (a) SRT comparison after removing DBA’s step size strategy; (b) SRT for different $v_{ini}$ within 400,000 model queries when MSE ≤ 25 (the dots denote SRT for different $v_{ini}$).


Figure 12. Results for the invalid-sample skipping strategy. (a) Query attempt comparison after removing the invalid-sample skipping strategy; (b) query attempt comparison when b is set to different values.

Table 1. Explanation of symbols.

Serial Number	Symbol	Description
1	$x_{bod}^{t}$	The boundary video generated in the $t$-th iteration
2	$x_{adv}^{t}$	The adversarial video generated in the $t$-th iteration
3	$x_{tgt}$	The target video
4	$CS$	The probability difference function
5	$\phi$	The sign function
6	$u$	The sampled perturbation
7	$\delta$	A small constant
8	$L$	The Lipschitz continuity coefficient of $CS$
9	$\nabla CS$	The true gradient of $CS$
10	$\widetilde{\nabla CS}$	The estimated gradient of $CS$
11	$E$	The expectation function
12	$d$	The number of pixels in a video
13	$P$	The probability function
14	$B$	The Beta distribution
15	$x'$	The video between $x_{bod}^{t}$ and $x_{bod}^{t} + \delta u$
16	$\beta$	The uniform distribution
17	$E_{i}$ $(i = 1, 2, 3)$	The three subregions of the spherical region
18	$v_{1}, \dots, v_{d}$	The orthogonal basis
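
To make the roles of $\phi$, $u$, $\delta$, and $\widetilde{\nabla CS}$ in Table 1 more concrete, the following is a minimal sketch of a sign-based Monte Carlo gradient estimate at a boundary point. It assumes the generic estimator commonly used in hard-label attacks, $\widetilde{\nabla CS} \propto \frac{1}{n}\sum_{i} \phi(x_{bod}^{t} + \delta u_{i})\, u_{i}$ with $u_{i}$ drawn uniformly from the unit sphere; this part of the article does not state DBA's exact estimator, so it may differ. The names `estimate_gradient`, `is_adversarial`, and `n_samples` are hypothetical and chosen only for illustration.

```python
import numpy as np


def estimate_gradient(is_adversarial, x_bod, delta=0.05, n_samples=500, rng=None):
    """Sign-based Monte Carlo estimate of the gradient direction at x_bod.

    is_adversarial: hypothetical hard-label oracle; returns True when the
        queried sample is classified as the adversarial (target) class.
    x_bod: boundary sample as a NumPy array of any shape.
    delta: small probing radius (the "small constant" in Table 1).
    """
    rng = np.random.default_rng() if rng is None else rng
    d = x_bod.size  # number of pixels

    # Draw perturbations u uniformly on the unit sphere in R^d.
    u = rng.standard_normal((n_samples, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)

    # phi is +1 when the probed sample is adversarial and -1 otherwise.
    # Each evaluation costs one model query.
    phi = np.array([
        1.0 if is_adversarial(x_bod + delta * row.reshape(x_bod.shape)) else -1.0
        for row in u
    ])

    # Average the signed perturbations and normalise to a unit direction.
    grad = (phi[:, None] * u).mean(axis=0)
    norm = np.linalg.norm(grad)
    if norm > 0:
        grad = grad / norm
    return grad.reshape(x_bod.shape)


if __name__ == "__main__":
    # Toy linear "model" (an assumption): adversarial iff w . x > 0,
    # so the true gradient direction is w / ||w||.
    w = np.arange(1.0, 11.0)
    oracle = lambda x: float(np.dot(w, x.ravel())) > 0.0
    g = estimate_gradient(oracle, np.zeros((2, 5)), n_samples=2000,
                          rng=np.random.default_rng(0))
    print("cosine with true direction:", float(np.dot(g.ravel(), w) / np.linalg.norm(w)))
```

On the toy linear oracle the estimated direction aligns closely with the true normal of the decision boundary. A real video model would replace `oracle`, and the query budget grows linearly with `n_samples`.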

Table 2. Metrics on C3D within 30,000 model queries (the best results are in bold).

Algorithm	MQN	MSE	PSNR	AOA
DBA	29,910	20.8	35.0	0.08
STDE	29,954	90.1	28.6	0.33
EARL	-	-	-	-
VBAD	-	-	-	-

Table 3. Metrics on TSN within 30,000 model queries (the best results are in bold).

Algorithm	MQN	MSE	PSNR	AOA
DBA	9870	14.6	36.5	0.15
STDE	29,875	39.2	32.2	0.40
EARL	-	-	-	-
VBAD	11,750	14.6	36.5	0.15
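
The MSE and PSNR columns in Tables 2 and 3 can be cross-checked against each other. The sketch below assumes the standard definition $\mathrm{PSNR} = 10 \log_{10}(\mathrm{peak}^2 / \mathrm{MSE})$ with an 8-bit peak value of 255; the article's own definition is not shown in this part, so the peak value is an assumption. The helper name `psnr_from_mse` is hypothetical.

```python
import math


def psnr_from_mse(mse, peak=255.0):
    """Return PSNR in dB for a given MSE, assuming pixel values in [0, peak]."""
    if mse <= 0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(peak ** 2 / mse)


# (MSE, PSNR) pairs copied from Tables 2 and 3.
reported = {
    "DBA on C3D": (20.8, 35.0),
    "STDE on C3D": (90.1, 28.6),
    "DBA on TSN": (14.6, 36.5),
    "STDE on TSN": (39.2, 32.2),
}

for name, (mse, psnr) in reported.items():
    computed = psnr_from_mse(mse)
    print(f"{name}: computed {computed:.2f} dB, reported {psnr} dB")
```

Under this assumption, each computed value agrees with the reported PSNR to within 0.1 dB, which suggests that the two columns are consistent with an 8-bit pixel range.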


Share and Cite

MDPI and ACS Style

Jing, Y.; Wu, L.; Su, K.; Wu, W.; Li, Z.; Deng, Q. Generating Hard-Label Black-Box Adversarial Examples for Video Recognition Models. Mathematics 2026, 14, 1016. https://doi.org/10.3390/math14061016

