1. Introduction
In this section, we define the problem and underlying motivation, provide an overview of our proposed method and key contributions, and position our work within the broader contexts of model security and data privacy. We further articulate the central research questions and objectives that guide this study.
Deep neural networks (DNNs) have shown outstanding performance in recent years, excelling in computer vision [1,2,3], natural language processing [4,5,6], and recommendation systems [7,8,9], and they are now widely applied across various domains. Training an effective DNN requires large-scale datasets, extensive computational power, and significant human resources in practice. Protecting these costly models has therefore become an urgent need [10,11], as unauthorized copying, tampering, and misuse must be prevented. Model watermarking is a key ownership protection technology. Its fundamental idea is simple: a watermark that is difficult to detect and reverse is embedded into the target model, and ownership verification involves extracting the watermark with a specific method and comparing it with the pre-embedded watermark.
Currently, two primary approaches are available for embedding watermarks in models: white-box and black-box watermarking. The core idea of white-box watermarking is to embed watermark information directly into the internal parameters of the target model without affecting its original functionality. Notable works in this area include DeepMarks [12], Riga [13], and DeepIPR [14]. The advantage of white-box techniques is their minimal impact on the performance of the original model. However, they require full access to the internals of that model. In practical scenarios, most deep learning models are provided as online services, where third-party users can access their prediction results only via an application programming interface (API) and have no direct access to their internal parameters. This severely limits the applicability of white-box methods. In contrast, black-box watermarking enables effective verification with only API-level access, thus aligning better with the prevalent service-oriented deployment models [13] and garnering more attention from both academia and industry. Black-box watermarking typically involves implanting a "backdoor" trigger during training, causing the model to produce a predetermined, anomalous output for specific inputs, thereby marking the model [15]. Adi et al. [16] used backdoor attacks to force a deep learning model to memorize specific patterns, representing an early and influential zero-bit black-box watermarking method. Le Merrer et al. [17] proposed embedding a watermark by fine-tuning a local region of the decision boundary. Shao et al. [18] proposed embedding a multibit watermark in the feature space of the target model. The limitation of black-box watermarking is the uncertainty inherent in its embedding process, which often makes it difficult to guarantee stable watermark activation across all input conditions. Furthermore, specifically designed trigger samples can perturb the original decision boundary of the model, leading to incorrect predictions for normal inputs and thus degrading performance on the primary task.
To address the aforementioned limitations, a model watermarking method based on an orthogonal feature space is introduced in this paper. The core idea is to enforce an orthogonality constraint between the watermark-related features and the original task-decision features of the target model, promoting a high degree of linear independence between them in the feature space. This strategy is inspired by the effectiveness of orthogonality in information representation separation and task decoupling scenarios, and it helps minimize the interference imposed by the watermarking process on the primary task performance of the model, thereby increasing the watermark embedding success rate while reducing the induced performance loss. The main contributions of this paper are as follows.
We propose a harmless black-box watermarking method named orthogonal feature space watermarking (OFSW). In this method, we transform the feature representations of specific trigger samples into a watermark by adding a watermark-related constraint to the loss function. Simultaneously, we introduce an orthogonal regularization term to the loss function, which is designed to maintain orthogonality between the watermark features and the original task features of the target model. Owing to this regularization scheme, the watermark embedding process minimally interferes with the ability of the model to classify normal samples, thus preserving its performance in the primary task.
We apply an orthogonalization-promoting constraint to the parameter matrices of the model. This strategy further reduces the impact of OFSW on the standard predictions yielded by the model while simultaneously improving the success rate of watermark embedding.
We conduct extensive experiments on the ResNet-18 and ResNet-101 models, comparing OFSW with the existing watermarking techniques. The results demonstrate that OFSW has significant advantages in terms of both watermark effectiveness and its harmlessness to the target model.
2. Related Works
The existing model watermarking methods can be categorized into two types: white-box methods and black-box methods. This section provides a systematic review of white-box and black-box watermarking techniques as well as multibit schemes and their limitations, thereby establishing the research background.
White-box watermarking methods assume full access to the target model, including its architecture, parameters, and activation maps, during both the embedding and verification processes. When a watermark is embedded into a DNN, the model owner typically modifies the model parameters directly to insert the watermark [10,19]. For example, Uchida et al. [12] proposed a method that embeds a watermark by fine-tuning the target model with a watermark regularization term in the loss function. Watermarks can also be embedded by adjusting the model architecture [19,20], embedding external features [21], introducing a transposed model [22], using activation maps [13,23], or adding passport layers [10,23]. Similarly, during the verification process, the verifier is assumed to have full access to the model parameters. However, this assumption is often impractical in real-world applications, as most models are accessed via APIs. Therefore, the applicability of white-box methods is severely limited.
Black-box watermarking methods assume that only the output of a suspicious model can be observed during verification; they do not require direct access to the internal structure of the model. These methods are typically implemented using backdoor attack mechanisms [24,25]. The model owner implants a set of "trigger samples" during the training phase, causing the DNN to produce a predefined, anomalous output when it encounters these specific triggers [26,27]. The trigger samples are proprietary, confidential data known only to the model owner. To verify ownership, the trigger set is fed to the model; if the model produces the expected exclusive response, ownership is confirmed. Black-box methods offer excellent task adaptability and deployment flexibility. They have been widely used in traditional image classification scenarios [26,28] and have also been successfully extended to other domains, including federated learning [29,30], text generation [31,32], and prompt engineering [33,34], achieving security goals such as ownership verification, infringement detection, and accountability.
Black-box watermarking methods can be further classified into zero-bit and multibit methods according to the amount of information that is embedded. Zero-bit methods only indicate the presence or absence of a watermark and do not store additional information: the model is trained to produce a fixed, anomalous response to a specific trigger set during embedding [35], and the presence of the watermark is confirmed during verification if the misclassification rate exceeds a predefined threshold [36]. Zero-bit schemes are simple to implement, have low costs, and are widely applicable, but they cannot carry identity information [37] and are vulnerable to adversarial attacks [31]; these weaknesses require mitigation through other techniques. Multibit methods aim to embed a string of information, typically copyright identifiers such as digital signatures or owner identities. BlackMarks [38] encodes a bit value (0 or 1) for each possible output class and then generates a set of key-image–label pairs on the basis of a predetermined binary signature; the target model is fine-tuned to embed the specific behaviours associated with these pairs. Explanation as a Watermark (EaaW) [18] embeds a multibit watermark into the feature space of the model without altering the original predictions produced by the model for those samples; the watermark is embedded in the explanation output of the model, making this approach both stealthy and harmless. Compared with zero-bit watermarking, multibit methods offer significant advantages in terms of functionality and security. However, the existing research still faces limitations and challenges: improving the watermark embedding success rate often reduces accuracy on the primary task, necessitating a trade-off between the two goals [39].
In this paper, we embed a multibit watermark in the feature space of a model to address these limitations. We introduce an orthogonal regularization term that forces the watermark to be nearly orthogonal to the primary task direction, thereby minimizing interference with the decision boundary. This approach ensures reliable watermark embedding while preserving the normal predictive performance of the developed model.
3. Algorithm
This section details the design philosophy, overall framework, and specific implementation of OFSW, presenting and formalizing the OFSW algorithm and introducing its core methodology.
3.1. Overall Framework of OFSW
This section decomposes and elucidates the three modules of OFSW as well as the statistical methods for orthogonality.
The basic idea of the OFSW model is as follows. We represent a machine learning model as spanning a function space, and using orthogonalization, we identify an orthogonal complement space in which we embed our watermark. Owing to the orthogonality between the complement space and the function space, modifying the position of an input variable in the orthogonal complement space does not alter its position in the function space, thereby effectively reducing the impact on the model itself that commonly arises in black-box watermarking techniques during the embedding process. In the watermark embedding stage, we transform the watermark into different feature representations within the feature space. To achieve this, we employ convolutional kernels derived from multiple perspectives. Unlike conventional approaches, we do not directly relabel trigger samples to designated watermark classes, which could mislead the model. Instead, we introduce a kernel orthogonalization module to mitigate the impact of watermark embedding on model performance. This module applies orthogonal constraints to the convolutional kernels and fully connected layer weights during training.
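As a simple illustration of how such a constraint can be imposed during training, the following PyTorch sketch adds a soft orthogonality penalty over the reshaped convolutional kernels and fully connected weights of a model. The function name and the particular penalty form (the deviation of the normalized Gram matrix from the identity) are illustrative assumptions; the angle-based regularizer actually used by OFSW is formalized in Section 3.2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_orthogonality_penalty(model: nn.Module) -> torch.Tensor:
    """Illustrative soft orthogonality penalty (not the exact OFSW regularizer).

    Each Conv2d/Linear weight tensor is flattened so that every row is one
    kernel; the penalty measures how far the normalized Gram matrix of these
    rows is from the identity, i.e., how far the kernels are from being
    mutually orthogonal.
    """
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            w = module.weight.reshape(module.weight.shape[0], -1)  # one kernel per row
            w = F.normalize(w, dim=1)                              # unit-norm rows
            gram = w @ w.t()                                       # pairwise cosines
            eye = torch.eye(gram.shape[0], device=gram.device)
            penalty = penalty + ((gram - eye) ** 2).sum()
    return penalty

# Usage inside a training step (the weighting coefficient is hypothetical):
# loss = task_loss + 1e-4 * soft_orthogonality_penalty(model)
```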
The OFSW framework consists of three main processes: watermark embedding, watermark extraction, and identity verification.
Figure 1 provides a brief illustration of this workflow. Watermark embedding involves adding specific and imperceptible identification information to the target model, which serves as proof of its ownership or origin. Watermark extraction analyses the model’s inputs and outputs to identify and verify the embedded watermark information, thereby confirming the model’s provenance. Identity verification uses the extracted watermark to validate the model’s legitimacy and the identity of its rightful owner, ensuring that the model is not misused or tampered with without authorization.
The OFSW framework is composed of three core modules: (1) the orthogonalization module, (2) the watermark interpretation module, and (3) the watermark comparison module.
During the embedding stage, the orthogonalization module and the watermark interpretation module provide critical gradient information, leading to the construction of a watermarked model whose parameter matrix spans a function space orthogonal to that spanned by the predefined convolutional kernels.
In the extraction stage, we input the predefined trigger samples into the watermarked model. The outputs, together with the predefined orthogonal convolutional kernels, are fed into the watermark interpretation module to compute the watermark. Specifically, the outputs serve as y, and the orthogonal convolutional kernels, once expanded and concatenated, form x. The watermark is then obtained as the weight matrix from the ridge regression of x and y, where negative coefficients represent 0-bits and positive coefficients represent 1-bits.
In the verification stage, the watermark comparison module calculates the similarity between the extracted watermark and the originally embedded watermark, and the p-value of a chi-square distribution is used as the evaluation metric.
Next, the specific algorithms developed for these three components and the implementation details of OFSW are elaborated upon.
3.2. Watermark Embedding
This section presents the multi-objective loss function, including the soft orthogonal regularization and watermark embedding terms, together with the training procedure, thereby defining the embedding mechanism.
Figure 2 illustrates the workflow of the proposed method. For a given target model, we first complete its standard training process. A set of trigger samples is then selected from the dataset, serving as the foundation for subsequent ownership verification. If the model watermark is considered a lock imposed on the model, the trigger samples can be viewed as the key to that lock. During the watermark extraction stage, which is used for ownership authentication, both the trigger samples and the extraction method must be provided to retrieve the embedded watermark from the model. We then adopt a multi-objective optimization strategy that minimizes the prediction loss on the training samples while applying orthogonal constraints in the feature space to reserve a subspace orthogonal to the feature directions of the trigger samples and maximizing the projection of the watermark vector onto this orthogonal subspace. This ensures that the embedded watermark can be reliably extracted.
To compute the k-bit watermark for embedding purposes, we design a watermark explanation module for the target model. First, for each trigger sample, we construct k mutually orthogonal convolutional kernels and perform multiperspective convolution on the input image to obtain feature representations corresponding to the dimensionality of the watermark. These features are then fed into the model to obtain a set of evaluation vectors that measure the importance of each perspective based on the degree of matching between the output and the true labels. Subsequently, we linearly fit the evaluation vectors and the convolutional features to obtain a weight matrix, which is then quantized to 0/1 on the basis of the sign of the weights (see
Section 3.3 for calculation details). Finally, we save the model parameters after the watermark embedding, the trigger sample set, and the watermark vector to provide a complete basis for the subsequent watermark verification and ownership claim stages.
During the watermark embedding phase, the model owner embeds the watermark by fine-tuning a pretrained model. Concurrently, the owner must ensure that the performance of the model is minimally affected after the embedding process. To better balance the trade-off between the embedding success rate and model performance, the owner should strive to ensure that the feature space of the model and the watermark embedding space are mutually orthogonal.
Building upon the above objectives, we define the watermark embedding task as a multitask optimization problem with three goals: (1) preserving the original task of the target model; (2) enforcing orthogonality between the function space spanned by the parameter matrices of the target model and the function space spanned by our predefined convolutional kernels; and (3) ensuring that the weights obtained via ridge regression between the outputs of the trigger samples and our predefined kernels match the embedded watermark symbols. Based on these considerations, we propose the loss function shown in Equation (1):

L(θ) = L1(θ; X, Y, X_T, Y_T) + r1 · L2(θ) + r2 · L3(θ), (1)

where θ represents the parameters of the target model, X denotes the clean samples, X_T denotes the trigger samples, and Y and Y_T are their corresponding labels. The orthogonality() function, which underlies L2, is used to evaluate the orthogonality of the parameters contained in the target model, and the extract() function, which underlies L3, is used to extract the watermark from the target model. W is the embedded watermark, and r1 and r2 are weighting coefficients.
Equation (1) consists of three parts.
The first part, L1, is the loss function of the initial deep neural network. This ensures that the predictions produced for both the clean dataset and the trigger set are consistent with their corresponding labels, thus preserving the functionality of the model.
The second part, L2, is aimed at promoting orthogonality among the model kernels. Intuitively, orthogonal kernels can better span the parameter space, especially in high-dimensional cases where the kernel dimensionality is greater than the number of kernels. Inspired by Ziming Zhang et al. [40], we first approximate the angles between the kernels in each hidden layer using the kernel responses of the model. We subsequently drive the mean and variance of these angles towards 90° and 0°, respectively. This serves as the orthogonal regularization term, as shown in Equation (2), which involves the pool of pairwise kernel angles in the i-th hidden layer together with importance weights for its two parts. The first term is a weaker orthogonal regularization term, which is aimed at driving the mean angle between all pairs of weight matrices towards 90°. The second term is stricter, aiming to drive both the means and the variances of the angles between all pairs of weight matrices towards 90° and 0°, respectively. Research conducted by Vorontsov et al. [41] indicates that imposing hard orthogonality constraints on neural networks can reduce their convergence speeds and harm their performance, whereas soft orthogonality constraints can improve the training process. Considering both accuracy and computational speed, we set the two importance weights accordingly in this paper.
In our algorithm, we use Equation (3) to estimate ϑ. Owing to the linear transformation step, the means and variances corresponding to 90° and 0° in the kernel angle space are both mapped to 0 in the ϑ space. In Equation (3), tanh() is an entrywise function, γ is a scalar parameter, w_i is the i-th kernel vector in a hidden layer (the i-th kernel of the α-th layer when the layer index is made explicit), x is the input of that hidden layer, N is the number of layers contained in the model, and y_i is the output produced after the computation of the i-th kernel.
To establish that the batch response products indeed estimate the kernel angles in this way, we first prove the following lemma.
Lemma 1. Without loss of generality, let θ be the angle between two vectors w_1 and w_2, and let B^d be the unit ball in d-dimensional space. We then have

E_{x∼B^d}[sign(w_1^T x) · sign(w_2^T x)] = 1 − 2θ/π,

where E[·] is the expectation operator, the sample x is uniformly drawn from B^d, and sign() is the sign function returning 1 for positives and −1 otherwise.

Proof. Let u_1 = w_1/‖w_1‖ and u_2 = w_2/‖w_2‖. Since scaling does not change the sign, sign(w_i^T x) = sign(u_i^T x). Sampling uniformly from the unit ball and choosing directions uniformly from the unit sphere are equivalent at the "sign" level (as the radial length does not affect the sign). Hence, the expectation can be equivalently viewed as uniform sampling on the unit sphere. Now, restrict attention to the two-dimensional plane spanned by u_1 and u_2 and consider selecting a normal-vector direction uniformly from the unit circle in this plane. The vectors u_1 and u_2 are separated by a hyperplane passing through the origin if and only if the normal vector falls within two arc segments of total length 2θ. Therefore, the probability that sign(w_1^T x) and sign(w_2^T x) differ is 2θ/(2π) = θ/π, so the expectation of their product equals (1 − θ/π) − θ/π = 1 − 2θ/π, which is equivalent to Equation (6). We then complete our proof. □
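The relation stated in Lemma 1 can also be checked numerically. The short NumPy experiment below samples points uniformly from the unit ball and compares the empirical expectation of the sign products with 1 − 2θ/π; the dimensionality, sample count, and random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 16, 200_000

# Two fixed directions with a known angle theta between them.
w1 = rng.standard_normal(d)
w2 = rng.standard_normal(d)
theta = np.arccos(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2)))

# Uniform samples from the unit ball: uniform direction times radius^(1/d).
x = rng.standard_normal((n_samples, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)
x *= rng.uniform(0.0, 1.0, size=(n_samples, 1)) ** (1.0 / d)

empirical = np.mean(np.sign(x @ w1) * np.sign(x @ w2))
predicted = 1.0 - 2.0 * theta / np.pi
print(f"empirical = {empirical:.4f}, predicted = {predicted:.4f}")
```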
The third part, L3, focuses on the watermark embedding success rate. Here, we choose a hinge-like loss, which is commonly used in computer vision tasks, as shown in Equation (7).
In Equation (7), the i-th bit of the watermark extracted from the model is compared with the i-th bit of the watermark that we embed; k is the number of bits contained in the watermark, and ε is a control variable.
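To illustrate how the three terms of Equation (1) can be assembled in a training step, the following PyTorch sketch combines the task loss, the orthogonality regularizer, and a hinge-like watermark term. The particular hinge form (a margin ε on the product of the embedded ±1 bit and the raw extracted coefficient) is one plausible instantiation of Equation (7), and the helper functions, their signatures, and the coefficient values are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def hinge_watermark_loss(extracted: torch.Tensor,
                         watermark_bits: torch.Tensor,
                         eps: float = 0.1) -> torch.Tensor:
    """Hinge-like term over the k raw (pre-binarization) extracted coefficients.

    watermark_bits holds the embedded bits in {0, 1}; they are mapped to
    {-1, +1} so that a correctly signed coefficient with margin >= eps
    contributes zero loss.  This form is an illustrative assumption.
    """
    signs = 2.0 * watermark_bits.float() - 1.0
    return torch.clamp(eps - signs * extracted, min=0.0).mean()

def ofsw_training_loss(model, x, y, x_trigger, y_trigger,
                       extract_fn, ortho_fn, watermark_bits,
                       r1=0.1, r2=1.0):
    """Equation (1): task loss + r1 * orthogonality term + r2 * watermark term.

    extract_fn and ortho_fn stand in for the extraction and orthogonality
    modules described in the text; their exact signatures are hypothetical.
    """
    l1 = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_trigger), y_trigger)
    l2 = ortho_fn(model)                               # e.g. the angle-based term above
    l3 = hinge_watermark_loss(extract_fn(model, x_trigger), watermark_bits)
    return l1 + r1 * l2 + r2 * l3
```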
3.3. Watermark Extraction
This section introduces the linear interpretation module and ridge regression decoding, presents the corresponding pseudocode, and provides the decoding procedure for black-box watermark verification.
In this stage, we design a linear explanation module for extracting the watermark. We use k mutually orthogonal convolutional kernels to perform multiperspective convolution on the samples in the trigger set. This generates a representation dataset with the same dimensionality as that of the watermark. This representation dataset is then fed into the target model along with the true labels. We compute an evaluation vector that reflects the importance of each convolutional perspective. Next, we use ridge regression to fit the evaluation vector and the convolutional kernels. We binarize the resulting weight matrix. This finally yields our embedded k-bit binary watermark.
Figure 3 shows the main flow of the algorithm. It consists of three steps: (1) constructing the representation dataset; (2) evaluating the model predictions; and (3) performing linear fitting.
The specific implementation is described below.
1. Constructing the Representation Dataset
For each image X_T in the trigger set, to extract a k-bit watermark, we need to obtain feature maps from k different perspectives. Therefore, we randomly generate k d-dimensional vectors and orthogonalize them to obtain k mutually orthogonal convolutional kernels. Then, we perform a convolution operation on each image with these kernels to obtain the representation dataset X_p. Thus, for each image included in the trigger set, we obtain a representation dataset whose elements match the dimensionality n of the input vector employed by the target model.
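The following sketch shows one way to construct the k orthogonal kernels and the representation dataset: k random d-dimensional vectors are orthogonalized with a QR decomposition, and each resulting kernel is convolved with a trigger image. The kernel spatial size, channel layout, and all tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def make_orthogonal_kernels(k: int, channels: int, size: int) -> torch.Tensor:
    """Draw k random kernels and orthogonalize their flattened
    d = channels * size * size representations via a QR decomposition."""
    d = channels * size * size
    assert k <= d, "cannot have more mutually orthogonal kernels than dimensions"
    random_vectors = torch.randn(d, k)
    q, _ = torch.linalg.qr(random_vectors)   # columns of q are orthonormal
    return q.t().reshape(k, channels, size, size)

def representation_dataset(trigger_image: torch.Tensor,
                           kernels: torch.Tensor) -> torch.Tensor:
    """Convolve one trigger image (C, H, W) with the k orthogonal kernels,
    producing k single-channel feature maps (the multiperspective views)."""
    x = trigger_image.unsqueeze(0)           # (1, C, H, W)
    pad = kernels.shape[-1] // 2             # keep the spatial size unchanged
    return F.conv2d(x, kernels, padding=pad).squeeze(0)   # (k, H, W)

# Example with hypothetical sizes: 32 kernels of shape (3, 5, 5) on a 32x32 image.
kernels = make_orthogonal_kernels(k=32, channels=3, size=5)
views = representation_dataset(torch.randn(3, 32, 32), kernels)
print(kernels.shape, views.shape)  # torch.Size([32, 3, 5, 5]) torch.Size([32, 32, 32])
```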
2. Generating Evaluation Vectors for the Model Predictions
We input the representation dataset X_p into the target model to obtain the prediction result p = M(X_p; θ), which is a matrix with m columns, where m is the length of the output vector. Then, we introduce an evaluation function, Evaluation(), as shown in Equation (8), to measure the consistency between this output vector and the target watermark vector. In Equation (8), Y_W is the evaluation vector, Y_T is the corresponding label for X_T, 1(·) denotes the indicator function, j indexes the classification classes, and p_{i,j} is the j-th element of the model output for the i-th input image in the representation dataset.
3. Explaining the Watermark
After calculating the evaluation vector Y_W, we need a linear fitting method to quantify the contribution of each representation image contained in the representation dataset. We use ridge regression to solve for the weights W. Treating Y_W as the response variable y and the stacked column vectors of the flattened orthogonal convolutional kernels as the independent variable matrix X, the weight W derived from the ridge regression equation is the embedded watermark information. The ridge regression objective is shown in Equation (9):

W = argmin_W ||Y_W − X·W||² + λ||W||², (9)

where λ is a hyperparameter and W is the weight matrix obtained from X and Y_W through ridge regression. Finally, we extract the watermark from the model using the closed-form solution in Equation (10):

W = (X^T X + λI)^(−1) X^T Y_W, (10)

where I is the k × k identity matrix. Through these steps, we fit the linear relationship between X and Y_W, and the binarized weight matrix becomes our watermark. The pseudocode for the watermark extraction algorithm is given in Algorithm 1.
Algorithm 1. Watermark extraction algorithm.
Input: Trigger samples X_T, Y_T; the watermarked model M with parameters θ; orthogonal convolutional kernels K.
Output: A k-bit vector Ŵ representing the extracted watermark.
1: X_p = K * X_T
2: p = M(X_p; θ)
3: Y_W = Evaluation(p, Y_T)
4: W = (X^T X + λI)^(−1) X^T · Y_W, where X is the matrix of flattened kernels
5: Ŵ = zero_like(W)
6: for i = 0 to k − 1 do
7:   if W_i ≥ 0 then
8:     Ŵ_i = 1
9:   else
10:    Ŵ_i = 0
11: return Ŵ
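The following NumPy sketch mirrors Algorithm 1. The evaluation step (taking the score that the model assigns to the trigger sample's true class from each perspective) is one plausible reading of Equation (8), the model is abstracted as a callable, and all tensor shapes are illustrative assumptions rather than the paper's exact specification.

```python
import numpy as np

def extract_watermark(model_fn, trigger_views, true_label, kernel_matrix, lam=1.0):
    """Extract a k-bit watermark (illustrative rendering of Algorithm 1).

    model_fn:       callable mapping a batch of inputs to class scores (n, m)
    trigger_views:  representation dataset built from one trigger sample
    true_label:     index of the trigger sample's true class
    kernel_matrix:  design matrix X of shape (n, k) whose columns are the
                    flattened orthogonal kernels (an assumed layout)
    """
    # Step 2: query the model with the representation dataset.
    p = model_fn(trigger_views)                       # (n, m) prediction matrix

    # Step 3: evaluation vector -- here, the true-class score per view
    # (an assumed instantiation of Equation (8)).
    y_w = p[:, true_label]                            # (n,)

    # Step 4: ridge regression of y_w on the kernel design matrix X.
    X = kernel_matrix
    k = X.shape[1]
    w = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y_w)

    # Steps 5-11: binarize by sign.
    return (w >= 0).astype(np.uint8)                  # k-bit watermark

# Toy usage with a random "model" and random data (shapes are hypothetical):
rng = np.random.default_rng(0)
n, k, m = 64, 32, 10
bits = extract_watermark(lambda v: rng.standard_normal((v.shape[0], m)),
                         rng.standard_normal((n, 5)), 3,
                         rng.standard_normal((n, k)))
print(bits.shape)  # (32,)
```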
3.4. Identity Verification
This section presents a chi-square-test–based protocol for determining watermark consistency, thereby verifying ownership.
The final part of the OFSW method involves using the embedded and extracted watermarks to verify the identity of the model owner. When a model owner encounters a suspicious model, they can use the trigger set and the set of orthogonal convolutional kernels to extract a watermark from that suspicious model. To measure the similarity between the extracted watermark and the original watermark, we use Pearson's chi-square test [42]. If the resulting p-value is below a control parameter, the two watermarks are considered the same, and the model owner can claim that their model has been plagiarized or used without authorization.
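One possible realization of this test with SciPy is sketched below; the 2 × 2 contingency-table construction and the significance threshold are illustrative assumptions, since only the use of Pearson's chi-square test on the two watermarks is prescribed above.

```python
import numpy as np
from scipy.stats import chi2_contingency

def verify_ownership(extracted_bits, embedded_bits, alpha=0.01):
    """Compare two k-bit watermarks with Pearson's chi-square test.

    A 2x2 contingency table counts how often (extracted, embedded) bit pairs
    agree or disagree; a p-value below alpha is taken as evidence that the
    extracted watermark matches the embedded one.
    """
    e = np.asarray(extracted_bits, dtype=int)
    w = np.asarray(embedded_bits, dtype=int)
    table = np.zeros((2, 2), dtype=int)
    for a, b in zip(e, w):
        table[a, b] += 1
    _, p_value, _, _ = chi2_contingency(table)
    return p_value, p_value < alpha

# Toy example: 64 embedded bits, extracted copy with 3 flipped bits.
rng = np.random.default_rng(0)
embedded = rng.integers(0, 2, size=64)
extracted = embedded.copy()
extracted[:3] ^= 1
print(verify_ownership(extracted, embedded))
```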
5. Discussion and Analysis
This section presents a robustness analysis of OFSW under adaptive attacks and a cross-modal (GPT-2) case study to evaluate its practicality and transferability.
To assess OFSW's practicality and transferability, we organize the discussion around two themes: "robustness" and "cross-modal generalization." First, at the parameter level, we construct three representative adaptive attacks (fine-tuning, overwriting, and unlearning) to systematically test whether an adversary without access to the trigger samples and/or with only partial knowledge of the target watermark can simultaneously preserve primary-task performance (Test Acc.) and effectively decrease the watermark's statistical significance (p-value) and watermark success rate (WSR); see Table 25 for the results. Next, we extend OFSW from the image domain to text generation: using GPT-2 as the backbone, we embed multibit watermarks (32/64/128) on WikiText, BookCorpus, PTB-text-only, and LAMBADA and evaluate the method's overhead and verifiability in language models via the perplexity (PPL), p-value, and WSR; see Table 26. Together, these two parts address two core questions: (i) whether OFSW remains reliably provable under realistic attacks and (ii) whether OFSW's design maintains stable watermark detectability and usability when it is transferred across modalities.
(A) Effect of different attacks on watermarks embedded by OFSW. In real deployments, a model-stealing adversary may become aware of the watermark and design adaptive attacks to evade or weaken verification. Concretely, they modify model parameters to affect watermark embedding and extraction. Existing watermark-breaking techniques fall into three broad categories: (1) fine-tuning attack; (2) overwriting attack; and (3) unlearning attack.
Scenario 1 (Fine-Tuning Attack).
Assume that the adversary knows the overall OFSW pipeline but not the trigger samples used by the model owner nor the target watermark. Without introducing any watermark-related objective, the adversary continues training the stolen model on clean data (in-domain or cross-domain), optionally using common tricks such as early stopping, weight decay, and data augmentation. The goal is to slightly perturb the parameters, and hence the corresponding explanations, to weaken the alignment between the model explanations and the original watermark while largely preserving the primary task. We refer to this lightweight, parameter-level modification as a fine-tuning attack.
Scenario 2 (Overwriting Attack).
Assume that the adversary knows the OFSW pipeline but not the original trigger samples or target watermark. The adversary independently generates a new set of trigger samples and a new watermark and then optimizes a watermark-related loss on the stolen model in an attempt to write in the new watermark and overwrite the original signal, thereby rendering the original watermark invalid at verification. This adaptive strategy is termed an overwriting attack.
Scenario 3 (Unlearning Attack).
Assume that the adversary knows the embedded target watermark but still does not know the trigger samples. The adversary randomly selects or synthesizes substitute trigger samples and updates the model in the direction opposing the watermark gradient (reducing the watermark score/margin) to actively “unlearn” the original watermark while maintaining primary-task performance. This assumption is realistic: target watermarks are often guessable (e.g., a company logo or a developer’s avatar). This type is referred to as an unlearning attack.
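For concreteness, the following PyTorch fragment sketches the parameter update behind such an unlearning attack: the adversary descends on a task loss while ascending on an estimated watermark loss computed from substitute trigger samples. The loss functions, substitute data, and step size are placeholders for the adversary's own choices and are not specified by our threat model.

```python
import torch

def unlearning_attack_step(model, optimizer, task_batch, substitute_triggers,
                           task_loss_fn, watermark_loss_fn, beta=1.0):
    """One adaptive 'unlearning' update: descend on the task loss while
    ascending on an estimated watermark loss (both loss callables are
    supplied by the adversary; their exact forms are not prescribed here)."""
    x, y = task_batch
    optimizer.zero_grad()
    loss = task_loss_fn(model(x), y) - beta * watermark_loss_fn(model, substitute_triggers)
    loss.backward()
    optimizer.step()
    return loss.item()
```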
Across the three representative adaptive attacks of fine-tuning, overwriting, and unlearning (Table 25), the OFSW watermark is not effectively removed; only different degrees of performance degradation and decreases in the matching rate are observed.
Fine-tuning setting (continued training on clean data only, with no watermark objective): The test accuracy is 85.63% (down 4.21 pp from 89.84% before embedding), the original watermark retains WSR = 0.887, and the p-value remains extremely low, indicating that light parameter perturbations struggle to undermine the watermark's statistical significance.
Overwriting setting (an attacker-chosen watermark that conflicts with the original is written into the victim model; convergence after 10 epochs): The accuracy decreases, yet the original watermark remains extractable, so the outcome is the "coexistence of two watermarks" rather than true "overwriting."
Unlearning setting (updates taken opposite the watermark gradient): The accuracy is 85.27% (down 4.57 pp), with WSR = 0.872 for the original watermark, and the p-value still remains extremely low.
Overall, none of the three attacks can remove the OFSW watermark without incurring substantial primary-task degradation.
(B) Extending from images to text.
To transfer our method from the image domain to text, we replace the 2D convolutions in the network with 1D convolutions along the sequence dimension. Concretely, a sentence is mapped to an L × d matrix (sequence length L and embedding dimension d; each row is a token vector). A shared convolutional kernel of width k slides along the sequence, computing at position i a weighted combination of the token vectors x_i, …, x_{i+k−1}. This design transfers the inductive biases of image convolutions to text, namely local receptive fields, parameter sharing, and translation invariance, effectively capturing n-gram patterns while retaining linear time complexity and good throughput. Aside from modality-specific form factors, the training objectives and watermark embedding pipeline match the image setup.
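A minimal PyTorch illustration of this substitution is given below: the token embeddings of a sentence are treated as a d-channel one-dimensional signal and convolved with shared kernels of width k along the sequence dimension. The embedding size, kernel width, and number of kernels are arbitrary example values.

```python
import torch
import torch.nn as nn

L, d, k, num_kernels = 128, 768, 5, 32   # example sizes only

tokens = torch.randn(L, d)               # one sentence: L token vectors of size d
x = tokens.t().unsqueeze(0)              # (1, d, L): channels = embedding dimensions

conv1d = nn.Conv1d(in_channels=d, out_channels=num_kernels,
                   kernel_size=k, padding=k // 2)
features = conv1d(x)                     # (1, num_kernels, L): n-gram-like responses
print(features.shape)                    # torch.Size([1, 32, 128])
```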
We use GPT-2 [45] as a case study for applying OFSW to text generation models, chosen because it is a representative open-source transformer and many stronger LMs share similar architectures. We fine-tune GPT-2 and embed multibit watermarks on WikiText [46], BookCorpus [47], PTB-text-only [48], and LAMBADA [49]. Specifically, we randomly select a training sequence as the trigger sample and randomly generate a k-bit string as the watermark, with the lengths set to 32, 64, and 128.
After 32/64/128-bit watermarks are embedded into GPT-2 on WikiText, BookCorpus, PTB-text-only, and LAMBADA, all 12/12 experiments achieve WSR = 1, and the p-values are consistently very small. Relative to the "No WM" baseline, the absolute PPL increments range from +0.56 to +3.58 (average +1.95 across the four datasets) for 32 bits, from +1.63 to +5.27 (average +3.12) for 64 bits, and from +5.88 to +9.49 (average +7.25) for 128 bits. Among the datasets, BookCorpus shows the smallest increases at 32/64 bits, whereas PTB-text-only shows a comparatively larger increase at 128 bits. Overall, the security indicators saturate (WSR = 1, extremely small p-values), and the PPL degradation is positively correlated with watermark length: 32/64 bits yield only light-to-moderate increases in perplexity in most scenarios, whereas 128 bits provide stronger separability at a higher cost. In sum, OFSW exhibits strong practical effectiveness and broad applicability.
Synthesizing the three adaptive attacks and the cross-modal experiments, OFSW strikes a favourable balance between statistical detectability (very small p-values and a stable WSR) and functionality preservation (controlled accuracy/PPL cost). Even under practically feasible parameter-level attacks—fine-tuning, overwriting, and unlearning—the original watermark generally cannot be removed without noticeably sacrificing primary-task performance. Moreover, after extending 2D convolutions to 1D sequence convolutions, the method shows equally stable detectability on GPT-2, confirming OFSW’s transferability and generality.
Consider the algorithmic complexity of OFSW. The time complexity has two main components. The first is the standard convolution operations: one layer with F kernels, each with d parameters, applied over D samples, costs on the order of O(D · F · d). The second is the extra overhead of the soft-orthogonality constraint, which requires computing the angles from the batch response matrix of the kernels; this is equivalent to one matrix multiplication with complexity O(D · F²). Hence, the overall time complexity is O(D · F · (d + F)).
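As a back-of-the-envelope check of this cost model, the following snippet evaluates the two terms for hypothetical layer and batch sizes; the numbers are purely illustrative.

```python
# Back-of-the-envelope check of the two cost terms (all sizes hypothetical).
D = 128          # samples in a batch
F = 64           # kernels in the layer
d = 3 * 3 * 64   # parameters per kernel (3x3 kernel, 64 input channels)

conv_cost = D * F * d        # standard convolution responses
ortho_cost = D * F * F       # response-matrix product for pairwise angles

print(f"convolution term   ~ {conv_cost:,} MACs")   # ~4,718,592
print(f"orthogonality term ~ {ortho_cost:,} MACs")  # ~524,288
```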
To embed a watermark while preserving model utility as much as possible, we preferentially embed it into the orthogonal complement of the model’s representation function space. Our current implementation selects orthogonal convolution kernels before training; since orthogonal kernels may not perfectly match the characteristics of the trigger samples, future work can introduce automated kernel selection/optimization modules and explore additional optimization techniques to further reduce the overhead. Notably, although the impact of OFSW on the model is typically negligible, it is intrinsically an intrusive watermark. Motivated by this, we plan to explore non-intrusive watermarking schemes in future work to achieve verifiability and traceability with zero or minimal modifications to the protected model.