1. Introduction
In this section, we define the problem and underlying motivation, provide an overview of our proposed method and key contributions, and position our work within the broader contexts of model security and data privacy. We further articulate the central research questions and objectives that guide this study.
Deep neural networks (DNNs) have shown outstanding performance in recent years, excelling in computer vision [1,2,3], natural language processing [4,5,6], and recommendation systems [7,8,9], and they are now widely applied across various domains. Training an effective DNN requires large-scale datasets, extensive computational power, and significant human resources in practice. Protecting these costly models has therefore become an urgent need [10,11], as unauthorized copying, tampering, and misuse must be prevented. Model watermarking is a key ownership protection technology. Its fundamental idea is simple: a watermark that is difficult to detect and reverse is embedded into the target model, and ownership verification involves extracting the watermark with a specific method and comparing it with the pre-embedded watermark.
Currently, two primary approaches are available for embedding watermarks in models: white-box and black-box watermarking. The core idea of white-box watermarking is to embed watermark information directly into the internal parameters of the target model without affecting its original functionality. Notable works in this area include DeepMarks [12], Riga [13], and DeepIPR [14]. The advantage of white-box techniques is their minimal impact on the performance of the original model. However, they require full access to the internals of that model. In practical scenarios, most deep learning models are provided as online services, where third-party users can access their prediction results only via an application programming interface (API) and have no direct access to their internal parameters. This severely limits the applicability of white-box methods. In contrast, black-box watermarking enables effective verification with only API-level access, thus aligning better with the prevalent service-oriented deployment models [13] and garnering more attention from both academia and industry. Black-box watermarking typically involves implanting a "backdoor" trigger during training, causing the model to produce a predetermined, anomalous output for specific inputs, thereby marking the model [15]. Adi et al. [16] used backdoor attacks to force a deep learning model to memorize specific patterns, representing an early and influential zero-bit black-box watermarking method. Le Merrer et al. [17] proposed embedding a watermark by fine-tuning a local region of the decision boundary. Shao et al. [18] proposed embedding a multibit watermark in the feature space of the target model. The limitation of black-box watermarking is the uncertainty inherent in its embedding process, which often makes it difficult to guarantee stable watermark activation across all input conditions. Furthermore, specifically designed trigger samples can perturb the original decision boundary of the model, leading to incorrect predictions for normal inputs and thus degrading performance on the primary task.
To address the aforementioned limitations, a model watermarking method based on an orthogonal feature space is introduced in this paper. The core idea is to enforce an orthogonality constraint between the watermark-related features and the original task-decision features of the target model, promoting a high degree of linear independence between them in the feature space. This strategy is inspired by the effectiveness of orthogonality in information representation separation and task decoupling scenarios, and it helps minimize the interference imposed by the watermarking process on the primary task performance of the model, thereby increasing the watermark embedding success rate while reducing the induced performance loss. The main contributions of this paper are as follows.
We propose a harmless black-box watermarking method named orthogonal feature space watermarking (OFSW). In this method, we transform the feature representations of specific trigger samples into a watermark by adding a watermark-related constraint to the loss function. Simultaneously, we introduce an orthogonal regularization term to the loss function, which is designed to maintain orthogonality between the watermark features and the original task features of the target model. Owing to this regularization scheme, the watermark embedding process minimally interferes with the ability of the model to classify normal samples, thus preserving its performance in the primary task.
We apply an orthogonalization-promoting constraint to the parameter matrices of the model. This strategy further reduces the impact of OFSW on the standard predictions yielded by the model while simultaneously improving the success rate of watermark embedding.
We conduct extensive experiments on the ResNet-18 and ResNet-101 models, comparing OFSW with the existing watermarking techniques. The results demonstrate that OFSW has significant advantages in terms of both watermark effectiveness and its harmlessness to the target model.
2. Related Works
The existing model watermarking methods can be categorized into two types: white-box methods and black-box methods. This section provides a systematic review of white-box and black-box watermarking techniques as well as multibit schemes and their limitations, thereby establishing the research background.
White-box watermarking methods assume full access to the target model, including its architecture, parameters, and activation maps, during both the embedding and verification processes. When a watermark is embedded into a DNN, the model owner typically modifies the model parameters directly to insert the watermark [10,19]. For example, Uchida et al. [12] proposed a method that embeds a watermark by fine-tuning the target model with a watermark regularization term in the loss function. Watermarks can also be embedded by adjusting the model architecture [19,20], embedding external features [21], introducing a transposed model [22], using activation maps [13,23], or adding passport layers [10,23]. Similarly, during the verification process, the verifier is assumed to have full access to the model parameters. However, this assumption is often impractical in real-world applications, as most models are accessed via APIs. Therefore, the applicability of white-box methods is severely limited.
Black-box watermarking methods assume that only the output of a suspicious model can be observed during verification; they do not require direct access to the internal structure of the model. These methods are typically implemented using backdoor attack mechanisms [24,25]. The model owner implants a set of "trigger samples" during the training phase, causing the DNN to produce a predefined, anomalous output when it encounters these specific triggers [26,27]. The trigger samples are proprietary, confidential data known only to the model owner. To verify ownership, the trigger set is fed to the model; if the model produces the expected exclusive response, ownership is confirmed. Black-box methods offer excellent task adaptability and deployment flexibility. They have been widely used in traditional image classification scenarios [26,28] and have also been successfully extended to other domains, including federated learning [29,30], text generation [31,32], and prompt engineering [33,34], achieving security goals such as ownership verification, infringement detection, and accountability.
Black-box watermarking methods can be further classified into zero-bit and multibit methods according to the amount of information that is embedded. Zero-bit methods only indicate the presence or absence of a watermark and do not store additional information: the model is trained to produce a fixed, anomalous response to a specific trigger set during embedding [35], and the presence of the watermark is confirmed during verification if the misclassification rate exceeds a predefined threshold [36]. Zero-bit schemes are simple to implement, have low costs, and are widely applicable, but they cannot carry identity information [37] and are vulnerable to adversarial attacks [31]; these weaknesses require mitigation through other techniques. Multibit methods aim to embed a string of information, typically copyright identifiers such as digital signatures or owner identities. BlackMarks [38] encodes a bit value (0 or 1) for each possible output class and then generates a set of key-image–label pairs on the basis of a predetermined binary signature; the target model is fine-tuned to embed the specific behaviours associated with these pairs. Explanation as a Watermark (EaaW) [18] embeds a multibit watermark into the feature space of the model without altering the original predictions produced by the model for those samples; the watermark is embedded in the explanation output of the model, making this approach both stealthy and harmless. Compared with zero-bit watermarking, multibit methods offer significant advantages in terms of functionality and security. However, the existing research still faces limitations and challenges: improving the watermark embedding success rate often reduces accuracy on the primary task, necessitating a trade-off between the two goals [39].
In this paper, we embed a multibit watermark in the feature space of a model to address these limitations. We introduce an orthogonal regularization term that forces the watermark to be nearly orthogonal to the primary task direction, thereby minimizing interference with the decision boundary. This approach ensures reliable watermark embedding while preserving the normal predictive performance of the developed model.
3. Algorithm
This section details the design philosophy, overall framework, and specific implementation of OFSW, presenting and formalizing the OFSW algorithm and introducing its core methodology.
3.1. Overall Framework of OFSW
This section decomposes and elucidates the three modules of OFSW as well as the statistical methods for orthogonality.
The basic idea of the OFSW model is as follows. We represent a machine learning model as spanning a function space, and using orthogonalization, we identify an orthogonal complement space in which we embed our watermark. Owing to the orthogonality between the complement space and the function space, modifying the position of an input variable in the orthogonal complement space does not alter its position in the function space, thereby effectively reducing the impact on the model itself that commonly arises in black-box watermarking techniques during the embedding process. In the watermark embedding stage, we transform the watermark into different feature representations within the feature space. To achieve this, we employ convolutional kernels derived from multiple perspectives. Unlike conventional approaches, we do not directly relabel trigger samples to designated watermark classes, which could mislead the model. Instead, we introduce a kernel orthogonalization module to mitigate the impact of watermark embedding on model performance. This module applies orthogonal constraints to the convolutional kernels and fully connected layer weights during training.
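As a simple illustration of how such a constraint can be imposed during training, the following PyTorch sketch adds a soft orthogonality penalty over the reshaped convolutional kernels and fully connected weights of a model. The function name and the particular penalty form (the deviation of the normalized Gram matrix from the identity) are illustrative assumptions; the angle-based regularizer actually used by OFSW is formalized in Section 3.2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_orthogonality_penalty(model: nn.Module) -> torch.Tensor:
    """Illustrative soft orthogonality penalty (not the exact OFSW regularizer).

    Each Conv2d/Linear weight tensor is flattened so that every row is one
    kernel; the penalty measures how far the normalized Gram matrix of these
    rows is from the identity, i.e., how far the kernels are from being
    mutually orthogonal.
    """
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            w = module.weight.reshape(module.weight.shape[0], -1)  # one kernel per row
            w = F.normalize(w, dim=1)                              # unit-norm rows
            gram = w @ w.t()                                       # pairwise cosines
            eye = torch.eye(gram.shape[0], device=gram.device)
            penalty = penalty + ((gram - eye) ** 2).sum()
    return penalty

# Usage inside a training step (the weighting coefficient is hypothetical):
# loss = task_loss + 1e-4 * soft_orthogonality_penalty(model)
```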
The OFSW framework consists of three main processes: watermark embedding, watermark extraction, and identity verification.
Figure 1 provides a brief illustration of this workflow. Watermark embedding involves adding specific and imperceptible identification information to the target model, which serves as proof of its ownership or origin. Watermark extraction analyses the model’s inputs and outputs to identify and verify the embedded watermark information, thereby confirming the model’s provenance. Identity verification uses the extracted watermark to validate the model’s legitimacy and the identity of its rightful owner, ensuring that the model is not misused or tampered with without authorization.
The OFSW framework is composed of three core modules: (1) the orthogonalization module, (2) the watermark interpretation module, and (3) the watermark comparison module.
During the embedding stage, the orthogonalization module and the watermark interpretation module provide critical gradient information, leading to the construction of a watermarked model whose parameter matrix spans a function space orthogonal to that spanned by the predefined convolutional kernels.
In the extraction stage, we input the predefined trigger samples into the watermarked model. The outputs, together with the predefined orthogonal convolutional kernels, are fed into the watermark interpretation module to compute the watermark. Specifically, the outputs serve as y, and the orthogonal convolutional kernels, once expanded and concatenated, form x. The watermark is then obtained as the weight matrix from the ridge regression of x and y, where negative coefficients represent 0-bits and positive coefficients represent 1-bits.
In the verification stage, the watermark comparison module calculates the similarity between the extracted watermark and the originally embedded watermark, and the p-value of a chi-square distribution is used as the evaluation metric.
Next, the specific algorithms developed for these three components and the implementation details of OFSW are elaborated upon.
3.2. Watermark Embedding
This section presents the multi-objective loss function, including the soft orthogonal regularization and watermark embedding terms, together with the training procedure, thereby defining the embedding mechanism.
Figure 2 illustrates the workflow of the proposed method. For a given target model, we first complete its standard training process. A set of trigger samples is then selected from the dataset, serving as the foundation for subsequent ownership verification. If the model watermark is considered a lock imposed on the model, the trigger samples can be viewed as the key to that lock. During the watermark extraction stage, which is used for ownership authentication, both the trigger samples and the extraction method must be provided to retrieve the embedded watermark from the model. We then adopt a multi-objective optimization strategy that minimizes the prediction loss on the training samples while applying orthogonal constraints in the feature space to reserve a subspace orthogonal to the feature directions of the trigger samples and maximizing the projection of the watermark vector onto this orthogonal subspace. This ensures that the embedded watermark can be reliably extracted.
To compute the k-bit watermark for embedding purposes, we design a watermark explanation module for the target model. First, for each trigger sample, we construct k mutually orthogonal convolutional kernels and perform multiperspective convolution on the input image to obtain feature representations corresponding to the dimensionality of the watermark. These features are then fed into the model to obtain a set of evaluation vectors that measure the importance of each perspective based on the degree of matching between the output and the true labels. Subsequently, we linearly fit the evaluation vectors and the convolutional features to obtain a weight matrix, which is then quantized to 0/1 on the basis of the sign of the weights (see
Section 3.3 for calculation details). Finally, we save the model parameters after the watermark embedding, the trigger sample set, and the watermark vector to provide a complete basis for the subsequent watermark verification and ownership claim stages.
During the watermark embedding phase, the model owner embeds the watermark by fine-tuning a pretrained model. Concurrently, the owner must ensure that the performance of the model is minimally affected after the embedding process. To better balance the trade-off between the embedding success rate and model performance, the owner should strive to ensure that the feature space of the model and the watermark embedding space are mutually orthogonal.
Building upon the above objectives, we define the watermark embedding task as a multitask optimization problem with three goals: (1) preserving the original task of the target model; (2) enforcing orthogonality between the function space spanned by the parameter matrices of the target model and the function space spanned by our predefined convolutional kernels; and (3) ensuring that the weights obtained via ridge regression between the outputs of the trigger samples and our predefined kernels match the embedded watermark symbols. Based on these considerations, we propose the loss function shown in Equation (1):

L(θ) = L1(θ; X, Y, X_T, Y_T) + r1 · L2(θ) + r2 · L3(θ), (1)

where θ represents the parameters of the target model, X denotes the clean samples, X_T denotes the trigger samples, and Y and Y_T are their corresponding labels. The orthogonality() function, which underlies L2, is used to evaluate the orthogonality of the parameters contained in the target model, and the extract() function, which underlies L3, is used to extract the watermark from the target model. W is the embedded watermark, and r1 and r2 are weighting coefficients.
Equation (1) consists of three parts.
The first part, L1, is the loss function of the initial deep neural network. This ensures that the predictions produced for both the clean dataset and the trigger set are consistent with their corresponding labels, thus preserving the functionality of the model.
The second part, L2, is aimed at promoting orthogonality among the model kernels. Intuitively, orthogonal kernels can better span the parameter space, especially in high-dimensional cases where the kernel dimensionality is greater than the number of kernels. Inspired by Ziming Zhang et al. [40], we first approximate the angles between the kernels in each hidden layer using the kernel responses of the model. We subsequently drive the mean and variance of these angles towards 90° and 0°, respectively. This serves as the orthogonal regularization term, as shown in Equation (2), which involves the pool of pairwise kernel angles in the i-th hidden layer together with importance weights for its two parts. The first term is a weaker orthogonal regularization term, which is aimed at driving the mean angle between all pairs of weight matrices towards 90°. The second term is stricter, aiming to drive both the means and the variances of the angles between all pairs of weight matrices towards 90° and 0°, respectively. Research conducted by Vorontsov et al. [41] indicates that imposing hard orthogonality constraints on neural networks can reduce their convergence speeds and harm their performance, whereas soft orthogonality constraints can improve the training process. Considering both accuracy and computational speed, we set the two importance weights accordingly in this paper.
In our algorithm, we use Equation (3) to estimate ϑ. Owing to the linear transformation step, the means and variances corresponding to 90° and 0° in the kernel angle space are both mapped to 0 in the ϑ space. In Equation (3), tanh() is an entrywise function, γ is a scalar parameter, w_i is the i-th kernel vector in a hidden layer (the i-th kernel of the α-th layer when the layer index is made explicit), x is the input of that hidden layer, N is the number of layers contained in the model, and y_i is the output produced after the computation of the i-th kernel.
To establish that the batch response products indeed estimate the kernel angles in this way, we first prove the following lemma.
Lemma 1. Without loss of generality, let θ be the angle between two vectors w_1 and w_2, and let B^d be the unit ball in d-dimensional space. We then have

E_{x∼B^d}[sign(w_1^T x) · sign(w_2^T x)] = 1 − 2θ/π,

where E[·] is the expectation operator, the sample x is uniformly drawn from B^d, and sign() is the sign function returning 1 for positives and −1 otherwise.

Proof. Let u_1 = w_1/‖w_1‖ and u_2 = w_2/‖w_2‖. Since scaling does not change the sign, sign(w_i^T x) = sign(u_i^T x). Sampling uniformly from the unit ball and choosing directions uniformly from the unit sphere are equivalent at the "sign" level (as the radial length does not affect the sign). Hence, the expectation can be equivalently viewed as uniform sampling on the unit sphere. Now, restrict attention to the two-dimensional plane spanned by u_1 and u_2 and consider selecting a normal-vector direction uniformly from the unit circle in this plane. The vectors u_1 and u_2 are separated by a hyperplane passing through the origin if and only if the normal vector falls within two arc segments of total length 2θ. Therefore, the probability that sign(w_1^T x) and sign(w_2^T x) differ is 2θ/(2π) = θ/π, so the expectation of their product equals (1 − θ/π) − θ/π = 1 − 2θ/π, which is equivalent to Equation (6). We then complete our proof. □
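The relation stated in Lemma 1 can also be checked numerically. The short NumPy experiment below samples points uniformly from the unit ball and compares the empirical expectation of the sign products with 1 − 2θ/π; the dimensionality, sample count, and random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 16, 200_000

# Two fixed directions with a known angle theta between them.
w1 = rng.standard_normal(d)
w2 = rng.standard_normal(d)
theta = np.arccos(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2)))

# Uniform samples from the unit ball: uniform direction times radius^(1/d).
x = rng.standard_normal((n_samples, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)
x *= rng.uniform(0.0, 1.0, size=(n_samples, 1)) ** (1.0 / d)

empirical = np.mean(np.sign(x @ w1) * np.sign(x @ w2))
predicted = 1.0 - 2.0 * theta / np.pi
print(f"empirical = {empirical:.4f}, predicted = {predicted:.4f}")
```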
The third part, L3, focuses on the watermark embedding success rate. Here, we choose a hinge-like loss, which is commonly used in computer vision tasks, as shown in Equation (7).
In Equation (7), the i-th bit of the watermark extracted from the model is compared with the i-th bit of the watermark that we embed; k is the number of bits contained in the watermark, and ε is a control variable.
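To illustrate how the three terms of Equation (1) can be assembled in a training step, the following PyTorch sketch combines the task loss, the orthogonality regularizer, and a hinge-like watermark term. The particular hinge form (a margin ε on the product of the embedded ±1 bit and the raw extracted coefficient) is one plausible instantiation of Equation (7), and the helper functions, their signatures, and the coefficient values are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def hinge_watermark_loss(extracted: torch.Tensor,
                         watermark_bits: torch.Tensor,
                         eps: float = 0.1) -> torch.Tensor:
    """Hinge-like term over the k raw (pre-binarization) extracted coefficients.

    watermark_bits holds the embedded bits in {0, 1}; they are mapped to
    {-1, +1} so that a correctly signed coefficient with margin >= eps
    contributes zero loss.  This form is an illustrative assumption.
    """
    signs = 2.0 * watermark_bits.float() - 1.0
    return torch.clamp(eps - signs * extracted, min=0.0).mean()

def ofsw_training_loss(model, x, y, x_trigger, y_trigger,
                       extract_fn, ortho_fn, watermark_bits,
                       r1=0.1, r2=1.0):
    """Equation (1): task loss + r1 * orthogonality term + r2 * watermark term.

    extract_fn and ortho_fn stand in for the extraction and orthogonality
    modules described in the text; their exact signatures are hypothetical.
    """
    l1 = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_trigger), y_trigger)
    l2 = ortho_fn(model)                               # e.g. the angle-based term above
    l3 = hinge_watermark_loss(extract_fn(model, x_trigger), watermark_bits)
    return l1 + r1 * l2 + r2 * l3
```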
3.3. Watermark Extraction
This section introduces the linear interpretation module and ridge regression decoding, presents the corresponding pseudocode, and provides the decoding procedure for black-box watermark verification.
In this stage, we design a linear explanation module for extracting the watermark. We use k mutually orthogonal convolutional kernels to perform multiperspective convolution on the samples in the trigger set. This generates a representation dataset with the same dimensionality as that of the watermark. This representation dataset is then fed into the target model along with the true labels. We compute an evaluation vector that reflects the importance of each convolutional perspective. Next, we use ridge regression to fit the evaluation vector and the convolutional kernels. We binarize the resulting weight matrix. This finally yields our embedded k-bit binary watermark.
Figure 3 shows the main flow of the algorithm. It consists of three steps: (1) constructing the representation dataset; (2) evaluating the model predictions; and (3) performing linear fitting.
The specific implementation is described below.
1. Constructing the Representation Dataset
For each image X_T in the trigger set, to extract a k-bit watermark, we need to obtain feature maps from k different perspectives. Therefore, we randomly generate k d-dimensional vectors and orthogonalize them to obtain k mutually orthogonal convolutional kernels. Then, we perform a convolution operation on each image with these kernels to obtain the representation dataset X_p. Thus, for each image included in the trigger set, we obtain a representation dataset whose elements match the dimensionality n of the input vector employed by the target model.
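The following sketch shows one way to construct the k orthogonal kernels and the representation dataset: k random d-dimensional vectors are orthogonalized with a QR decomposition, and each resulting kernel is convolved with a trigger image. The kernel spatial size, channel layout, and all tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def make_orthogonal_kernels(k: int, channels: int, size: int) -> torch.Tensor:
    """Draw k random kernels and orthogonalize their flattened
    d = channels * size * size representations via a QR decomposition."""
    d = channels * size * size
    assert k <= d, "cannot have more mutually orthogonal kernels than dimensions"
    random_vectors = torch.randn(d, k)
    q, _ = torch.linalg.qr(random_vectors)   # columns of q are orthonormal
    return q.t().reshape(k, channels, size, size)

def representation_dataset(trigger_image: torch.Tensor,
                           kernels: torch.Tensor) -> torch.Tensor:
    """Convolve one trigger image (C, H, W) with the k orthogonal kernels,
    producing k single-channel feature maps (the multiperspective views)."""
    x = trigger_image.unsqueeze(0)           # (1, C, H, W)
    pad = kernels.shape[-1] // 2             # keep the spatial size unchanged
    return F.conv2d(x, kernels, padding=pad).squeeze(0)   # (k, H, W)

# Example with hypothetical sizes: 32 kernels of shape (3, 5, 5) on a 32x32 image.
kernels = make_orthogonal_kernels(k=32, channels=3, size=5)
views = representation_dataset(torch.randn(3, 32, 32), kernels)
print(kernels.shape, views.shape)  # torch.Size([32, 3, 5, 5]) torch.Size([32, 32, 32])
```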
2. Generating Evaluation Vectors for the Model Predictions
We input the representation dataset X_p into the target model to obtain the prediction result p = M(X_p; θ), which is a matrix with m columns, where m is the length of the output vector. Then, we introduce an evaluation function, Evaluation(), as shown in Equation (8), to measure the consistency between this output vector and the target watermark vector. In Equation (8), Y_W is the evaluation vector, Y_T is the corresponding label for X_T, 1(·) denotes the indicator function, j indexes the classification classes, and p_{i,j} is the j-th element of the model output for the i-th input image in the representation dataset.
3. Explaining the Watermark
After calculating the evaluation vector Y_W, we need a linear fitting method to quantify the contribution of each representation image contained in the representation dataset. We use ridge regression to solve for the weights W. Treating Y_W as the response variable y and the stacked column vectors of the flattened orthogonal convolutional kernels as the independent variable matrix X, the weight W derived from the ridge regression equation is the embedded watermark information. The ridge regression objective is shown in Equation (9):

W = argmin_W ||Y_W − X·W||² + λ||W||², (9)

where λ is a hyperparameter and W is the weight matrix obtained from X and Y_W through ridge regression. Finally, we extract the watermark from the model using the closed-form solution in Equation (10):

W = (X^T X + λI)^(−1) X^T Y_W, (10)

where I is the k × k identity matrix. Through these steps, we fit the linear relationship between X and Y_W, and the binarized weight matrix becomes our watermark. The pseudocode for the watermark extraction algorithm is given in Algorithm 1.
Algorithm 1. Watermark extraction algorithm.
Input: Trigger samples X_T, Y_T; the watermarked model M with parameters θ; orthogonal convolutional kernels K.
Output: A k-bit vector Ŵ representing the extracted watermark.
1: X_p = K * X_T
2: p = M(X_p; θ)
3: Y_W = Evaluation(p, Y_T)
4: W = (X^T X + λI)^(−1) X^T · Y_W, where X is the matrix of flattened kernels
5: Ŵ = zero_like(W)
6: for i = 0 to k − 1 do
7:   if W_i ≥ 0 then
8:     Ŵ_i = 1
9:   else
10:    Ŵ_i = 0
11: return Ŵ
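The following NumPy sketch mirrors Algorithm 1. The evaluation step (taking the score that the model assigns to the trigger sample's true class from each perspective) is one plausible reading of Equation (8), the model is abstracted as a callable, and all tensor shapes are illustrative assumptions rather than the paper's exact specification.

```python
import numpy as np

def extract_watermark(model_fn, trigger_views, true_label, kernel_matrix, lam=1.0):
    """Extract a k-bit watermark (illustrative rendering of Algorithm 1).

    model_fn:       callable mapping a batch of inputs to class scores (n, m)
    trigger_views:  representation dataset built from one trigger sample
    true_label:     index of the trigger sample's true class
    kernel_matrix:  design matrix X of shape (n, k) whose columns are the
                    flattened orthogonal kernels (an assumed layout)
    """
    # Step 2: query the model with the representation dataset.
    p = model_fn(trigger_views)                       # (n, m) prediction matrix

    # Step 3: evaluation vector -- here, the true-class score per view
    # (an assumed instantiation of Equation (8)).
    y_w = p[:, true_label]                            # (n,)

    # Step 4: ridge regression of y_w on the kernel design matrix X.
    X = kernel_matrix
    k = X.shape[1]
    w = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y_w)

    # Steps 5-11: binarize by sign.
    return (w >= 0).astype(np.uint8)                  # k-bit watermark

# Toy usage with a random "model" and random data (shapes are hypothetical):
rng = np.random.default_rng(0)
n, k, m = 64, 32, 10
bits = extract_watermark(lambda v: rng.standard_normal((v.shape[0], m)),
                         rng.standard_normal((n, 5)), 3,
                         rng.standard_normal((n, k)))
print(bits.shape)  # (32,)
```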
3.4. Identity Verification
This section presents a chi-square-test–based protocol for determining watermark consistency, thereby verifying ownership.
The final part of the OFSW method involves using the embedded and extracted watermarks to verify the identity of the model owner. When a model owner encounters a suspicious model, they can use the trigger set and the set of orthogonal convolutional kernels to extract a watermark from that suspicious model. To measure the similarity between the extracted watermark and the original watermark, we use Pearson's chi-square test [42]. If the resulting p-value is below a control parameter, the two watermarks are considered the same, and the model owner can claim that their model has been plagiarized or used without authorization.
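One possible realization of this test with SciPy is sketched below; the 2 × 2 contingency-table construction and the significance threshold are illustrative assumptions, since only the use of Pearson's chi-square test on the two watermarks is prescribed above.

```python
import numpy as np
from scipy.stats import chi2_contingency

def verify_ownership(extracted_bits, embedded_bits, alpha=0.01):
    """Compare two k-bit watermarks with Pearson's chi-square test.

    A 2x2 contingency table counts how often (extracted, embedded) bit pairs
    agree or disagree; a p-value below alpha is taken as evidence that the
    extracted watermark matches the embedded one.
    """
    e = np.asarray(extracted_bits, dtype=int)
    w = np.asarray(embedded_bits, dtype=int)
    table = np.zeros((2, 2), dtype=int)
    for a, b in zip(e, w):
        table[a, b] += 1
    _, p_value, _, _ = chi2_contingency(table)
    return p_value, p_value < alpha

# Toy example: 64 embedded bits, extracted copy with 3 flipped bits.
rng = np.random.default_rng(0)
embedded = rng.integers(0, 2, size=64)
extracted = embedded.copy()
extracted[:3] ^= 1
print(verify_ownership(extracted, embedded))
```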
5. Discussion and Analysis
This section presents a robustness analysis of OFSW under adaptive attacks and a cross-modal (GPT-2) case study to evaluate its practicality and transferability.
To assess OFSW's practicality and transferability, we organize the discussion around two themes: "robustness" and "cross-modal generalization." First, at the parameter level, we construct three representative adaptive attacks (fine-tuning, overwriting, and unlearning) to systematically test whether an adversary without access to the trigger samples and/or with only partial knowledge of the target watermark can simultaneously preserve primary-task performance (Test Acc.) and effectively decrease the watermark's statistical significance (p-value) and watermark success rate (WSR); see Table 25 for the results. Next, we extend OFSW from the image domain to text generation: using GPT-2 as the backbone, we embed multibit watermarks (32/64/128) on WikiText, BookCorpus, PTB-text-only, and LAMBADA and evaluate the method's overhead and verifiability in language models via the perplexity (PPL), p-value, and WSR; see Table 26. Together, these two parts address two core questions: (i) whether OFSW remains reliably provable under realistic attacks and (ii) whether OFSW's design maintains stable watermark detectability and usability when it is transferred across modalities.
(A) Effect of different attacks on watermarks embedded by OFSW. In real deployments, a model-stealing adversary may become aware of the watermark and design adaptive attacks to evade or weaken verification. Concretely, they modify model parameters to affect watermark embedding and extraction. Existing watermark-breaking techniques fall into three broad categories: (1) fine-tuning attack; (2) overwriting attack; and (3) unlearning attack.
Scenario 1 (Fine-Tuning Attack).
Assume that the adversary knows the overall OFSW pipeline but not the trigger samples used by the model owner nor the target watermark. Without introducing any watermark-related objective, the adversary continues training the stolen model on clean data (in-domain or cross-domain), optionally using common tricks such as early stopping, weight decay, and data augmentation. The goal is to slightly perturb the parameters, and hence the corresponding explanations, to weaken the alignment between the model explanations and the original watermark while largely preserving the primary task. We refer to this lightweight, parameter-level modification as a fine-tuning attack.
Scenario 2 (Overwriting Attack).
Assume that the adversary knows the OFSW pipeline but not the original trigger samples or target watermark. The adversary independently generates a new set of trigger samples and a new watermark and then optimizes a watermark-related loss on the stolen model in an attempt to write in the new watermark and overwrite the original signal, thereby rendering the original watermark invalid at verification. This adaptive strategy is termed an overwriting attack.
Scenario 3 (Unlearning Attack).
Assume that the adversary knows the embedded target watermark but still does not know the trigger samples. The adversary randomly selects or synthesizes substitute trigger samples and updates the model in the direction opposing the watermark gradient (reducing the watermark score/margin) to actively “unlearn” the original watermark while maintaining primary-task performance. This assumption is realistic: target watermarks are often guessable (e.g., a company logo or a developer’s avatar). This type is referred to as an unlearning attack.
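For concreteness, the following PyTorch fragment sketches the parameter update behind such an unlearning attack: the adversary descends on a task loss while ascending on an estimated watermark loss computed from substitute trigger samples. The loss functions, substitute data, and step size are placeholders for the adversary's own choices and are not specified by our threat model.

```python
import torch

def unlearning_attack_step(model, optimizer, task_batch, substitute_triggers,
                           task_loss_fn, watermark_loss_fn, beta=1.0):
    """One adaptive 'unlearning' update: descend on the task loss while
    ascending on an estimated watermark loss (both loss callables are
    supplied by the adversary; their exact forms are not prescribed here)."""
    x, y = task_batch
    optimizer.zero_grad()
    loss = task_loss_fn(model(x), y) - beta * watermark_loss_fn(model, substitute_triggers)
    loss.backward()
    optimizer.step()
    return loss.item()
```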
Across the three representative adaptive attacks of fine-tuning, overwriting, and unlearning (Table 25), the OFSW watermark is not effectively removed; only different degrees of performance degradation and decreases in the matching rate are observed.
Fine-tuning setting (continued training on clean data only, with no watermark objective): The test accuracy is 85.63% (down 4.21 pp from 89.84% before embedding), the original watermark retains WSR = 0.887, and the p-value remains extremely low, indicating that light parameter perturbations struggle to undermine the watermark's statistical significance.
Overwriting setting (an attacker-chosen watermark that conflicts with the original is written into the victim model; convergence after 10 epochs): The accuracy decreases, yet the original watermark remains extractable, so the outcome is the "coexistence of two watermarks" rather than true "overwriting."
Unlearning setting (updates taken opposite the watermark gradient): The accuracy is 85.27% (down 4.57 pp), with WSR = 0.872 for the original watermark, and the p-value still remains extremely low.
Overall, none of the three attacks can remove the OFSW watermark without incurring substantial primary-task degradation.
(B) Extending from images to text.
To transfer our method from the image domain to text, we replace the 2D convolutions in the network with 1D convolutions along the sequence dimension. Concretely, a sentence is mapped to an L × d matrix (sequence length L and embedding dimension d; each row is a token vector). A shared convolutional kernel of width k slides along the sequence, computing at position i a weighted combination of the token vectors x_i, …, x_{i+k−1}. This design transfers the inductive biases of image convolutions to text, namely local receptive fields, parameter sharing, and translation invariance, effectively capturing n-gram patterns while retaining linear time complexity and good throughput. Aside from modality-specific form factors, the training objectives and watermark embedding pipeline match the image setup.
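A minimal PyTorch illustration of this substitution is given below: the token embeddings of a sentence are treated as a d-channel one-dimensional signal and convolved with shared kernels of width k along the sequence dimension. The embedding size, kernel width, and number of kernels are arbitrary example values.

```python
import torch
import torch.nn as nn

L, d, k, num_kernels = 128, 768, 5, 32   # example sizes only

tokens = torch.randn(L, d)               # one sentence: L token vectors of size d
x = tokens.t().unsqueeze(0)              # (1, d, L): channels = embedding dimensions

conv1d = nn.Conv1d(in_channels=d, out_channels=num_kernels,
                   kernel_size=k, padding=k // 2)
features = conv1d(x)                     # (1, num_kernels, L): n-gram-like responses
print(features.shape)                    # torch.Size([1, 32, 128])
```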
We use GPT-2 [45] as a case study for applying OFSW to text generation models, chosen because it is a representative open-source transformer and many stronger LMs share similar architectures. We fine-tune GPT-2 and embed multibit watermarks on WikiText [46], BookCorpus [47], PTB-text-only [48], and LAMBADA [49]. Specifically, we randomly select a training sequence as the trigger sample and randomly generate a k-bit string as the watermark, with the lengths set to 32, 64, and 128.
After 32/64/128-bit watermarks are embedded into GPT-2 on WikiText, BookCorpus, PTB-text-only, and LAMBADA, all 12/12 experiments achieve WSR = 1, and the p-values are consistently very small. Relative to the "No WM" baseline, the absolute PPL increments range from +0.56 to +3.58 (average +1.95 across the four datasets) for 32 bits, from +1.63 to +5.27 (average +3.12) for 64 bits, and from +5.88 to +9.49 (average +7.25) for 128 bits. Among the datasets, BookCorpus shows the smallest increases at 32/64 bits, whereas PTB-text-only shows a comparatively larger increase at 128 bits. Overall, the security indicators saturate (WSR = 1, extremely small p-values), and the PPL degradation is positively correlated with watermark length: 32/64 bits yield only light-to-moderate increases in perplexity in most scenarios, whereas 128 bits provide stronger separability at a higher cost. In sum, OFSW exhibits strong practical effectiveness and broad applicability.
Synthesizing the three adaptive attacks and the cross-modal experiments, OFSW strikes a favourable balance between statistical detectability (very small p-values and a stable WSR) and functionality preservation (controlled accuracy/PPL cost). Even under practically feasible parameter-level attacks—fine-tuning, overwriting, and unlearning—the original watermark generally cannot be removed without noticeably sacrificing primary-task performance. Moreover, after extending 2D convolutions to 1D sequence convolutions, the method shows equally stable detectability on GPT-2, confirming OFSW’s transferability and generality.
Consider the algorithmic complexity of OFSW. The time complexity has two main components. The first is the standard convolution operations: one layer with F kernels, each with d parameters, applied over D samples, costs on the order of O(D · F · d). The second is the extra overhead of the soft-orthogonality constraint, which requires computing the angles from the batch response matrix of the kernels; this is equivalent to one matrix multiplication with complexity O(D · F²). Hence, the overall time complexity is O(D · F · (d + F)).
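As a back-of-the-envelope check of this cost model, the following snippet evaluates the two terms for hypothetical layer and batch sizes; the numbers are purely illustrative.

```python
# Back-of-the-envelope check of the two cost terms (all sizes hypothetical).
D = 128          # samples in a batch
F = 64           # kernels in the layer
d = 3 * 3 * 64   # parameters per kernel (3x3 kernel, 64 input channels)

conv_cost = D * F * d        # standard convolution responses
ortho_cost = D * F * F       # response-matrix product for pairwise angles

print(f"convolution term   ~ {conv_cost:,} MACs")   # ~4,718,592
print(f"orthogonality term ~ {ortho_cost:,} MACs")  # ~524,288
```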
To embed a watermark while preserving model utility as much as possible, we preferentially embed it into the orthogonal complement of the model’s representation function space. Our current implementation selects orthogonal convolution kernels before training; since orthogonal kernels may not perfectly match the characteristics of the trigger samples, future work can introduce automated kernel selection/optimization modules and explore additional optimization techniques to further reduce the overhead. Notably, although the impact of OFSW on the model is typically negligible, it is intrinsically an intrusive watermark. Motivated by this, we plan to explore non-intrusive watermarking schemes in future work to achieve verifiability and traceability with zero or minimal modifications to the protected model.