Article

Enhancing the Sustained Capability of Continual Test-Time Adaptation with Dual Constraints

School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(19), 3891; https://doi.org/10.3390/electronics14193891
Submission received: 1 September 2025 / Revised: 28 September 2025 / Accepted: 29 September 2025 / Published: 30 September 2025
(This article belongs to the Special Issue Advances in Social Bots)

Abstract

Continual Test-Time Adaptation aims to adapt a source model to continuously and dynamically changing target domains. However, previous studies adapt to each target domain independently, treating domains as isolated and ignoring the interference and promotion between them, which limits the model’s sustained capability and often traps it in local optima. This study highlights this critical issue and identifies two key factors that limit the model’s sustained capability: (1) Parameter updates lack constraints: domain-sensitive parameters capture domain-specific knowledge, leading to unstable channel representations and interference from old domain knowledge, which hinders the learning of domain-invariant knowledge. (2) The decision boundary lacks constraints: distribution shifts, which carry significant domain-specific knowledge, cause features to become dispersed and prone to clustering near the decision boundary; this is particularly problematic during the early stages of a domain shift, when features are more likely to cross the boundary. To tackle these two challenges, we propose a Dual Constraints method. First, we constrain updates to domain-sensitive parameters by minimizing the representation changes in domain-sensitive channels, alleviating the interference among domain-specific knowledge and promoting the learning of domain-invariant knowledge. Second, we introduce a constrained virtual decision boundary that forces features away from the original boundary and creates a virtual margin, preventing features from crossing the decision boundary under the domain-specific knowledge interference caused by domain shifts. Extensive benchmark experiments show that our framework outperforms competing methods.

1. Introduction

Continual Test-Time Adaptation (CTTA) focuses on enhancing machine learning models’ ability to adapt continuously in dynamic environments where input data distributions shift over time. This capability is crucial in various real-world scenarios. For instance, in autonomous driving [1,2,3,4], a vehicle may encounter changing conditions such as transitioning from daylight to nighttime or from sunny to rainy weather. To maintain high performance, models must adapt effectively to these evolving data distributions through Continual Test-Time Adaptation.
Numerous methods have been developed to tackle the challenges of Continual Test-Time Adaptation, including the use of teacher–student models [5,6,7], data augmentation techniques [5], semi-supervised learning [8], Low-Rank Learning [9], sample replay [10], and Masked Autoencoders [11].
Although previous studies have achieved significant success, a key limitation must be pointed out: These studies still follow the approach of Test-Time Adaptation (TTA) [12], treating the problem of Continual Test-Time Adaptation in isolation and focusing solely on adapting to each domain independently. This limits the model’s sustained capability, similar to a greedy algorithm [13], causing the model’s performance to become trapped in local optima [14] and failing to achieve the ideal global optimum [15]. This is because these methods overlook the essential nature of Continual Test-Time Adaptation: in this setting, domain adaptation is dynamic, ongoing, and continuously evolving, resulting in different domains either interfering with or promoting each other.
Therefore, addressing the following two challenges is crucial for enhancing the sustained capability of Continual Test-Time Adaptation: (1) How to effectively alleviate the mutual interference of domain-specific knowledge. (2) How to effectively learn domain-invariant knowledge across domains.
This study revisits the differences between Continual Test-Time Adaptation and Test-Time Adaptation, focusing on the long-neglected issue of sustained capability, and identifies two key factors that severely limit the model’s sustained capability and cause it to become trapped in local optima:
  • Lack of constraints in parameter updates: The model contains a large number of domain-sensitive parameters, which tend to learn domain-specific knowledge. When the model encounters a new domain, the distribution difference between the old and new domains causes these domain-sensitive parameters to behave abnormally, leading to unstable channel representations that incorporate substantial domain-specific knowledge from previous domains, especially in the early stages of adapting to the new domain. This results in severe interference between domain-specific knowledge. More importantly, relying on domain-specific knowledge for classification significantly hinders the learning of domain-invariant knowledge.
  • Lack of constraints in decision boundary: Domain shifts carry a significant amount of domain-specific knowledge, causing the features generated by the model to spread out more. The lack of constraints makes these features prone to clustering near or crossing the decision boundary. Under the interference of domain-specific knowledge, features near the decision boundary are more likely to cross, particularly in the early stages of the new domain.
As shown in Figure 1, we analyze the channel representations under three different corruption levels (level-5, level-3, and level-1) of CIFAR10C (The method for calculating unstable channel representations is provided in Section 4.4.3). In Figure 1a, we observe that different levels of corruption have varying impacts on the stability of channel representations. The higher the level of corruption, the more unstable the channel representations become. The domain-sensitive parameters are highly sensitive to this domain-specific knowledge, resulting in anomalies and instability in the channel representations.
In Figure 1b, we observe that domain-sensitive channels are much more responsive to domain shifts than domain-robust channels. When encountering different domains, the representation of these channels tends to exhibit instability and are prone to significant variations. The cause of this abnormal behavior lies in the domain-sensitive parameters, which tend to overfit domain-specific knowledge such as background, lighting, and texture. When a domain shift occurs, this domain-specific knowledge undergoes drastic changes, leading to significant variations in the channel representations. The instability of channel representations and the presence of a large amount of domain-specific knowledge can cause significant interference in domain adaptation, especially in the early stages of domain shift. In contrast, domain-robust parameters focus on learning domain-invariant knowledge, such as contours and shapes. This domain-invariant knowledge remains unaffected by domain shifts, resulting in more stable channel representations.
Based on our observations in Figure 1, domain-sensitive channels exhibit instability and are prone to change. Therefore, we propose a parameter update constraint method that estimates the relationship between changes in channel representations and the loss increase caused by parameter updates, suppressing domain-sensitive parameters by minimizing the changes in domain-sensitive channel representations. This constraint not only effectively alleviates the interference caused by domain-specific knowledge but also promotes the learning of domain-invariant knowledge by reducing the sensitivity of the parameters. Additionally, we provide theoretical evidence that our method can effectively enhance the model’s generalization ability and promote the learning of domain-invariant knowledge.
Furthermore, as shown in Figure 2, domain shifts carry substantial domain-specific knowledge, and when the model has not yet adapted to the new domain, features are more susceptible to the effects of the domain shift, making them more likely to cross the original decision boundary. We introduce a virtual decision boundary that constrains the features generated by the model to move away from the original decision boundary, preventing them from clustering near it. This constraint also creates a virtual margin between two decision boundaries. During domain shifts, this virtual margin provides sufficient buffering to prevent features from crossing the original decision boundary. The strongly constrained virtual decision boundary effectively mitigates the interference caused by domain-specific knowledge in the early stages of domain adaptation.
Overall, we propose a Dual Constraints method that combines channel-based parameter constraints and feature-based virtual decision boundary constraints, effectively addressing the two major challenges of domain knowledge interference and learning domain-invariant knowledge, thereby enhancing the model’s sustained capability (more motivation and details are elaborated in Section 3.1). Our contributions can be summarized as follows:
  • We propose a novel parameter constraint method that minimizes the representation changes in domain-sensitive channels, suppressing the learning of domain-specific knowledge while enhancing the learning of domain-invariant knowledge. In addition, we theoretically prove that it can effectively enhance the model’s generalization ability.
  • We introduce a strongly constrained virtual decision boundary that creates a virtual margin, forcing features away from the original decision boundary, effectively mitigating the problem of features crossing the boundary during domain shifts.
  • Dual Constraints enhance the model’s sustained capability and achieve excellent performance, surpassing all existing state-of-the-art methods.

2. Related Work

2.1. Unsupervised Domain Adaptation

Unsupervised Domain Adaptation (UDA) [16] assumes that there is a domain shift between the source domain and the target domain, with labeled data available in the source domain but no labels in the target domain [17]. The goal is for the model to perform well on the unlabeled target domain data. UDA methods often use distribution distance metrics to align the feature distributions between the source and target domains during training. For example, Maximum Mean Discrepancy (MMD) [18,19,20] is a statistical test that assesses whether two distributions are equal based on observed samples from those distributions. Adversarial training [21,22,23] is another common approach, which involves aligning distributions using two adversarial roles: a domain classifier and a feature generator. Unlike traditional Unsupervised Domain Adaptation methods, Test-Time Adaptation (TTA) aims to adapt a model trained on a source domain to a new target domain without accessing the original source data during inference, and many methods have been proposed to solve TTA, such as TENT [12] and SHOT [24]. TENT [12] updates the trainable batch normalization parameters from a pretrained model by minimizing the entropy of the model’s predictions during testing. SHOT [24] combines entropy minimization and diversity regularization with label smoothing techniques to train a general feature extractor from the pretrained source model.

2.2. Domain Generalization

Domain generalization (DG) aims to extract knowledge from the source domain that can generalize well to unseen target domains. Many methods learn domain-invariant representations by aligning the distributions of the target and source domains. These methods include adversarial learning [25], causal learning [26], and meta-learning [27]. Another approach is to enhance the model’s generalization capability by generating more source domain data, specifically by augmenting the diversity of the source data through data augmentation [28]. These data augmentation techniques primarily involve style transfer [28,29] and pixel-level augmentation [30]. Although these methods have demonstrated promising results, they may still learn excessive domain-specific features, as they rely on implicit assumptions to remove domain-specific characteristics via image-level augmentation or model-level constraints. Some studies have pointed out that Convolutional Neural Networks (CNNs) tend to classify objects based on local texture features that contain domain-specific characteristics [31,32,33]. To address this, they propose using penalty loss functions to suppress the model from learning local features such as texture, background, and lighting. This penalty encourages the model to rely on global features for classification. In contrast to local features, global features, including shape and contours, remain more stable and less prone to changes across domains. Therefore, this penalty on local features forces the model to learn domain-invariant properties, thereby enhancing its generalization ability.

2.3. Continual Learning

Continual learning (CL) aims to enable models to learn new tasks from a continuously evolving data stream while retaining previously acquired knowledge. One of the main challenges is catastrophic forgetting, where models tend to forget prior knowledge when learning new tasks. To address this issue, several approaches have been proposed, including regularization methods [34], which protect important weights from excessive updates, architecture expansion methods [35], which adapt the model by expanding its structure to accommodate new tasks, and memory replay methods [36,37,38,39], which store and replay data from previous tasks to mitigate forgetting. Despite significant progress, continual learning still faces challenges related to computational and memory overhead and maintaining strong generalization performance under non-stationary data distributions.

2.4. Continual Test-Time Adaptation

Unlike traditional Test-Time Adaptation (TTA) [40,41,42], which assumes a fixed target domain and works in a source-free online manner, CTTA accounts for the dynamic nature of real-world data distributions. This means CTTA requires models to adapt online and source-free across evolving domains. CoTTA [5] is a leading approach in continual domain learning, using average teachers and data augmentation for reliable pseudo-labels and robust model updates. DSS [8], inspired by semi-supervised learning, employs FreeMatch [43] to generate label thresholds for filtering pseudo-labels. RoTTA [44] stabilizes Batch Normalization updates to mitigate batch size variations during adaptation. RMT [6] addresses asymmetry in cross-entropy in teacher–student models, proposing symmetric cross-entropy for better gradients. Reshaping [10] addresses the catastrophic forgetting problem using sample replay. MAE [11] leverages mask autoencoders to learn domain-invariant knowledge. Unlike previous work, this study focuses on the sustainability of continuous domain adaptation, aiming to find the global optimum rather than settling for local optima.

3. Method

3.1. Problem Setting

Given a pretrained model $M_0$ with initial parameters $\theta_0$, originally trained on source domain data $(X^S, Y^S)$, the objective of CTTA is to iteratively adapt the model $M$ to a sequence of target domain datasets. During this process, the source domain data $(X^S, Y^S)$ is not accessible, and the target domain datasets $\{X_0^T, X_1^T, \ldots, X_n^T\}$ are unlabeled. At each time step $t$, the parameters $\theta_t$ are updated to $\theta_{t+1}$ to better align the model $M_{t+1}$ with the current target domain $X_t^T$. For clarity, we define the model output as $\hat{y} = M(x) = G(Z) = G(F(x))$, where $F(\cdot)$ is the feature extractor. The feature map at the $\ell$-th layer is $Z_\ell = \{Z_{\ell,1}, Z_{\ell,2}, \ldots, Z_{\ell,C_\ell}\} = F_\ell(Z_{\ell-1})$, where $Z_\ell \in \mathbb{R}^{C_\ell \times H_\ell \times W_\ell}$. $G(\cdot)$ represents the classifier, and $W = \{W_1, W_2, \ldots, W_C\}$ denotes the weights of the classifier.
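The following minimal PyTorch sketch illustrates this setting; the toy `SourceModel`, its `F`/`G` split, and the `adapt_step` callback are hypothetical names used only to mirror the notation $\hat{y} = G(F(x))$ and the sequential arrival of unlabeled target domains, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SourceModel(nn.Module):
    """Toy stand-in for the pretrained source model: y_hat = M(x) = G(F(x))."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.F = nn.Sequential(                      # feature extractor F(x)
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.G = nn.Linear(16, num_classes)          # classifier G with weights {W_1, ..., W_C}

    def forward(self, x):
        z = self.F(x)                                # features Z
        return self.G(z)                             # logits y_hat

def continual_adaptation(model, target_loaders, adapt_step):
    """Adapt the same model online to a sequence of unlabeled target domains X_0^T, ..., X_n^T."""
    for loader in target_loaders:                    # domains arrive sequentially; source data unavailable
        for x in loader:                             # batches are unlabeled at test time
            adapt_step(model, x)                     # e.g., one dual-constraint update (Section 3.4)
    return model
```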

3.2. Domain-Sensitive Parameter Suppression

3.2.1. Motivation of Domain-Sensitive Parameter Suppression

Existing methods typically use the following objective function to optimize model parameters for domain adaptation during testing:
$$\min_{M_{t+1}} \; \mathcal{L}\big(M_{t+1}(X_t^T),\, \hat{Y}_t^T\big) \tag{1}$$
where at time $t$, the target domain data $X_t^T$ is encountered, and the model parameters are updated based on the loss function $\mathcal{L}(\cdot)$ to form the new model $M_{t+1}$. It is important to note that before any parameter update at time $t$, $M_{t+1} = M_t$.
Existing methods naively use Equation (1) as the objective function, leading to an unconstrained parameter update process. A large number of domain-sensitive parameters are highly sensitive to domain-specific knowledge, such as background, lighting, and texture, and change rapidly to fit such knowledge [31,45]. Although existing methods introduce EMA [46], where $M_t \leftarrow \beta M_{t-1} + (1-\beta) M_t$ with $\beta = 0.999$, to alleviate the performance degradation caused by rapid parameter changes, this parameter weighting approach has several issues: historical parameters carry a large amount of knowledge from past domains, which interferes with the current domain adaptation process, and the hyperparameter cannot be adjusted adaptively. Moreover, EMA is often applied purely as a performance-boosting trick: using it is known to improve performance and omitting it degrades performance, but the underlying cause is rarely examined.
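For reference, a minimal sketch of the EMA-style weighting mentioned above, assuming two copies of the same architecture (an averaged model and the currently updated model) and the quoted $\beta = 0.999$; this only illustrates the weighting scheme being discussed, not a recommended design.

```python
import torch

@torch.no_grad()
def ema_update(averaged_model, current_model, beta: float = 0.999):
    """theta_avg <- beta * theta_avg + (1 - beta) * theta_current, parameter by parameter."""
    for p_avg, p_cur in zip(averaged_model.parameters(), current_model.parameters()):
        p_avg.mul_(beta).add_(p_cur, alpha=1.0 - beta)
```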

3.2.2. Implementation of Domain-Sensitive Parameter Suppression

To tackle this challenge, we propose a novel parameter constraint method that minimizes the representation changes in domain-sensitive channels, effectively mitigating the rapid updates and overfitting of domain-sensitive parameters to domain-specific knowledge.
Now, we consider two consecutive models, $M_t$ and $M_{t-1}$. Theoretically, the parameter difference between two adjacent models is very small, but CTTA is a continuous process in which the model keeps adapting to the target domains, which eventually produces a large cumulative effect. To analyze the change in the feature maps caused by the parameter changes between adjacent moments, let $Z_\ell$ and $Z'_\ell$ be the feature maps generated by the $\ell$-th layer of $M_t(\cdot)$ and $M_{t-1}(\cdot)$, respectively. We can compute the loss given by $Z_\ell$ via the first-order Taylor approximation as follows:
$$\mathcal{L}(G(Z), \hat{y}) \approx \mathcal{L}(G(Z'), \hat{y}) + \sum_{c=1}^{C_\ell} \Big\langle \nabla_{Z'_{\ell,c}} \mathcal{L}(G(Z'), \hat{y}),\; Z_{\ell,c} - Z'_{\ell,c} \Big\rangle_F \tag{2}$$
where $\langle \cdot, \cdot \rangle_F$ is the Frobenius inner product, i.e., $\langle A, B \rangle_F = \mathrm{tr}(A^\top B)$, and $\nabla_{Z'_{\ell,c}} \mathcal{L}(G(Z'), \hat{y})$ is the gradient obtained in the backpropagation of $M_{t-1}(\cdot)$. Using Equation (2), we define the loss increment $\Delta\mathcal{L}(Z_{\ell,c})$ for the $c$-th channel at the $\ell$-th layer caused by parameter changes as follows:
$$\Delta\mathcal{L}(Z_{\ell,c}) := \Big\langle \nabla_{Z'_{\ell,c}} \mathcal{L}(G(Z'), \hat{y}),\; Z_{\ell,c} - Z'_{\ell,c} \Big\rangle_F \tag{3}$$
Thus, we aim to minimize the loss change caused by parameter updates between two adjacent models at consecutive moments, as follows:
$$\min_{M_{t+1}} \sum_{\ell=1}^{L} \sum_{c=1}^{C_\ell} \mathbb{E}_{x \sim X_t^T}\!\big[\Delta\mathcal{L}(Z_{\ell,c})\big]^2 \tag{4}$$
We need to update $M_t(\cdot)$ to $M_{t+1}(\cdot)$ at time $t$ using the data $X_t^T$. If we minimize Equation (4) using standard stochastic gradient descent, then in addition to calculating the gradients of the feature maps produced by $M_t(\cdot)$ with respect to the samples, we also need to calculate and store the gradients of the feature maps produced by $M_{t-1}(\cdot)$, which significantly increases computational cost and memory overhead.
Using the Cauchy-Schwarz inequality, we derive the upper bound for the objective function of Equation (4) as follows:
$$\begin{aligned}
\mathbb{E}\big[\Delta \mathcal{L}(Z_{\ell,c})\big]
&= \mathbb{E}\Big[\big\langle \nabla_{Z'_{\ell,c}} \mathcal{L}(G(Z'),\hat{y}),\, Z_{\ell,c} - Z'_{\ell,c}\big\rangle_F\Big]\\
&\le \mathbb{E}\Big[\big\|\nabla_{Z'_{\ell,c}} \mathcal{L}(G(Z'),\hat{y})\big\|_F \cdot \big\|Z_{\ell,c} - Z'_{\ell,c}\big\|_F\Big]\\
&\le \sqrt{\mathbb{E}\Big[\big\|\nabla_{Z'_{\ell,c}} \mathcal{L}(G(Z'),\hat{y})\big\|_F^2\Big]\cdot \mathbb{E}\Big[\big\|Z_{\ell,c} - Z'_{\ell,c}\big\|_F^2\Big]}
\end{aligned} \tag{5}$$
By optimizing the upper bound of the objective function, we avoid storing large amounts of the gradients of the feature maps or performing additional backpropagation. Substituting Equation (5) into the objective function to minimize Equation (4), we obtain
$$\min_{M_{t+1}} \sum_{\ell=1}^{L} \sum_{c=1}^{C_\ell} \mathbb{E}_{x \sim X_t^T}\!\left[\big\|\nabla_{Z'_{\ell,c}} \mathcal{L}(G(Z'),\hat{y})\big\|_F^2 \cdot \big\|Z_{\ell,c} - Z'_{\ell,c}\big\|_F^2\right] \tag{6}$$
According to Figure 3, we observe an interesting phenomenon: the degree of instability of the channel representations is positively correlated with the magnitude of the gradient values. Larger gradients cause the parameters to change rapidly, indicating that domain-sensitive parameters are sensitive to domain-specific knowledge and fit such knowledge through rapid changes, leading to unstable channel representations that are prone to variations and contain a large amount of domain-specific knowledge. This is consistent with the conclusion derived from Equation (6): the larger the gradient value, the stronger the suppression applied to the corresponding parameters.
Therefore, we define a channel sensitivity importance weight $I_{\ell,c}^{t}$:
$$I_{\ell,c}^{t} = \big\|\nabla_{Z'_{\ell,c}} \mathcal{L}(G(Z'),\hat{y})\big\|_F^2 \tag{7}$$
However, we cannot directly optimize using this weight: the magnitudes of channel gradients differ significantly across layers, so the scale of the importance must be balanced across layers. Finally, we construct the channel sensitivity importance weight $I_{\ell,c}^{t}$ as follows:
$$I_{\ell,c}^{t} =
\begin{cases}
0, & t = 0\\[6pt]
I_{\ell,c}^{t-1} + \dfrac{\big\|\nabla_{Z'_{\ell,c}} \mathcal{L}(G(Z'),\hat{y})\big\|_F^2}{\frac{1}{C_\ell}\sum_{c'=1}^{C_\ell}\big\|\nabla_{Z'_{\ell,c'}} \mathcal{L}(G(Z'),\hat{y})\big\|_F^2}, & t > 0
\end{cases} \tag{8}$$
Therefore, the final parameter constraint objective function can be expressed as
$$\mathcal{L}_{DSP} = \min_{M_{t+1}} \sum_{\ell=1}^{L}\sum_{c=1}^{C_\ell} \mathbb{E}_{x\sim X_t^T}\!\left[ I_{\ell,c}^{t}\cdot \big\|Z_{\ell,c} - Z'_{\ell,c}\big\|_F^2\right] \tag{9}$$
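A minimal PyTorch sketch of the DSP constraint is given below. It assumes the per-layer feature maps of $M_t$ and $M_{t-1}$, and the gradients with respect to the latter, have already been collected (e.g., with forward hooks); the helper names are hypothetical and this is only an illustration of Equations (7)-(9), not the authors' released implementation.

```python
import torch

def update_importance(importance, grads_prev, eps: float = 1e-12):
    """Accumulate the per-channel sensitivity weights I_{l,c}^t (Equations (7) and (8)).
    grads_prev: per-layer gradients of the loss w.r.t. the previous model's feature maps,
    each of shape (B, C_l, H_l, W_l); `importance` is [] at t = 0."""
    updated = []
    for l, g in enumerate(grads_prev):
        grad_sq = g.pow(2).flatten(2).sum(-1).mean(0)        # ||grad||_F^2 per channel, Eq. (7)
        normalized = grad_sq / (grad_sq.mean() + eps)        # balance scales across layers, Eq. (8)
        prev = importance[l] if importance else torch.zeros_like(normalized)
        updated.append(prev + normalized)
    return updated

def dsp_loss(importance, feats_curr, feats_prev):
    """L_DSP = sum_l sum_c I_{l,c}^t * ||Z_{l,c} - Z'_{l,c}||_F^2 (Equation (9))."""
    loss = 0.0
    for I_l, z, z_prev in zip(importance, feats_curr, feats_prev):
        diff_sq = (z - z_prev.detach()).pow(2).flatten(2).sum(-1)   # (B, C_l)
        loss = loss + (I_l * diff_sq).sum(-1).mean()                # sum over channels, mean over batch
    return loss
```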

3.2.3. Theoretical Analysis of Domain-Sensitive Parameter Suppression

In this section, we prove from the perspective of Lipschitz continuity that our method can effectively suppress domain-sensitive parameters, promote the learning of domain-invariant knowledge, and enhance the generalization ability of the model.
We first provide the definition of Lipschitz continuity. Given $\Omega \subseteq \mathbb{R}^n$, let $\theta_1 \in \Omega$ and $\theta_2 \in \Omega$. For a function $h: \Omega \rightarrow \mathbb{R}^m$, if there exists a constant $K$ such that the following holds:
$$\|h(\theta_1) - h(\theta_2)\|_2 \le K\,\|\theta_1 - \theta_2\|_2, \quad \forall\, \theta_1, \theta_2 \in \Omega \tag{10}$$
then $h$ is called Lipschitz continuous.
According to existing studies [33,47], if the loss function has a smaller Lipschitz constant K, it indicates that the loss function landscape is flatter, which consequently leads to better model generalization. On the contrary, if the Lipschitz constant K is very large, it indicates that the model parameters are highly sensitive to input variations, with even minor changes leading to drastic shifts in the model’s output, which means that the model lacks generalization ability.
Now, consider the model parameters $\theta_{t-1}$ and $\theta_t$ at two consecutive moments during continual domain adaptation. By the mean value theorem, the change in the loss function can be expressed as:
$$\big\|\mathcal{L}(\theta_t) - \mathcal{L}(\theta_{t-1})\big\|_2 = \big\|\nabla\mathcal{L}(\zeta)^{\top}(\theta_t - \theta_{t-1})\big\|_2 \tag{11}$$
where $\zeta = c\,\theta_t + (1-c)\,\theta_{t-1}$ and $c \in [0, 1]$. By applying the Cauchy-Schwarz inequality, we have
$$\big\|\mathcal{L}(\theta_t) - \mathcal{L}(\theta_{t-1})\big\|_2 \le \big\|\nabla\mathcal{L}(\zeta)\big\|_2\,\big\|\theta_t - \theta_{t-1}\big\|_2 \tag{12}$$
Considering $\theta_t = \theta_{t-1} - \eta\,\nabla_{\theta_{t-1}}\mathcal{L}$, we have $\theta_t \approx \theta_{t-1}$, so $\nabla\mathcal{L}(\zeta) \approx \nabla\mathcal{L}(\theta_{t-1})$, and the above can be rewritten as
$$\big\|\mathcal{L}(\theta_t) - \mathcal{L}(\theta_{t-1})\big\|_2 \le \big\|\nabla\mathcal{L}(\theta_{t-1})\big\|_2\,\big\|\theta_t - \theta_{t-1}\big\|_2 \tag{13}$$
From Equation (13), it can be seen that minimizing $\|\nabla\mathcal{L}(\theta_{t-1})\|_2$ is equivalent to minimizing the Lipschitz constant $K$. According to Equation (9), we have
$$\mathcal{L}_{DSP} \propto I_{\ell,c}^{t} = \big\|\nabla_{Z'_{\ell,c}} \mathcal{L}(G(Z'),\hat{y})\big\|_F^2 \tag{14}$$
Thus, based on Equation (14), penalizing the gradient norm forces the model parameters to generate smaller gradient norms, which is equivalent to reducing the Lipschitz constant of the model. As a result, the model achieves better generalization performance.

3.3. Virtual Decision Boundary

3.3.1. Motivation of Virtual Decision Boundary

In Figure 4, we clarify the reasons behind the lack of robustness in the decision boundary and present our proposed solutions. In Figure 4a, we observe that during domain adaptation, the features extracted by the feature extractor F ( · ) tend to cluster near or slightly cross the decision boundary, which is indicated by the red boxes. In Figure 4b, we examine domain shifts, where we assume that two successive adapted domains have deviations ε k and ε k + 1 from the source domain. The deviation between them is Δ ε = ε k + 1 ε k . When Δ ε > 0 , the feature bias increases, leading to the possibility that features clustered near the decision boundary may cross over, resulting in errors. To address these challenges, we propose the virtual decision boundary. As shown in Figure 4c, this method introduces a virtual margin between the original decision boundary and a newly created virtual decision boundary. This virtual margin pushes features away from the original boundary, reducing the likelihood of clustering near it. Additionally, the virtual margin provides sufficient buffer space to help prevent features from crossing the decision boundary during domain shifts.

3.3.2. Implementation of Virtual Decision Boundary

Before addressing this issue, we start with a simple binary classification problem. Consider a binary classification problem in which we have a sample $x$ from class 1 and use the feature extractor $Z = F(x)$ to obtain its features. We introduce a parameter $m$ that scales the angle in the inequality, forming a stricter virtual decision boundary. This can be mathematically expressed as $\|W_1\|_2\|Z\|_2\cos(m\theta_1) > \|W_2\|_2\|Z\|_2\cos(\theta_2)$ with $0 \le \theta_1 \le \frac{\pi}{m}$, where $\theta_j$ is the angle between the classifier vector $W_j$ and the feature vector $Z$, and $m \in [1, +\infty)$, as the following inequality holds:
$$\|W_1\|_2\|Z\|_2\cos(\theta_1) \ge \|W_1\|_2\|Z\|_2\cos(m\theta_1) > \|W_2\|_2\|Z\|_2\cos(\theta_2) \tag{15}$$
Therefore, $\|W_1\|_2\|Z\|_2\cos(\theta_1) > \|W_2\|_2\|Z\|_2\cos(\theta_2)$ has to hold. The new criterion is thus a stronger requirement for correctly classifying $x$, producing a more rigorous virtual decision boundary for class 1. The loss function for binary classification of class 1 can then be written as
$$\mathcal{L} = -\log \frac{e^{\|W_1\|_2\|Z\|_2\cos(m\theta_1)}}{e^{\|W_1\|_2\|Z\|_2\cos(m\theta_1)} + e^{\|W_2\|_2\|Z\|_2\cos(\theta_2)}} \tag{16}$$
We can apply the same constraint to class 2 as we did for class 1. In CTTA, we extend this binary formulation to the multiclass setting:
$$\mathcal{L} = -\frac{1}{|N|}\sum_{i=1}^{N} \log \frac{e^{\|W_c\|_2\|Z_i\|_2\cos(m\theta_c)}}{e^{\|W_c\|_2\|Z_i\|_2\cos(m\theta_c)} + \sum_{j\ne c}^{C} e^{\|W_j\|_2\|Z_i\|_2\cos(\theta_j)}} \tag{17}$$
In addition, in order to make Equation (17) hold for all $\theta \in [0, \pi]$, we construct $\psi(\theta)$ as follows:
$$\psi(\theta) =
\begin{cases}
\cos(m\theta), & 0 \le \theta \le \frac{\pi}{m}\\[4pt]
D(\theta), & \frac{\pi}{m} < \theta \le \pi
\end{cases} \tag{18}$$
where $m$ is a non-negative number closely related to the classification margin. With a larger $m$, the classification margin becomes larger and the learning objective becomes harder. Meanwhile, $D(\theta)$ is required to be a monotonically decreasing function, and $D(\frac{\pi}{m})$ should equal $\cos(\frac{\pi}{m})$. We construct a specific $\psi(\theta)$ as follows:
$$\psi(\theta) = (-1)^k \cos(m\theta) - 2k, \quad \theta \in \left[\tfrac{k\pi}{m}, \tfrac{(k+1)\pi}{m}\right] \tag{19}$$
where $k \in [0, m-1]$ and $k$ is an integer. The final virtual decision boundary loss can then be represented as follows:
$$\mathcal{L}_{VDB} = -\frac{1}{|N|}\sum_{i=1}^{N} \log \frac{e^{\|W_c\|_2\|Z_i\|_2\,\psi(\theta_c)}}{e^{\|W_c\|_2\|Z_i\|_2\,\psi(\theta_c)} + \sum_{j\ne c}^{C} e^{\|W_j\|_2\|Z_i\|_2\cos(\theta_j)}} \tag{20}$$
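Below is a minimal sketch of this loss written in the spirit of large-margin softmax. The inputs (a feature matrix, the classifier weight matrix, and pseudo-labels used in place of ground truth at test time) and the default margin $m$ are assumptions for illustration; this is not the authors' exact implementation.

```python
import math
import torch
import torch.nn.functional as F

def psi(theta, m: float):
    """psi(theta) = (-1)^k * cos(m*theta) - 2k for theta in [k*pi/m, (k+1)*pi/m] (Equation (19))."""
    k = torch.clamp(torch.floor(theta * m / math.pi), max=m - 1.0)
    sign = 1.0 - 2.0 * (k % 2)                                           # (-1)^k for integer-valued k
    return sign * torch.cos(m * theta) - 2.0 * k

def vdb_loss(features, classifier_weight, pseudo_labels, m: float = 4.0):
    """Large-margin style loss of Equation (20), applied to a pseudo-labelled test batch."""
    w_norm = classifier_weight.norm(dim=1)                               # ||W_j||_2, shape (C,)
    z_norm = features.norm(dim=1)                                        # ||Z_i||_2, shape (B,)
    cos_theta = F.linear(F.normalize(features), F.normalize(classifier_weight)).clamp(-1.0, 1.0)
    logits = w_norm.unsqueeze(0) * z_norm.unsqueeze(1) * cos_theta       # ||W_j|| ||Z_i|| cos(theta_j)
    theta_c = torch.acos(cos_theta.gather(1, pseudo_labels[:, None]).squeeze(1))
    target_logit = w_norm[pseudo_labels] * z_norm * psi(theta_c, m)      # ||W_c|| ||Z_i|| psi(theta_c)
    logits = logits.scatter(1, pseudo_labels[:, None], target_logit[:, None])
    return F.cross_entropy(logits, pseudo_labels)                        # -log softmax, as in Eq. (20)
```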

3.3.3. Dynamic Virtual Margin of Virtual Decision Boundary

In the virtual decision boundary method, the width of the virtual margin m is not uniform across different domains and classes at any given time. This variability arises because different domains experience varying degrees of domain shift, and each class is affected differently, leading to varying densities of sample features near the decision boundary. To address this, we propose a dynamic margin that adjusts the width of the virtual margin according to the shift degree of each domain and class. A practical way to implement this is by using the classifier from the source model as a reference point. The deviation of each class from its corresponding class in the source domain can be quantified using cosine distance. Additionally, the overall domain shift can be assessed by calculating the average cosine distance of all samples from their positions in the source domain. Specifically, we first measure the shift degree D ¯ of the current domain relative to the source domain:
$$\bar{D} = \frac{1}{2}\left(1 - \frac{1}{|N|}\sum_{i\in N} \frac{W_c \cdot Z_i}{\|W_c\|_2\,\|Z_i\|_2}\right) \tag{21}$$
Then, we measure the shift degree D c of each class relative to the corresponding class in the source domain:
$$D_c = \frac{1}{2}\left(1 - \frac{1}{|N_c|}\sum_{j\in N_c} \frac{W_c \cdot Z_j}{\|W_c\|_2\,\|Z_j\|_2}\right), \quad \forall\, c \in C \tag{22}$$
Thus, at any given time t, the dynamic margin for each class in different domains can be expressed as
$$m_c^{t} =
\begin{cases}
m, & t = 0\\[6pt]
\beta\, m_c^{t-1} + (1-\beta)\left(\dfrac{D_c}{\max(D)}\cdot e^{\bar{D}} + 1\right), & t > 0
\end{cases} \tag{23}$$
where $D = \{D_1, D_2, \ldots, D_C\}$, and $\beta$ is set to 0.999 so that the margin is updated robustly, preventing drastic changes in its value. This dynamic adjustment allows the model to adapt more effectively to the specific shifts encountered across different domains and classes, enhancing its robustness and accuracy. Here, $m$ is a hyperparameter used to initialize the virtual margin, and its specific setting is discussed in detail in the experimental section.
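A minimal sketch of one dynamic-margin update under these equations, assuming the source classifier weights are kept as the reference and pseudo-labels select the class weight $W_c$; function and argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_dynamic_margin(m_prev, features, pseudo_labels, source_classifier_weight,
                          m_init: float = 1.0, beta: float = 0.999, eps: float = 1e-12):
    """One update of the per-class dynamic margin m_c^t (Equations (21)-(23))."""
    num_classes = source_classifier_weight.size(0)
    if m_prev is None:                                          # t = 0: initialize every margin to m
        return torch.full((num_classes,), m_init)

    cos = F.linear(F.normalize(features), F.normalize(source_classifier_weight))   # (B, C)
    cos_own = cos.gather(1, pseudo_labels[:, None]).squeeze(1)                     # cos(Z_i, W_c)
    d_bar = 0.5 * (1.0 - cos_own.mean())                                           # domain shift, Eq. (21)

    d_class = torch.zeros(num_classes)
    for c in range(num_classes):                                                   # class shift, Eq. (22)
        mask = pseudo_labels == c
        if mask.any():
            d_class[c] = 0.5 * (1.0 - cos_own[mask].mean())

    target = d_class / (d_class.max() + eps) * torch.exp(d_bar) + 1.0              # Eq. (23), t > 0
    return beta * m_prev + (1.0 - beta) * target
```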

3.4. Loss Function

Based on Equations (1), (9) and (20), we can derive the following overall loss function:
$$\mathcal{L}_{all} = \mathcal{L}\big(M(X^{T}),\, \hat{Y}^{T}\big) + \mathcal{L}_{DSP} + \mathcal{L}_{VDB} \tag{24}$$
We constrain the model using this function to enhance its sustained capability.
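Putting the pieces together, one adaptation step could look like the following sketch. The `return_features=True` interface, the `model.G` classifier handle, the pseudo-label cross-entropy standing in for the base loss, and the `dsp_loss`/`vdb_loss` helpers from the earlier sketches are assumptions made for illustration, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def adaptation_step(model, prev_model, optimizer, x, importance, margin):
    """One test-time update with L_all = L(M(X^T), Y_hat^T) + L_DSP + L_VDB (Equation (24))."""
    logits, feats_curr = model(x, return_features=True)          # hypothetical interface returning
    with torch.no_grad():                                        # per-layer feature maps
        _, feats_prev = prev_model(x, return_features=True)
    pseudo_labels = logits.argmax(dim=1)

    task_loss = F.cross_entropy(logits, pseudo_labels)           # stand-in for L(M(X^T), Y_hat^T)
    loss = (task_loss
            + dsp_loss(importance, feats_curr, feats_prev)       # sketch from Section 3.2.2
            + vdb_loss(feats_curr[-1].flatten(1), model.G.weight,
                       pseudo_labels, margin))                   # sketch from Section 3.3.2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```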

4. Experiment and Results

4.1. Experimental Setup

4.1.1. Datasets and Task Setting

Building on the previous works [5,6,44], our method undergoes evaluation on three classification CTTA benchmarks, which encompass CIFAR10-to-CIFAR10C, CIFAR100-to-CIFAR100C, and ImageNet-to-ImageNetC. In the segmentation CTTA, we conduct assessments on the Cityscapes-to-ACDC, using the Cityscapes [48] as the source domain and the ACDC [49] as the target domain.

4.1.2. Compared Methods and Implementation Details

We compare our method with the original model (Source) and multiple CTTA methods, including BN [50], TENT [12], CoTTA [5], RoTTA [44], SATA [51], RMT [6], PETAL [7], DSS [8], Reshaping [10]. For the classification task, all methods are implemented using the same backbone architecture and pretrained model as used in our approach. Specifically, we utilize the pretrained WideResNet-28 [52] for CIFAR10C, ResNeXt-29 [53] for CIFAR100C, and ResNet-50 [54] for ImageNetC, and use the largest corruption severity (level 5). For the segmentation CTTA task, we use the ACDC dataset as the target domain, which includes images captured under four distinct, unobserved visual conditions: Fog, Night, Rain, and Snow. To simulate continuous environmental changes akin to real-world scenarios, we cyclically iterate through the same sequence of target domains (Fog → Night → Rain → Snow) multiple times.

4.2. Classification CTTA Tasks

4.2.1. CIFAR10-to-CIFAR10C Gradual and Continual

For the CIFAR10-to-CIFAR10C task, we evaluate our methods under two distinct settings. The first setting is a gradual task; the model sequentially adapts to fifteen target domains where the corruption severity level gradually changes between the lowest and highest extremes. The corruption type only changes when the severity reaches its lowest point. As shown in Table 1, our method achieves the lowest error rate of 8.3%, representing a 2.1% improvement over the CoTTA method.
The second setting is a standard continual task where the model sequentially adapts to fifteen target domains, each with a corruption severity level of 5. The results, shown in Table 2, indicate that directly applying the source domain model yields an average error rate of 43.5%. The BN [50] method improves this performance by 23.1% compared with the source-only baseline. Among all compared methods, DSS achieves the lowest error rates of 12.2% on motion. SATA [51] achieves the lowest error rates of 10.2%, 14.1%, 13.2%, 10.3% on zoom, snow, frost, and contrast, respectively. Reshaping [10] achieves the lowest error rates of 17.1%, 12.7%, 15.9% on elastic, pixelate, and jpeg. In other scenarios, our proposed method either outperforms or is on par with the other approaches, ultimately achieving the lowest overall average error rate, reduced to 14.5%.

4.2.2. CIFAR100-to-CIFAR100C

The results for the CIFAR100-to-CIFAR100C continual task, as shown in Table 3, further demonstrate the effectiveness of our method. Our approach achieves the lowest error rates across the Gaussian, shot, impulse, glass, motion, zoom, snow, frost, elastic, and jpeg, and it also obtains the lowest overall average error rate. Compared with the source-only baseline, our method improves performance by 17.9%, and it surpasses the Reshaping [10] method with a further 1.2% reduction in error rate.

4.2.3. ImageNet-to-ImageNetC

Table 4 presents the performance comparison for various methods on the challenging ImageNet-to-ImageNetC continual task. Our method stands out by achieving the lowest average error rate among all the methods evaluated. Notably, it significantly outperforms the recently proposed Reshaping method across several difficult corruption types, including Gaussian (72.2% vs. 78.5%), shot (70.7% vs. 75.3%), impulse (68.3% vs. 73.0%), and glass (71.3% vs. 73.1%).

4.3. Semantic Segmentation CTTA Task

Cityscapes-to-ACDC

We validate the effectiveness of our approach in the more challenging segmentation CTTA task by adapting the pretrained Segformer model from the Cityscapes dataset to the ACDC dataset, as shown in Table 5. Our method outperforms the previous entropy minimization method (TENT [12]), teacher–student method (CoTTA [5]), and sample reshape method (Reshaping [10]) by 9.7%, 4.4%, and 0.4%, respectively. Notably, our method demonstrates better stability compared with others, with performance continuously improving throughout the adaptation process. This is attributed to the effective suppression of domain-sensitive parameters, which forces the model to learn more domain-invariant knowledge while mitigating the interference from domain-specific knowledge.

4.4. Ablation Study and Further Analysis

4.4.1. Ablation Study

We conducted an ablation study to assess the effectiveness of the key components of our approach across three benchmarks. For clarity, we refer to the virtual decision boundary as VDB and Domain Sensitivity Parameter Suppression as DSP. As shown in Table 6, incorporating VDB and DSP results in reduced error rates across all benchmarks. The combination of VDB and DSP leads to an even greater reduction in error rates, highlighting the synergistic effect of these components when used together.

4.4.2. Integration with Existing Methods

Next, we integrate our method with existing methods, namely TENT [12], CoTTA [5], RMT [6], and DSS [8]. The experiments are conducted on the CIFAR10C, CIFAR100C, and ImageNetC datasets. Using the official code of each method, our approach improves the accuracy of all of them, as shown in Table 7. For example, our method reduces the error rate of CoTTA from 16.2% to 14.2% on CIFAR10C, from 32.5% to 26.9% on CIFAR100C, and from 62.6% to 57.8% on ImageNetC. These experiments and results demonstrate that our method can be seamlessly integrated with other CTTA methods to enhance performance.

4.4.3. Analysis of Domain Sensitivity Parameter Suppression

In this section, we focus on evaluating the effectiveness of Domain Sensitivity Parameter Suppression (DSP). The dataset used is CIFAR10C, and the pretrained network is WideResNet-28. The compared methods include CoTTA [5], DSS [8], and RMT [6]. First, we measure the model’s sensitivity to the domain by calculating the unstable representations of all channels in the network. The calculation formula is $\mathrm{Std}_{channels} = \sum_{l=0}^{L}\sum_{i=0}^{C_l} \big\|\tilde{C}_i^l - C_i^l\big\|_2^2$, where $\tilde{C}_i^l$ represents the channel features generated by the source domain data in the target domain network, and $C_i^l$ represents the channel features generated by the target domain data in the target domain network. As shown in Figure 5, our method consistently achieves the lowest unstable activation values across all domains. This indicates that DSP effectively suppresses the learning of domain-specific knowledge by regulating the updates of domain-sensitive parameters. This, in turn, enhances the learning of domain-invariant knowledge and prevents the model from overfitting to current domain knowledge during continuous domain shifts, thereby mitigating interference with future domains that may be encountered. Furthermore, we examined the relationship between the channels in the first layer of the network and their corresponding unstable activation values. As shown in Figure 6, the unstable representations of DSP exhibit a smaller variance, with activation values concentrating around 0.35 across all channels. This demonstrates that the introduction of DSP can effectively control channel instability, enabling the model to maintain strong robustness and effectively counteract the effects of domain shifts.
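A minimal sketch of this instability measure, assuming per-layer feature maps for source-domain and target-domain inputs are collected from the adapted network and compared per channel after batch averaging (the batch averaging is an assumption made for illustration):

```python
import torch

@torch.no_grad()
def channel_instability(feats_source, feats_target):
    """Std_channels = sum_l sum_i ||C~_i^l - C_i^l||_2^2, where the two lists hold per-layer
    feature maps of shape (B, C_l, H_l, W_l) produced by the adapted network for
    source-domain and target-domain inputs, respectively."""
    total = 0.0
    for z_src, z_tgt in zip(feats_source, feats_target):
        per_channel = (z_src.mean(0) - z_tgt.mean(0)).pow(2).flatten(1).sum(-1)   # (C_l,)
        total += per_channel.sum().item()
    return total
```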
Additionally, inspired by adversarial training, we incorporate an auxiliary discriminator network to evaluate the effectiveness of DSP in promoting the learning of domain-invariant representations. Specifically, we add a source domain discriminator to the first layer of the WideResNet-28 network and train this discriminator on the CIFAR-10 dataset. The source domain discriminator takes the output of the first layer as input and generates a score indicating the likelihood that the input belongs to the source domain. Intuitively, if the domain-sensitive parameters are effectively suppressed, the model’s output should remain highly robust and well aligned with the source domain, regardless of the domain shifts. As shown in Figure 7, compared with other methods on CIFAR10C, our approach consistently maintains high robustness to the source domain across all domains, indicating that the domain-sensitivity parameters have been effectively suppressed.

4.5. Analysis of Dynamic Virtual Margin

In this section, we focus on discussing the parameter initialization of the virtual decision margin m and the superiority of dynamic virtual margins. First, we explore the effects of different initialization values for the parameters on three datasets (CIFAR10C, CIFAR100C, ImageNetC). As shown in Table 8, the initialization of the virtual decision margin parameters is highly robust. When m is set between 1 and 6, it performs well across all three datasets. However, when m exceeds 6, the performance begins to gradually decline. This is because a larger virtual margin increases the difficulty of model convergence during the initial phase of domain adaptation, leading to a drop in performance. We recommend initializing with a smaller virtual margin, allowing the margin to adaptively adjust to an appropriate value to avoid convergence difficulties caused by a large margin in the early stages of adaptation.
Overall, the dynamic virtual margin can adaptively adjust the margin width based on domain and class difficulty, and is not sensitive to initialization parameters, showing good robustness with excellent performance across a broad initialization range.
Next, we investigate the performance of fixed and dynamic virtual margins. As shown in Table 9, dynamic virtual margins outperform fixed virtual margins on all datasets. This is due to the dynamic margin’s ability to adaptively adjust the margin size based on domain and class difficulty, achieving optimal performance.

4.6. Analysis of Virtual Decision Boundary

We evaluate the effectiveness of our virtual decision boundary (VDB) method in mitigating features crossing the decision boundary by analyzing inter-class and intra-class distances. Our VDB method is proposed to alleviate the issue of features clustering at or slightly crossing the decision boundary. First, we utilize t-SNE to visualize the sample features generated by three different methods (DSS [8], CoTTA [5], RMT [6]) and our method in the Gaussian domain of the CIFAR10C dataset. As shown in Figure 8, the results demonstrate that the features generated by our method exhibit better clustering performance, with smaller intra-class distances and larger inter-class distances. Second, if class feature shifts are well controlled, the inter-class distance should increase and the intra-class distance should decrease, which effectively prevents features from crossing the decision boundary. The intra-class distance is expressed as $d_{intra} = \sum_{i \in C} \|z_i - \bar{z}_i\|_2$, and the inter-class distance is expressed as $d_{inter} = \sum_{i\in C}\sum_{j\ne i}^{C} \|z_i - \bar{z}_j\|_2$, where $\bar{z}$ represents the mean feature of a class. As shown in Figure 9 and Figure 10, we compare our approach with three other methods, including DSS [8], CoTTA [5], and RMT [6]. The results demonstrate that our method excels in managing class shifts during CTTA, consistently achieving more desirable intra-class and inter-class distances across all domains. In particular, to evaluate the effectiveness of our virtual decision boundary in the early stages of domain adaptation, we counted the number of features crossing the decision boundary for each method at this stage. As shown in Figure 11, our method consistently outperforms others across all domains.
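A minimal sketch of these two distances, assuming a feature matrix and (pseudo-)labels for one domain and that every class appears in the batch; $d_{intra}$ is measured from each sample to its own class mean and $d_{inter}$ from each sample to the means of the other classes.

```python
import torch

@torch.no_grad()
def class_distances(features, labels, num_classes):
    """Returns (d_intra, d_inter); assumes every class in [0, num_classes) occurs in `labels`."""
    means = torch.stack([features[labels == c].mean(0) for c in range(num_classes)])  # (C, D)
    dist = torch.cdist(features, means)                       # (N, C) Euclidean distances to class means
    own = dist.gather(1, labels[:, None]).squeeze(1)          # distance to own class mean
    d_intra = own.sum().item()
    d_inter = (dist.sum() - own.sum()).item()                 # distances to all other class means
    return d_intra, d_inter
```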

4.7. Time and Parameter Complexity Evaluation

In this section, we focus on analyzing the time and space complexity of our method compared with other methods. The datasets used include CIFAR10C, CIFAR100C, and ImageNetC, and the methods for comparison include TENT [12], CoTTA [5], DSS [8], and RMT [6]. As shown in Table 10, the time required by our method is only 3 min, 4 min, and 13 min more than the most time-consuming DSS [8] method on the CIFAR10C, CIFAR100C, and ImageNetC datasets, respectively. The additional time consumption arises because our method requires computing the channel feature maps from the previous model (without computing gradients) and calculating the sensitivity loss function. As shown in Figure 12, compared with CoTTA [5], DSS [8], and RMT [6], our method requires more GPU memory at runtime, approximately 0.8 GB, due to the additional storage of feature maps computed by the previous model. This additional time and space overhead is small and can largely be ignored, while yielding significant performance improvements.

5. Conclusions

Unlike previous research, which tends to concentrate on isolated domain adaptation and settles for local optima, this paper focuses on the often overlooked sustained capability in Continual Test-Time Adaptation. To address this issue, we propose a Dual Constraints method that combines parameter constraints based on channel representations and virtual decision boundary constraints based on features. The parameter constraint penalizes the model’s reliance on domain-specific knowledge, forcing it to learn domain-invariant knowledge while alleviating the channel instability and mutual interference caused by overfitting of domain-sensitive parameters; we also provide theoretical evidence that this constraint enhances the model’s generalization ability and promotes the learning of domain-invariant knowledge. Meanwhile, the virtual decision boundary constraint pushes features away from the original decision boundary, forming a virtual margin that buffers against domain shifts and effectively reduces the mutual interference between domain-specific knowledge. Our extensive experiments on multiple CTTA benchmark datasets demonstrate the effectiveness of the proposed methods.

Author Contributions

Conceptualization, P.L. and Y.W.; Methodology, Y.S. and P.L.; Software, P.L.; Investigation, Y.W.; Writing—review & editing, Y.S. and P.L.; Visualization, Y.W.; Supervision, Y.S.; Funding acquisition, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under the Grant No. 62002330.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hu, C.; Hudson, S.; Ethier, M.; Al-Sharman, M.; Rayside, D.; Melek, W. Sim-to-real domain adaptation for lane detection and classification in autonomous driving. In Proceedings of the 2022 IEEE Intelligent Vehicles Symposium (IV), Aachen, Germany, 4–9 June 2022; IEEE: New York, NY, USA, 2022; pp. 457–463. [Google Scholar]
  2. Shi, Z.; Su, T.; Liu, P.; Wu, Y.; Zhang, L.; Wang, M. Learning Frequency-Aware Dynamic Transformers for All-In-One Image Restoration. arXiv 2024, arXiv:2407.01636. [Google Scholar] [CrossRef]
  3. Chen, W.; Miao, L.; Gui, J.; Wang, Y.; Li, Y. FLsM: Fuzzy Localization of Image Scenes Based on Large Models. Electronics 2024, 13, 2106. [Google Scholar] [CrossRef]
  4. Xu, E.; Zhu, J.; Zhang, L.; Wang, Y.; Lin, W. Research on Aspect-Level Sentiment Analysis Based on Adversarial Training and Dependency Parsing. Electronics 2024, 13, 1993. [Google Scholar] [CrossRef]
  5. Wang, Q.; Fink, O.; Van Gool, L.; Dai, D. Continual test-time domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7201–7211. [Google Scholar]
  6. Döbler, M.; Marsden, R.A.; Yang, B. Robust mean teacher for continual and gradual test-time adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7704–7714. [Google Scholar]
  7. Brahma, D.; Rai, P. A probabilistic framework for lifelong test-time adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 3582–3591. [Google Scholar]
  8. Wang, Y.; Hong, J.; Cheraghian, A.; Rahman, S.; Ahmedt-Aristizabal, D.; Petersson, L.; Harandi, M. Continual test-time domain adaptation via dynamic sample selection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 1701–1710. [Google Scholar]
  9. Liu, J.; Yang, S.; Jia, P.; Zhang, R.; Lu, M.; Guo, Y.; Xue, W.; Zhang, S. Vida: Homeostatic visual domain adapter for continual test time adaptation. arXiv 2023, arXiv:2306.04344. [Google Scholar]
  10. Zhu, Z.; Hong, X.; Ma, Z.; Zhuang, W.; Ma, Y.; Dai, Y.; Wang, Y. Reshaping the Online Data Buffering and Organizing Mechanism for Continual Test-Time Adaptation. In Proceedings of the European Conference on Computer Vision, Dublin, Ireland, 22–23 October 2025; Springer: Berlin/Heidelberg, Germany, 2025; pp. 415–433. [Google Scholar]
  11. Liu, J.; Xu, R.; Yang, S.; Zhang, R.; Zhang, Q.; Chen, Z.; Guo, Y.; Zhang, S. Continual-MAE: Adaptive Distribution Masked Autoencoders for Continual Test-Time Adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 28653–28663. [Google Scholar]
  12. Wang, D.; Shelhamer, E.; Liu, S.; Olshausen, B.; Darrell, T. Tent: Fully test-time adaptation by entropy minimization. arXiv 2020, arXiv:2006.10726. [Google Scholar]
  13. DeVore, R.A.; Temlyakov, V.N. Some remarks on greedy algorithms. Adv. Comput. Math. 1996, 5, 173–187. [Google Scholar] [CrossRef]
  14. Knowles, J.D.; Watson, R.A.; Corne, D.W. Reducing local optima in single-objective problems by multi-objectivization. In Proceedings of the International Conference on Evolutionary Multi-Criterion Optimization, Zurich, Switzerland, 7–9 March 2001; Springer: Berlin/Heidelberg, Germany, 2001; pp. 269–283. [Google Scholar]
  15. Wang, D.Z.; Lo, H.K. Global optimum of the linearized network design problem with equilibrium flows. Transp. Res. Part B Methodol. 2010, 44, 482–492. [Google Scholar] [CrossRef]
  16. Zhou, W.; Zhou, Z. Unsupervised Domain Adaption Harnessing Vision-Language Pre-Training. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 8201–8214. [Google Scholar] [CrossRef]
  17. Wilson, G.; Cook, D.J. A survey of unsupervised deep domain adaptation. ACM Trans. Intell. Syst. Technol. (Tist) 2020, 11, 1–46. [Google Scholar] [CrossRef]
  18. Long, M.; Cao, Y.; Wang, J.; Jordan, M. Learning transferable features with deep adaptation networks. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; PMLR: San Diego, CA, USA, 2015; pp. 97–105. [Google Scholar]
  19. Long, M.; Zhu, H.; Wang, J.; Jordan, M.I. Unsupervised domain adaptation with residual transfer networks. Adv. Neural Inf. Process. Syst. 2016, 29. [Google Scholar]
  20. Long, M.; Zhu, H.; Wang, J.; Jordan, M.I. Deep transfer learning with joint adaptation networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; PMLR: San Diego, CA, USA, 2017; pp. 2208–2217. [Google Scholar]
  21. Fan, X.; Wang, Q.; Ke, J.; Yang, F.; Gong, B.; Zhou, M. Adversarially adaptive normalization for single domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8208–8217. [Google Scholar]
  22. Yang, F.E.; Cheng, Y.C.; Shiau, Z.Y.; Wang, Y.C.F. Adversarial teacher-student representation learning for domain generalization. Adv. Neural Inf. Process. Syst. 2021, 34, 19448–19460. [Google Scholar]
  23. Zhu, W.; Lu, L.; Xiao, J.; Han, M.; Luo, J.; Harrison, A.P. Localized adversarial domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7108–7118. [Google Scholar]
  24. Liang, J.; Hu, D.; Feng, J. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; PMLR: San Diego, CA, USA, 2020; pp. 6028–6039. [Google Scholar]
  25. Lowd, D.; Meek, C. Adversarial learning. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, Chicago, IL, USA, 21–24 August 2005; pp. 641–647. [Google Scholar]
  26. Gopnik, A.; Glymour, C.; Sobel, D.M.; Schulz, L.E.; Kushnir, T.; Danks, D. A theory of causal learning in children: Causal maps and Bayes nets. Psychol. Rev. 2004, 111, 3. [Google Scholar] [CrossRef] [PubMed]
  27. Hospedales, T.; Antoniou, A.; Micaelli, P.; Storkey, A. Meta-learning in neural networks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 5149–5169. [Google Scholar] [CrossRef] [PubMed]
  28. Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 702–703. [Google Scholar]
  29. Zhou, K.; Yang, Y.; Hospedales, T.; Xiang, T. Deep domain-adversarial image generation for domain generalisation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13025–13032. [Google Scholar]
  30. Gan, Y.; Bai, Y.; Lou, Y.; Ma, X.; Zhang, R.; Shi, N.; Luo, L. Decorate the newcomers: Visual domain prompt for continual test time adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 7595–7603. [Google Scholar]
  31. Wang, H.; Ge, S.; Lipton, Z.; Xing, E.P. Learning robust global representations by penalizing local predictive power. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  32. Guo, J.; Qi, L.; Shi, Y. Domaindrop: Suppressing domain-sensitive channels for domain generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 19114–19124. [Google Scholar]
  33. Zhao, Y.; Zhang, H.; Hu, X. Penalizing gradient norm for efficiently improving generalization in deep learning. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; PMLR: San Diego, CA, USA, 2022; pp. 26982–26992. [Google Scholar]
  34. Cong, W.; Cong, Y.; Sun, G.; Liu, Y.; Dong, J. Self-Paced Weight Consolidation for Continual Learning. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 2209–2222. [Google Scholar] [CrossRef]
  35. Zhang, W.; Huang, Y.; Zhang, W.; Zhang, T.; Lao, Q.; Yu, Y.; Zheng, W.S.; Wang, R. Continual Learning of Image Classes With Language Guidance From a Vision-Language Model. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 13152–13163. [Google Scholar] [CrossRef]
  36. Yu, D.; Zhang, M.; Li, M.; Zha, F.; Zhang, J.; Sun, L.; Huang, K. Contrastive Correlation Preserving Replay for Online Continual Learning. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 124–139. [Google Scholar] [CrossRef]
  37. Li, H.; Liao, L.; Chen, C.; Fan, X.; Zuo, W.; Lin, W. Continual Learning of No-Reference Image Quality Assessment With Channel Modulation Kernel. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 13029–13043. [Google Scholar] [CrossRef]
  38. Li, K.; Chen, H.; Wan, J.; Yu, S. ESDB: Expand the Shrinking Decision Boundary via One-to-Many Information Matching for Continual Learning With Small Memory. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 7328–7343. [Google Scholar] [CrossRef]
  39. Shi, Z.; Liu, P.; Su, T.; Wu, Y.; Liu, K.; Song, Y.; Wang, M. Densely Distilling Cumulative Knowledge for Continual Learning. arXiv 2024, arXiv:2405.09820. [Google Scholar] [CrossRef]
  40. Wu, Y.; Chi, Z.; Wang, Y.; Plataniotis, K.N.; Feng, S. Test-time domain adaptation by learning domain-aware batch normalization. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March; Volume 38, pp. 15961–15969.
  41. Zhang, J.; Qi, L.; Shi, Y.; Gao, Y. Domainadaptor: A novel approach to test-time adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 18971–18981. [Google Scholar]
  42. Chen, K.; Gong, T.; Zhang, L. Camera-Aware Recurrent Learning and Earth Mover’s Test-Time Adaption for Generalizable Person Re-Identification. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 357–370. [Google Scholar] [CrossRef]
  43. Wang, Y.; Chen, H.; Heng, Q.; Hou, W.; Fan, Y.; Wu, Z.; Wang, J.; Savvides, M.; Shinozaki, T.; Raj, B.; et al. Freematch: Self-adaptive thresholding for semi-supervised learning. arXiv 2022, arXiv:2205.07246. [Google Scholar]
  44. Yuan, L.; Xie, B.; Li, S. Robust test-time adaptation in dynamic scenarios. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 15922–15932. [Google Scholar]
  45. Shi, B.; Zhang, D.; Dai, Q.; Zhu, Z.; Mu, Y.; Wang, J. Informative dropout for robust representation learning: A shape-bias perspective. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; PMLR: San Diego, CA, USA, 2020; pp. 8828–8839. [Google Scholar]
  46. Klinker, F. Exponential moving average versus moving exponential average. Math. Semesterber. 2011, 58, 97–107. [Google Scholar]
  47. Dinh, L.; Pascanu, R.; Bengio, S.; Bengio, Y. Sharp minima can generalize for deep nets. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; PMLR: San Diego, CA, USA, 2017; pp. 1019–1028. [Google Scholar]
  48. Yang, S.; Wu, J.; Liu, J.; Li, X.; Zhang, Q.; Pan, M.; Zhang, S. Exploring sparse visual prompt for cross-domain semantic segmentation. arXiv 2023, arXiv:2303.09792. [Google Scholar]
  49. Sakaridis, C.; Dai, D.; Van Gool, L. ACDC: The adverse conditions dataset with correspondences for semantic driving scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10765–10775. [Google Scholar]
  50. Li, Y.; Wang, N.; Shi, J.; Liu, J.; Hou, X. Revisiting batch normalization for practical domain adaptation. arXiv 2016, arXiv:1603.04779. [Google Scholar] [CrossRef]
  51. Chakrabarty, G.; Sreenivas, M.; Biswas, S. Sata: Source anchoring and target alignment network for continual test time adaptation. arXiv 2023, arXiv:2304.10113. [Google Scholar] [CrossRef]
  52. Zagoruyko, S.; Komodakis, N. Wide residual networks. arXiv 2016, arXiv:1605.07146. [Google Scholar]
  53. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar]
  54. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
Figure 1. An illustration of the relationship between channels and unstable representations: (a) The instability of channel representations increases with the level of corruption. This indicates that domain shifts impact the channel outputs, and the stronger the domain shift, the greater the instability of the channel outputs. (b) The instability of representations varies across different channels. Domain-robust channels exhibit stable representations with smaller variance, typically concentrated on the left, suggesting that these channels have learned more domain-invariant knowledge, making them resilient to data distribution shifts and less prone to changes. In contrast, domain-sensitive channels show unstable representations with larger variance, typically concentrated on the right, as they have learned domain-specific knowledge, making them vulnerable to data distribution shifts and more susceptible to changes.
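To make the measurement behind Figure 1 concrete, the instability of a channel's representation can be estimated, for example, as the standard deviation of its spatially pooled activations over target-domain samples. The following PyTorch sketch is only an illustration of this idea, not the authors' exact protocol; model, layer, and loader are hypothetical placeholders supplied by the reader.
```python
import torch

@torch.no_grad()
def channel_instability(model, layer, loader, device="cpu"):
    """Per-channel representation instability, estimated as the standard
    deviation of spatially pooled activations over target-domain samples.
    Channels with a large std behave as domain-sensitive; channels with a
    small std behave as domain-robust (cf. Figure 1)."""
    pooled = []

    def hook(_module, _inputs, output):
        # Global-average-pool the spatial dimensions: (B, C, H, W) -> (B, C)
        pooled.append(output.mean(dim=(2, 3)).cpu())

    handle = layer.register_forward_hook(hook)
    model.eval().to(device)
    for images, _ in loader:
        model(images.to(device))
    handle.remove()

    responses = torch.cat(pooled, dim=0)  # (num_samples, num_channels)
    return responses.std(dim=0)           # one instability value per channel
```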
Figure 2. Decision boundary lacks constraints. Consider a domain shift occurring between two adjacent time points, t and t + 1 . Features that are clustered near the decision boundary at time t are subjected to a stronger domain shift, causing them to cross the decision boundary and manifest as features at time t + 1 .
Figure 3. Relationship between the gradient norm and the instability of channel representations.
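Figure 3 compares the gradient norms observed during adaptation against this channel instability. A minimal sketch of how such per-channel gradient magnitudes might be collected from the BatchNorm affine parameters is shown below; restricting the statistic to BatchNorm weights and biases is an assumption made for illustration and need not match the paper's exact measurement.
```python
import torch

def bn_channel_gradient_norms(model):
    """Per-channel gradient magnitudes of BatchNorm affine parameters,
    collected after loss.backward() has been called on a target batch.
    Sketch only; the paper's exact statistic may differ."""
    norms = {}
    for name, module in model.named_modules():
        if (isinstance(module, torch.nn.BatchNorm2d)
                and module.weight is not None
                and module.weight.grad is not None):
            # One value per channel, combining scale (gamma) and shift (beta) gradients.
            norms[name] = (module.weight.grad.abs() + module.bias.grad.abs()).detach().cpu()
    return norms
```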
Figure 4. An illustrative example showcasing our motivation and solution: (a) During continuous domain adaptation, features are prone to cluster around or cross the decision boundary. (b) During domain shifts, features clustered at the decision boundary are prone to crossing it. (c) The virtual decision boundary forces the generated features away from the original decision boundary, creating a virtual margin that prevents features from crossing the decision boundary during domain shifts.
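One common way to realize the virtual decision boundary illustrated in Figure 4 is a margin-style objective that lowers the logit of the currently predicted class, so a feature keeps its prediction only if it lies at least a margin m away from the original boundary. The snippet below is a hedged sketch of such a loss using the model's own pseudo-labels; the function name and the plain cross-entropy form are illustrative assumptions, not necessarily the paper's exact formulation.
```python
import torch
import torch.nn.functional as F

def virtual_margin_loss(logits, margin=2.0):
    """Subtract a virtual margin from the logit of the predicted (pseudo-label)
    class before the cross-entropy, pushing features away from the original
    decision boundary. Illustrative sketch only."""
    pseudo = logits.argmax(dim=1)                                   # model pseudo-labels
    one_hot = F.one_hot(pseudo, num_classes=logits.size(1)).float()
    adjusted = logits - margin * one_hot                            # penalize the predicted class by `margin`
    return F.cross_entropy(adjusted, pseudo)
```
Tables 8 and 9 explore how the choice of m, fixed or dynamic, affects a margin of this kind.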
Figure 5. Channel standard deviations across all domains.
Figure 6. Channel standard deviations in the first layer.
Figure 7. Domain robustness score.
Figure 8. Visualization on the Gaussian domain of the CIFAR10C dataset.
Figure 9. Intra-class distance across all domains.
Figure 10. Inter-class distance across all domains.
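The intra- and inter-class distances plotted in Figures 9 and 10 are typically computed from penultimate-layer features and per-class centroids. The sketch below shows one common definition, assuming features and (pseudo-)labels have already been extracted; the paper's exact definition may differ.
```python
import torch

def class_distances(features, labels):
    """Intra-class distance: mean distance of samples to their class centroid.
    Inter-class distance: mean pairwise distance between class centroids.
    A common definition, given as a sketch (cf. Figures 9 and 10)."""
    classes = labels.unique()
    centroids = torch.stack([features[labels == c].mean(dim=0) for c in classes])

    intra = torch.stack([
        (features[labels == c] - centroids[i]).norm(dim=1).mean()
        for i, c in enumerate(classes)
    ]).mean()

    pairwise = torch.cdist(centroids, centroids)                  # includes the zero diagonal
    inter = pairwise.sum() / (len(classes) * (len(classes) - 1))  # mean over off-diagonal pairs
    return intra, inter
```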
Figure 11. Number of errors at the beginning of each domain.
Figure 12. Memory usage for different methods on various datasets.
Table 1. Classification error rate (%) for the gradual CIFAR10-to-CIFAR10C task. The best results in each column are highlighted in bold.
Dataset | Source | BN [50] | TENT [12] | CoTTA [5] | Ours
CIFAR10C (Error %) | 24.8 | 13.7 | 30.7 | 10.4 | 8.3
Table 2. Classification error rate (%) for the standard CIFAR10-to-CIFAR10C Continual Test-Time Adaptation task. The best results in each column are highlighted in bold.
Method | Gaussian | Shot | Impulse | Defocus | Glass | Motion | Zoom | Snow | Frost | Fog | Brightness | Contrast | Elastic | Pixelate | Jpeg | Mean
Source | 72.3 | 65.7 | 72.9 | 46.9 | 54.3 | 34.8 | 42.0 | 25.1 | 41.3 | 26.0 | 9.3 | 46.7 | 26.6 | 58.5 | 30.3 | 43.5
BN [50] | 28.1 | 26.1 | 36.3 | 12.8 | 35.3 | 14.2 | 12.1 | 17.3 | 17.4 | 15.3 | 8.4 | 12.6 | 23.8 | 19.7 | 27.3 | 20.4
TENT [12] | 24.8 | 20.6 | 28.6 | 14.4 | 31.1 | 16.5 | 14.1 | 19.1 | 18.6 | 18.6 | 12.2 | 20.3 | 25.7 | 20.8 | 24.9 | 20.7
CoTTA [5] | 24.3 | 21.3 | 26.6 | 11.6 | 27.6 | 12.2 | 10.3 | 14.8 | 14.1 | 12.4 | 7.5 | 10.6 | 18.3 | 13.4 | 17.3 | 16.2
RoTTA [44] | 30.3 | 25.4 | 34.6 | 18.3 | 34.0 | 14.7 | 11.0 | 16.4 | 14.6 | 14.0 | 8.0 | 12.4 | 20.3 | 16.8 | 19.4 | 19.3
RMT [6] | 24.1 | 20.2 | 25.7 | 13.2 | 25.5 | 14.7 | 12.8 | 16.2 | 15.4 | 14.6 | 10.8 | 14.0 | 18.0 | 14.1 | 16.6 | 17.0
PETAL [7] | 23.7 | 21.4 | 26.3 | 11.8 | 28.8 | 12.4 | 10.4 | 14.8 | 13.9 | 12.6 | 7.4 | 10.6 | 18.3 | 13.1 | 17.1 | 16.2
SATA [51] | 23.9 | 20.1 | 28.0 | 11.6 | 27.4 | 12.6 | 10.2 | 14.1 | 13.2 | 12.2 | 7.4 | 10.3 | 19.1 | 13.3 | 18.5 | 16.1
DSS [8] | 24.1 | 21.3 | 25.4 | 11.7 | 26.9 | 12.2 | 10.5 | 14.5 | 14.1 | 12.5 | 7.8 | 10.8 | 18.0 | 13.1 | 17.3 | 16.0
Reshaping [10] | 23.6 | 19.9 | 26.0 | 11.8 | 25.3 | 13.2 | 10.9 | 14.3 | 13.5 | 12.7 | 9.0 | 11.9 | 17.1 | 12.7 | 15.9 | 15.8
Ours | 20.1 | 16.5 | 23.4 | 11.2 | 24.1 | 12.6 | 10.3 | 14.3 | 13.6 | 12.1 | 7.3 | 10.9 | 18.2 | 12.9 | 17.0 | 14.5
Table 3. Classification error rate (%) for the standard CIFAR100-to-CIFAR100C Continual Test-Time Adaptation task. The best results in each column are highlighted in bold.
Method | Gaussian | Shot | Impulse | Defocus | Glass | Motion | Zoom | Snow | Frost | Fog | Brightness | Contrast | Elastic | Pixelate | Jpeg | Mean
Source | 73.0 | 68.0 | 39.4 | 29.3 | 54.1 | 30.8 | 28.8 | 39.5 | 45.8 | 50.3 | 29.5 | 55.1 | 37.2 | 74.7 | 41.2 | 46.4
BN [50] | 42.1 | 40.7 | 42.7 | 27.6 | 41.9 | 29.7 | 27.9 | 34.9 | 35.0 | 41.5 | 26.5 | 30.3 | 35.7 | 32.9 | 41.2 | 35.4
TENT [12] | 37.2 | 35.8 | 41.7 | 37.9 | 51.2 | 48.3 | 48.5 | 58.4 | 63.7 | 71.1 | 70.4 | 82.3 | 88.0 | 88.5 | 90.4 | 60.9
CoTTA [5] | 40.1 | 37.7 | 39.7 | 26.9 | 38.0 | 27.9 | 26.4 | 32.8 | 31.8 | 40.3 | 24.7 | 26.9 | 32.5 | 28.3 | 33.5 | 32.5
RoTTA [44] | 49.1 | 44.9 | 45.5 | 30.2 | 42.7 | 29.5 | 26.1 | 32.2 | 30.7 | 37.5 | 24.7 | 29.1 | 32.6 | 30.4 | 36.7 | 34.8
RMT [6] | 40.2 | 36.2 | 36.0 | 27.9 | 33.9 | 28.4 | 26.4 | 28.7 | 28.8 | 31.1 | 25.5 | 27.1 | 28.0 | 26.6 | 29.0 | 30.2
PETAL [7] | 38.3 | 36.4 | 38.6 | 25.9 | 36.8 | 27.3 | 25.4 | 32.0 | 30.8 | 38.7 | 24.4 | 26.4 | 31.5 | 26.9 | 32.5 | 31.5
SATA [51] | 36.5 | 33.1 | 35.1 | 25.9 | 34.9 | 27.7 | 25.4 | 29.5 | 29.9 | 33.1 | 24.1 | 26.7 | 31.9 | 27.5 | 35.2 | 30.3
DSS [8] | 39.7 | 36.0 | 37.2 | 26.3 | 35.6 | 27.5 | 25.1 | 31.4 | 30.0 | 37.8 | 24.2 | 26.0 | 30.0 | 26.3 | 31.1 | 30.9
Reshaping [10] | 38.8 | 35.0 | 35.4 | 26.7 | 33.2 | 27.4 | 25.0 | 27.4 | 26.8 | 29.8 | 24.1 | 25.1 | 26.9 | 24.9 | 28.0 | 29.0
Ours | 33.9 | 32.6 | 31.8 | 25.4 | 30.2 | 26.7 | 24.8 | 28.9 | 24.3 | 29.7 | 23.4 | 25.1 | 26.5 | 23.3 | 30.6 | 27.8
Table 4. Classification error rate (%) for the standard ImageNet-to-ImageNetC Continual Test-Time Adaptation task. The best results in each column are highlighted in bold.
Method | Gaussian | Shot | Impulse | Defocus | Glass | Motion | Zoom | Snow | Frost | Fog | Brightness | Contrast | Elastic | Pixelate | Jpeg | Mean
Source | 95.3 | 95.0 | 95.3 | 86.1 | 91.9 | 87.4 | 77.9 | 85.1 | 79.9 | 79.0 | 45.4 | 96.2 | 86.6 | 77.5 | 66.1 | 83.0
BN [50] | 87.7 | 87.4 | 87.8 | 88.0 | 87.7 | 78.3 | 63.9 | 67.4 | 70.3 | 54.7 | 36.4 | 88.7 | 58.0 | 56.6 | 67.0 | 72.0
TENT [12] | 81.6 | 74.6 | 72.7 | 77.6 | 73.8 | 65.5 | 55.3 | 61.6 | 63.0 | 51.7 | 38.2 | 72.1 | 50.8 | 47.4 | 53.3 | 62.6
CoTTA [5] | 84.7 | 82.1 | 80.6 | 81.3 | 79.0 | 68.6 | 57.5 | 60.3 | 60.5 | 48.3 | 36.6 | 66.1 | 47.2 | 41.2 | 46.0 | 62.7
RoTTA [44] | 88.3 | 82.8 | 82.1 | 91.3 | 83.7 | 72.9 | 59.4 | 66.2 | 64.3 | 53.3 | 35.6 | 74.5 | 54.3 | 48.2 | 52.6 | 67.3
RMT [6] | 79.9 | 76.3 | 73.1 | 75.7 | 72.9 | 64.7 | 56.8 | 56.4 | 58.3 | 49.0 | 40.6 | 58.2 | 47.8 | 43.7 | 44.8 | 59.9
PETAL [7] | 87.4 | 85.8 | 84.4 | 85.0 | 83.9 | 74.4 | 63.1 | 63.5 | 64.0 | 52.4 | 40.0 | 74.0 | 51.7 | 45.2 | 51.0 | 67.1
SATA [51] | 74.1 | 72.9 | 71.6 | 75.7 | 74.1 | 64.2 | 55.5 | 55.6 | 62.9 | 46.6 | 36.1 | 69.9 | 50.6 | 44.3 | 48.5 | 60.1
DSS [8] | 84.6 | 80.4 | 78.7 | 83.9 | 79.8 | 74.9 | 62.9 | 62.8 | 62.9 | 49.7 | 37.4 | 71.0 | 49.5 | 42.9 | 48.2 | 64.6
Reshaping [10] | 78.5 | 75.3 | 73.0 | 75.7 | 73.1 | 64.5 | 56.0 | 55.8 | 58.1 | 47.6 | 38.5 | 58.5 | 46.1 | 42.0 | 43.4 | 59.0
Ours | 72.2 | 70.7 | 68.3 | 75.9 | 71.3 | 62.0 | 54.8 | 55.2 | 61.3 | 43.3 | 40.5 | 61.2 | 48.6 | 42.1 | 41.7 | 57.9
Table 5. Semantic segmentation results (mIoU in %) on the Cityscapes-to-ACDC CTTA task. The four test conditions are repeated ten times to evaluate the long-term adaptation performance. The best results in each column are highlighted in bold.
Time →
Round | 1 | 4 | 7 | 10 | All
Condition | Fog | Night | Rain | Snow | Mean | Fog | Night | Rain | Snow | Mean | Fog | Night | Rain | Snow | Mean | Fog | Night | Rain | Snow | Mean | Mean ↑
Source | 69.1 | 40.3 | 59.7 | 57.8 | 56.7 | 69.1 | 40.3 | 59.7 | 57.8 | 56.7 | 69.1 | 40.3 | 59.7 | 57.8 | 56.7 | 69.1 | 40.3 | 59.7 | 57.8 | 56.7 | 56.7
BN | 62.3 | 38.0 | 54.6 | 53.0 | 52.0 | 62.3 | 38.0 | 54.6 | 53.0 | 52.0 | 62.3 | 38.0 | 54.6 | 53.0 | 52.0 | 62.3 | 38.0 | 54.6 | 53.0 | 52.0 | 52.0
TENT [12] | 69.0 | 40.2 | 60.1 | 57.3 | 56.7 | 66.5 | 36.3 | 58.7 | 54.0 | 53.9 | 64.2 | 32.8 | 55.3 | 50.9 | 50.8 | 61.8 | 29.8 | 51.9 | 47.8 | 47.8 | 52.3
CoTTA [5] | 70.9 | 41.2 | 62.4 | 59.7 | 58.6 | 70.9 | 41.2 | 62.4 | 59.7 | 58.6 | 70.9 | 41.2 | 62.4 | 59.7 | 58.6 | 70.9 | 41.2 | 62.4 | 59.7 | 58.6 | 58.6
Reshaping [10] | 71.2 | 42.3 | 65.0 | 62.0 | 60.1 | 72.8 | 43.6 | 66.7 | 63.3 | 61.6 | 72.5 | 42.5 | 66.8 | 63.3 | 61.3 | 72.5 | 42.9 | 66.7 | 63.0 | 61.3 | 61.3
Ours | 71.8 | 43.1 | 65.2 | 62.3 | 60.6 | 72.6 | 44.6 | 66.7 | 63.5 | 61.9 | 72.8 | 43.5 | 67.8 | 63.4 | 61.9 | 73.1 | 43.9 | 67.3 | 64.0 | 62.1 | 61.7
Table 6. Ablation: Contribution of our proposed VDB and DSP. The best results in each column are highlighted in bold.
VDB | DSP | CIFAR10C | CIFAR100C | ImageNetC
– | – | 16.2% | 32.5% | 62.7%
✓ | – | 15.4% | 30.1% | 60.3%
– | ✓ | 15.1% | 28.3% | 59.2%
✓ | ✓ | 14.5% | 27.8% | 57.9%
Table 7. Integration with existing methods. Our method can be seamlessly integrated with other CTTA methods to boost performance. Values in parentheses indicate the error-rate reduction relative to the corresponding baseline.
Method | CIFAR10C | CIFAR100C | ImageNetC
TENT+ours | 18.9% (+1.8%) | 56.6% (+4.3%) | 65.3% (+6.7%)
CoTTA+ours | 14.2% (+2.0%) | 26.9% (+5.6%) | 57.8% (+4.8%)
RMT+ours | 15.4% (+1.6%) | 28.7% (+1.5%) | 57.4% (+2.5%)
DSS+ours | 14.1% (+1.7%) | 26.1% (+4.8%) | 58.5% (+6.1%)
Table 8. Performance results (error rate %) with different values of m on CIFAR10C, CIFAR100C, and ImageNetC datasets.
Dataset | m = 1 | m = 2 | m = 3 | m = 4 | m = 5 | m = 6 | m = 7 | m = 8
CIFAR10C | 14.8 | 14.5 | 14.7 | 14.9 | 15.3 | 15.5 | 16.1 | 16.8
CIFAR100C | 27.9 | 27.8 | 28.2 | 28.3 | 28.4 | 28.7 | 29.5 | 30.1
ImageNetC | 58.1 | 57.9 | 58.1 | 58.3 | 58.6 | 59.1 | 59.7 | 60.9
Table 9. Performance results (error rate %) with different values of fixed m and dynamic m on CIFAR10C, CIFAR100C, and ImageNetC datasets.
Dataset | m = 1 | m = 2 | m = 3 | m = 4 | m = 5 | m = 6 | m = 7 | m = 8 | m = Dynamic
CIFAR10C | 15.3 | 15.2 | 14.9 | 15.5 | 15.9 | 16.3 | 16.7 | 17.3 | 14.5
CIFAR100C | 28.2 | 28.1 | 28.5 | 28.9 | 29.3 | 29.7 | 30.2 | 31.1 | 27.8
ImageNetC | 58.7 | 58.2 | 58.5 | 58.9 | 59.3 | 59.9 | 60.3 | 61.2 | 57.9
Table 10. Time required for different methods on CIFAR10C, CIFAR100C, and ImageNetC datasets.
Dataset | TENT | CoTTA | DSS | RMT | Ours
CIFAR10C | 7 min | 15 min | 16 min | 15 min | 19 min
CIFAR100C | 9 min | 17 min | 19 min | 18 min | 23 min
ImageNetC | 33 min | 71 min | 73 min | 72 min | 86 min
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
