1. Introduction
A primary challenge in the development of artificial intelligence systems is ensuring that models maintain reliable performance under conditions that differ from those observed during training [1]. Such discrepancies may arise due to changes in the operational environment, variations in acquisition devices, or differences in user population characteristics [2]. These shifts, though often subtle, can significantly impact model behavior and compromise generalization capabilities, even when the underlying task remains unchanged [3]. This vulnerability becomes particularly critical in real-world applications, where it is infeasible to anticipate all possible future scenarios, thus limiting the scalability and trustworthiness of deployed solutions [4]. In this context, validation within the source domain alone proves insufficient to guarantee consistent performance in heterogeneous settings, prompting the development of strategies to mitigate such discrepancies. Among these, domain adaptation has emerged as a key approach, enabling the reuse of pretrained models in new environments by aligning distributions across domains, thereby reducing the need for extensive data collection and annotation in the target domain [5]. This alignment not only enhances the efficiency of knowledge transfer, but also supports the creation of more robust and sustainable systems in dynamic and uncertain environments.
Despite the progress achieved through domain adaptation, the problem of generalizing to unseen domains remains only partially resolved. Domain shifts can take complex forms that go beyond marginal discrepancies, affecting the internal structure of learned representations and leading to systematic performance degradation in the target domain [6]. Consequently, adapted models frequently exhibit degraded or inconsistent performance when deployed in unfamiliar environments, especially under shifts in input distributions that are structural and semantic in nature [7]. First, this limitation arises from the inability to preserve domain-invariant features under covariate shift, where noise in input features, biased samples, or insufficient representations can degrade the alignment across domains and compromise the stability of the learned models [8]. Second, generalization is further hindered when the learned features lack discriminative power, particularly in the presence of concept shift and noisy labels; these factors distort latent representations and decision boundaries, making it difficult to maintain semantic clarity in the target domain [9]. Third, the absence of interpretability mechanisms impedes a reliable assessment of whether predictions are based on meaningful semantic signals or on spurious correlations inherited from the source domain [10]. Collectively, these challenges hinder the development of domain-adaptive systems that are simultaneously accurate, robust, and interpretable.
In response to the challenges inherent in domain adaptation, numerous classical approaches have been proposed, most of which rely on linear transformations to align source and target distributions. These strategies aim to mitigate distributional discrepancies through statistical alignment techniques. Methods such as Correlation Alignment (CORAL) and Subspace Alignment (SA) reduce marginal discrepancy by aligning covariance matrices or projecting data onto orthonormal subspaces [11,12]. Despite their effectiveness under controlled conditions, their reliance on original feature spaces or linear projections makes them susceptible to distortions, noise, and domain-specific biases, hindering the extraction of invariant representations [13]. To address these limitations, geometrically inspired extensions such as Geometric Transfer Learning (GTL) have been developed, incorporating structural constraints between domains [14]. Nonetheless, they depend on linear subspace representations, which fail to adequately preserve the support of the target domain in the presence of data heterogeneity or limited representational capacity [15]. In addition, techniques such as Transfer Joint Matching (TJM), Transfer Component Analysis (TCA), and Maximum Independence Domain Adaptation (MIDA) seek to align both marginal and conditional distributions via linear projections [16,17,18]. Yet, they do not guarantee class separability in the latent space, particularly under concept shift or class imbalance, resulting in ambiguous decision boundaries and diminished discriminative performance [19]. A comparable deficiency is observed in Joint Distribution Adaptation (JDA), which, despite modeling joint alignment, assumes uniform relevance across classes and lacks adaptive mechanisms to address intra-class heterogeneity or instance-level significance [20].
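To make the covariance-alignment idea behind CORAL concrete, the following sketch (our own illustration, not code from the reviewed works) whitens the source features and re-colors them with the target covariance; all function and variable names are illustrative assumptions.

```python
import numpy as np

def coral_transform(Xs: np.ndarray, Xt: np.ndarray, eps: float = 1.0) -> np.ndarray:
    """Illustrative CORAL-style alignment: whiten the source features, then
    re-color them with the target covariance (second-order statistics only)."""
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(Xs.shape[1])  # regularized source covariance
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(Xt.shape[1])  # regularized target covariance
    # Inverse square root of Cs and square root of Ct via eigendecomposition
    ws, Vs = np.linalg.eigh(Cs)
    wt, Vt = np.linalg.eigh(Ct)
    Cs_inv_sqrt = Vs @ np.diag(ws ** -0.5) @ Vs.T
    Ct_sqrt = Vt @ np.diag(wt ** 0.5) @ Vt.T
    return Xs @ Cs_inv_sqrt @ Ct_sqrt  # aligned source features

# Usage: Xs_aligned = coral_transform(Xs, Xt); any classifier can then be trained on (Xs_aligned, ys).
```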
Due to the structural constraints of traditional domain adaptation techniques, particularly the decoupling of feature transformation and prediction phases, deep learning methods have emerged as a more cohesive solution for preserving domain-invariant features across the representation space [21]. These approaches leverage the expressive capabilities of deep neural networks to jointly optimize feature extraction and domain alignment, enhancing adaptability under covariate shift [22]. Adversarial training-based models, including Domain-Adversarial Neural Networks (DANNs) and their extensions, have demonstrated considerable effectiveness in aligning marginal distributions within a shared latent space [23,24]. Still, while these methods reduce global disparities, they often struggle to maintain class separability, as they do not explicitly model conditional structures or discriminative boundaries [25]. To overcome these limitations, hybrid models have emerged that integrate deep learning architectures with statistical alignment objectives, enabling end-to-end optimization for improved domain adaptation performance [26]. These approaches aim to preserve both predictive accuracy and domain invariance by combining supervised losses with the minimization of statistical discrepancies across multiple network layers [27,28]. However, hybrid methods also face challenges, such as gradient conflicts between classification and alignment objectives and semantic misalignment caused by noisy pseudo-labels [29]. In parallel, self-supervised learning (SSL) has been introduced into domain adaptation pipelines to alleviate the dependence on labeled target data, typically by leveraging contrastive objectives to learn transferable features without explicit supervision [30,31,32]. More recently, foundation models (large-scale pretrained architectures with broad generalization capacity) have opened new avenues for adaptation by employing mechanisms such as prompt tuning, adapter modules, or domain-specific fine-tuning [33,34]. While these strategies show promise, their deployment in the presence of domain shift remains constrained by semantic misalignment and high computational cost [35]. Although deep learning has significantly advanced the extraction of domain-invariant features, ensuring discriminative consistency and semantic alignment in the target domain remains a critical challenge [36].
Despite notable advances in deep learning techniques designed to extract domain-invariant features, many of these methods struggle to maintain a discriminative class structure within the target domain [21,22]. To address this, transfer-based strategies, such as fine-tuning, teacher–student models, meta-learning frameworks, and asymmetric architectures like Adversarial Discriminative Domain Adaptation (ADDA), have been introduced to enhance inter-class separation through adaptive training or auxiliary supervision [25,37,38,39]. However, these methods often suffer from limitations, including the degradation of pretrained representations, sensitivity to noise [40,41], and the absence of explicit modeling of class boundaries, particularly in ADDA variants [42]. Conditional alignment techniques, such as Conditional Adversarial Domain Adaptation (CDAN), address part of this shortcoming by incorporating classifier outputs into the discriminator, thereby capturing class-conditional dependencies [43]. Nonetheless, they remain vulnerable to class imbalance and low-confidence predictions, which can lead to distorted decision boundaries [36]. In response to these challenges, information-theoretic approaches have emerged as a complementary paradigm, optimizing transfer through objectives based on mutual information or entropy [44,45]. By leveraging strategies such as entropy minimization and the information bottleneck principle, these methods regularize latent representations, thereby mitigating overfitting on the source domain and improving generalization under target shift [46,47,48].
In addition to generalization and discriminability, interpretability has become a pivotal aspect of domain adaptation, especially in high-stakes applications where understanding model behavior is essential for fostering trust, transparency, and accountability [49]. In this context, latent space analysis has proven valuable for examining the structure of learned representations. Linear techniques such as Principal Component Analysis (PCA) offer computational efficiency but fall short in capturing the nonlinear relationships relevant across multiple domains [50]. In contrast, nonlinear methods like t-distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) are more effective in representing complex inter-domain structures [51]. UMAP, in particular, stands out for its ability to preserve both local and global structures, maintain stability under parameter variation, and scale efficiently, making it especially useful for visualizing semantic alignment across domains [52,53]. Beyond latent-space visualization, post hoc attribution methods offer complementary explanations in sensitive applications: Gradient-weighted Class Activation Mapping (Grad-CAM) generates attention maps that highlight regions influencing model predictions, while its extension, Grad-CAM++, improves spatial resolution through higher-order derivatives, though it remains limited by nonlinear activation functions [54,55,56]. In domain adaptation, Grad-CAM++ has proven effective not only as an explainability tool but also for visually assessing semantic consistency across domains [57]. Other approaches, such as Layer-wise Relevance Propagation (LRP) and SHapley Additive exPlanations (SHAP), provide quantitative insights by assigning relevance scores to input features, aiding the identification of spurious patterns or conflicting decision rules [58]. Nevertheless, the lack of interpretability methods specifically designed for transfer learning and domain adaptation remains a significant limitation, highlighting the need for more robust explanatory tools tailored to cross-domain scenarios [59].
Here, we propose Conditional Rényi-Entropy Domain Adaptation (CREDA), a novel domain adaptation framework designed to simultaneously preserve domain-invariant representations, enforce class-conditional alignment, and mitigate the effect of noisy pseudo-labels. The core idea of CREDA is to regularize deep feature alignment using a differentiable, matrix-based formulation of Rényi’s quadratic entropy, which provides a non-parametric and robust estimate of class-wise distributional similarity. CREDA is implemented as an end-to-end trainable architecture comprising three key stages:
- Deep Feature Extraction: A shared ResNet-18 backbone encodes samples from both source and target domains into a latent representation space.
- Noise-Aware Label Weighting: An entropy-derived confidence score is used to down-weight low-confidence pseudo-labels in the target domain, improving robustness against noisy or ambiguous predictions.
- Class-Conditional Alignment via Rényi-Based Entropy: A novel entropy-based regularization term is applied over kernel Gram matrices to minimize divergence between class-wise source and target feature distributions.
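As a rough illustration of the third stage, the sketch below builds Gaussian kernel matrices over source and target feature batches of a single class and combines their Rényi quadratic information potentials into a Cauchy-Schwarz-style divergence. It is a simplified stand-in for the formulation detailed in Section 2; the function names, the fixed bandwidth, and the per-class aggregation are our own illustrative assumptions.

```python
import torch

def gaussian_gram(x: torch.Tensor, y: torch.Tensor, sigma: float) -> torch.Tensor:
    """Gaussian kernel matrix K[i, j] = exp(-||x_i - y_j||^2 / (2 * sigma^2))."""
    d2 = torch.cdist(x, y) ** 2
    return torch.exp(-d2 / (2.0 * sigma ** 2))

def renyi_alignment(zs: torch.Tensor, zt: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Cauchy-Schwarz-style divergence built from Renyi quadratic information
    potentials; a differentiable proxy for class-conditional alignment between
    source (zs) and target (zt) features of one class."""
    v_ss = gaussian_gram(zs, zs, sigma).mean()   # source information potential
    v_tt = gaussian_gram(zt, zt, sigma).mean()   # target information potential
    v_st = gaussian_gram(zs, zt, sigma).mean()   # cross information potential
    # -log(V) corresponds to an estimate of Renyi's quadratic entropy of each set
    return -2.0 * torch.log(v_st) + torch.log(v_ss) + torch.log(v_tt)

# Usage sketch: sum renyi_alignment over classes (using confidence-weighted pseudo-labels
# on the target) and add the result, suitably weighted, to the supervised loss.
```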
We evaluate CREDA on three widely used visual domain adaptation benchmarks for image classification: Digits, ImageCLEF-DA, and Office-31. Additionally, we compare its performance against state-of-the-art methods, including DANN, ADDA, and CDAN+E, across various backbone architectures such as ResNet-18, ResNet-50, and Vision Transformers (ViT). The results consistently demonstrate that CREDA achieves superior performance in terms of classification accuracy, semantic alignment, and interpretability, yielding improvements in average accuracy across all benchmarks. Qualitative analyses using UMAP and Grad-CAM++ further confirm that CREDA maintains both inter-class separability and cross-domain semantic coherence, highlighting its potential for deployment in real-world, label-scarce environments.
The remainder of this paper is organized as follows: Section 2 introduces the materials and methods. Section 3 and Section 4 discuss the experiments and results. Finally, Section 5 outlines the concluding remarks.
3. Experimental Set-Up
To rigorously evaluate the effectiveness of the proposed CREDA framework for domain adaptation in image classification tasks, we present a comprehensive analysis that includes descriptions of the benchmark datasets, training protocols, comparative baselines, and quantitative and qualitative performance assessments.
3.1. Tested Datasets
To assess the effectiveness and robustness of the proposed domain adaptation method, we conducted extensive experiments on three widely recognized benchmark datasets commonly used in domain adaptation research. Each dataset encompasses visual domains exhibiting substantial distribution shifts, thereby providing a challenging setting for learning domain-invariant representations, as detailed below:
- Digits: This benchmark suite is designed for evaluating domain adaptation on digit recognition tasks, spanning both handwritten and natural-scene digits. It comprises three standard datasets: MNIST (M), a large database of handwritten digits; USPS (U), another handwritten digit set characterized by its lower resolution; and SVHN (S), which contains house numbers cropped from real-world street-level images [71]. Notably, the S domain is particularly challenging due to its significant variability in lighting, background clutter, and visual styles compared to M and U (see Figure 2).
- ImageCLEF-DA: This is a standard benchmark for unsupervised domain adaptation, organized as part of the ImageCLEF evaluation campaign. It comprises 12 common object classes shared across three distinct visual domains: Caltech-256 (C), ImageNet ILSVRC 2012 (I), and Pascal VOC 2012 (P); see Figure 3. Each domain contains 600 images, with a balanced distribution of 50 images per class [72]. All images are resized to a common fixed resolution.
- Office-31: This dataset consists of 4110 images across 31 object classes, sourced from three domains with distinct visual characteristics: Amazon (A), which features centered objects on a clean, white background under controlled lighting; Webcam (W), containing low-resolution images with typical noise and color artifacts; and DSLR (D), which includes high-resolution images with varying focus and lighting conditions [73]. Here, we selected a subset of ten shared classes (see Figure 4).
Together, these benchmarks allow evaluating the capacity of domain adaptation methods to generalize across diverse and challenging visual domains.
3.2. Assessment and Method Comparison
To comprehensively evaluate the impact of the feature extractor’s architecture on model performance, we experimented with three distinct backbones: a standard ResNet-18, its deeper counterpart ResNet-50, and a ViT. Each backbone is adapted for feature extraction in domain transfer tasks by removing its final classification layer. The primary baseline is a ResNet-18 convolutional backbone pretrained on ImageNet [74]. To tailor the architecture to our tasks, the final fully connected layer is removed, while all preceding convolutional and residual blocks are retained. This modification enables the extraction of high-level spatial representations that are robust and transferable across domains [75]. A comprehensive description of the ResNet-18 feature extractor’s architecture is provided in Table 1.
Afterward, to investigate the effect of network depth, we also employed a ResNet-50 backbone, a deeper and more powerful variant within the ResNet family [74]. ResNet-50 utilizes bottleneck residual blocks, which are more computationally efficient for deeper networks [76]. Similar to the ResNet-18 configuration, the model is pretrained on ImageNet, and its final fully connected layer is removed so that it serves as a feature extractor, yielding a 2048-dimensional feature vector. The detailed architecture is presented in Table 2.
Also, to explore an alternative architectural paradigm beyond convolutional networks, we incorporated a ViT-based model, specifically the vit_tiny_patch16_224 variant (termed ViT-Tiny) [77]. Unlike CNNs, ViT-Tiny processes images by splitting them into a sequence of fixed-size patches, which are then linearly embedded and fed into a standard Transformer encoder. For this study, we use a ViT-Tiny pretrained on ImageNet with an input resolution of 224 × 224 pixels. The classification head is discarded, and the output embedding of the special [CLS] token from the final Transformer block is used as the feature representation, yielding a 192-dimensional vector. The architecture is detailed in Table 3.
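For reference, the following sketch shows one plausible way to build the three headless feature extractors with torchvision and timm; the weight-loading arguments and the use of nn.Identity to drop the classification layers reflect standard library usage, but the exact construction in our code base may differ.

```python
import torch.nn as nn
import torchvision.models as tvm
import timm

def build_backbone(name: str = "resnet18") -> tuple[nn.Module, int]:
    """Return an ImageNet-pretrained backbone with its classification head
    removed, together with the dimensionality of the extracted features."""
    if name == "resnet18":
        model = tvm.resnet18(weights=tvm.ResNet18_Weights.IMAGENET1K_V1)
        feat_dim = model.fc.in_features          # 512
        model.fc = nn.Identity()                 # drop the final FC layer
    elif name == "resnet50":
        model = tvm.resnet50(weights=tvm.ResNet50_Weights.IMAGENET1K_V1)
        feat_dim = model.fc.in_features          # 2048
        model.fc = nn.Identity()
    elif name == "vit_tiny":
        # num_classes=0 removes the head; timm then returns the pooled [CLS] embedding
        model = timm.create_model("vit_tiny_patch16_224", pretrained=True, num_classes=0)
        feat_dim = model.num_features            # 192
    else:
        raise ValueError(f"Unknown backbone: {name}")
    return model, feat_dim
```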
Moreover, the following domain adaptation strategies are considered for comparison:
- Baseline: A straightforward approach trained exclusively on the source domain without any adaptation mechanism (see Figure 5). The optimization objective is to minimize the conventional supervised cross-entropy loss, which serves as a lower bound for performance evaluation under domain shift:
\[ \mathcal{L}_{\text{sup}} = -\frac{1}{n_s}\sum_{i=1}^{n_s}\sum_{c=1}^{C} \mathbb{1}\!\left[y_i^{s}=c\right]\log \hat{y}_{i,c}^{s}, \quad (24) \]
where $\hat{y}_{i,c}^{s}$ denotes the predicted probability of class $c$ for the $i$-th labeled source sample.
- DANN: The Domain-Adversarial Neural Network (DANN) [78] introduces a domain discriminator $D$, which is trained to distinguish source features from target ones (see Figure 6). The discriminator is implemented as a multi-layer neural network whose binary output indicates the predicted domain membership of each sample. Moreover, the feature extractor $F$ is simultaneously trained to produce features that fool the discriminator, thereby learning domain-invariant representations via a Gradient Reversal Layer (GRL); a minimal sketch of the GRL is given after this list. The overall objective is a minimax game:
\[ \min_{F,\,C}\;\max_{D}\;\; \mathcal{L}_{\text{sup}} - \lambda\,\mathcal{L}_{\text{adv}}(F, D), \quad (25) \]
where $\lambda$ represents a trade-off hyperparameter. The domain adversarial loss $\mathcal{L}_{\text{adv}}$ is the binary cross-entropy for domain classification, with source and target samples assigned opposite binary domain labels.
- ADDA: The Adversarial Discriminative Domain Adaptation (ADDA) framework [79] separates training into two distinct stages (see Figure 7). First, a source feature extractor $F_s$ and the classifier $C$ are trained using the supervised loss $\mathcal{L}_{\text{sup}}$ (see Equation (24)). In the second stage, the parameters of $F_s$ and $C$ are frozen. A new target feature extractor $F_t$, initialized with the weights of $F_s$, is then trained to fool the domain discriminator in a minimax game (see Equation (25)). The objective is to align the target feature distribution with the fixed source feature distribution.
- CDAN+E: The Conditional Domain Adversarial Network (CDAN) [80] enhances adversarial alignment by using a multilinear feature representation, $f \otimes \hat{y}$ (the outer product of the extracted features and the classifier predictions), as input to the domain discriminator $D$. The CDAN+E variant, as implemented in standard benchmarks, employs a sophisticated entropy-based mechanism that serves a dual purpose: it implements entropy minimization for the target domain while simultaneously weighting the adversarial loss to focus on more reliable samples, as seen in Figure 8.
Specifically, the Shannon entropy $H(\hat{y}) = -\sum_{c=1}^{C}\hat{y}_c \log \hat{y}_c$ is computed for the predictions $\hat{y}$ of all samples in a batch. This entropy value is then used in two ways. First, it is passed through a GRL, which implicitly creates an entropy minimization objective for the feature extractor, encouraging it to produce more confident (low-entropy) predictions. Second, the entropy is transformed into a sample-wise weight, as follows:
\[ w(x) = 1 + e^{-H(\hat{y}(x))}. \quad (26) \]
This weighting scheme gives greater importance to samples with confident predictions (low entropy), thereby focusing the adversarial alignment on well-structured regions of the feature space (see the sketch after this list). The resulting weighted conditional adversarial loss, $\mathcal{L}_{\text{adv}}^{w}$, is then defined as follows:
\[ \mathcal{L}_{\text{adv}}^{w} = -\,\mathbb{E}_{x_s\sim\mathcal{D}_s}\!\big[w(x_s)\log D(f_s \otimes \hat{y}_s)\big] \;-\; \mathbb{E}_{x_t\sim\mathcal{D}_t}\!\big[w(x_t)\log\!\big(1 - D(f_t \otimes \hat{y}_t)\big)\big], \]
where both $w(x_s)$ and $w(x_t)$ are calculated according to Equation (26). The total loss for the CDAN+E framework can thus be expressed as the combination of the supervised loss and this integrated adversarial and entropy-regularized objective (see Equation (25)).
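As referenced in the DANN and CDAN+E descriptions above, the following sketch shows a minimal gradient reversal layer and the entropy-based sample weighting. It is a simplified illustration using standard PyTorch autograd, not the exact code of the compared implementations.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)

def entropy_weight(probs: torch.Tensor) -> torch.Tensor:
    """CDAN+E-style sample weights w(x) = 1 + exp(-H(y_hat)); low-entropy
    (confident) predictions receive larger weights."""
    h = -(probs * torch.log(probs + 1e-8)).sum(dim=1)  # Shannon entropy per sample
    return 1.0 + torch.exp(-h)

# Usage sketch: reversed_feats = grad_reverse(features, lambd); feed them to the domain
# discriminator, and weight its per-sample BCE loss with entropy_weight(probs).
```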
Overall, two main components are employed depending on the training objective: a label classifier for supervised task learning and a domain discriminator for adversarial domain adaptation. Namely, the label classifier transforms the feature vector of dimension $d$, produced by the backbone, into a vector of $C$ class logits. The value of $d$ depends on the specific feature extractor employed (e.g., 512 for ResNet-18, 2048 for ResNet-50, and 192 for ViT-Tiny). The corresponding architecture is presented in Table 4.
In adversarial training, a domain discriminator is employed to differentiate between source and target samples, thereby promoting domain-invariant feature extraction. Its input dimension is determined by the underlying method: DANN and ADDA use the $d$-dimensional feature vector directly, while CDAN+E utilizes the outer product between features and class predictions, yielding an input dimension of $d \times C$. The architecture, which mirrors the general structure of the label classifier, is detailed in Table 5.
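Although the exact layer sizes are given in Tables 4 and 5, the sketch below illustrates the general structure just described: a small MLP head mapping $d$-dimensional features to $C$ logits, and a discriminator of mirrored shape whose input dimension depends on the adaptation method. The hidden width and dropout value are illustrative placeholders, not the tuned values.

```python
import torch.nn as nn

def make_head(in_dim: int, out_dim: int, hidden: int = 256) -> nn.Sequential:
    """Generic MLP head, used both as label classifier (out_dim = C classes)
    and as domain discriminator (out_dim = 1 domain logit)."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden),
        nn.ReLU(inplace=True),
        nn.Dropout(p=0.5),
        nn.Linear(hidden, out_dim),
    )

# Label classifier: input dim d (512 / 2048 / 192 depending on the backbone).
classifier = make_head(in_dim=512, out_dim=10)
# Domain discriminator: input dim d for DANN/ADDA, or d * C for CDAN+E.
discriminator = make_head(in_dim=512 * 10, out_dim=1)
```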
In all experimental scenarios, we report the classification accuracy and its associated standard deviation on the test set of the target domain. Moreover, during training, model performance is periodically evaluated on validation subsets drawn from both source and target domains to monitor intermediate generalization behavior. In this sense, the Accuracy (ACC) measure is defined as follows:
\[ \text{ACC} = \frac{1}{N}\sum_{n=1}^{N} \mathbb{1}\!\left[\hat{y}_n = y_n\right], \]
where $\hat{y}_n$ and $y_n$ denote the predicted and ground truth labels, respectively, and $\mathbb{1}[\cdot]$ is the indicator function that returns 1 if the condition is true and 0 otherwise. The standard deviation is estimated from the batch-wise accuracies, serving as a proxy for model stability during inference. The Baseline model is trained solely on labeled samples from the source domain and is directly evaluated in the target domain without any adaptation mechanisms. This setting establishes a lower bound for performance under domain shift conditions.
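As a concrete reading of the reported metric, the snippet below computes accuracy per evaluation batch and summarizes it as mean and standard deviation; this mirrors the protocol described above, although the exact aggregation in our code may differ slightly.

```python
import torch

@torch.no_grad()
def evaluate(model, loader, device="cuda"):
    """Return mean and standard deviation of batch-wise accuracies on a data loader."""
    batch_accs = []
    model.eval()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(dim=1)
        batch_accs.append((preds == labels).float().mean().item())
    accs = torch.tensor(batch_accs)
    return accs.mean().item(), accs.std().item()
```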
In addition to quantitative measures, we assess the discriminative quality of the learned feature representations using qualitative techniques. Specifically, we employ the well-known Uniform Manifold Approximation and Projection (UMAP) [52], a nonlinear dimensionality reduction technique, to project high-dimensional features into a two-dimensional latent space, enabling visual inspection of inter-domain and inter-class separability [81]. This technique facilitates an empirical evaluation of how well the feature extractor captures semantically consistent structures across domains. To further complement this analysis, we apply the GradCAM++ method to the classifier module in order to visualize spatial attention regions associated with individual predictions [82]. These attention maps provide insight into the decision-making process of the model and support a comparative interpretation of class activation patterns across source and target domains.
3.3. Training Details
The training procedure follows the standard protocol for unsupervised domain adaptation: all labeled data from the source domain are used along with the entire set of unlabeled data from the target domain. This protocol aims to learn domain-invariant representations without requiring explicit supervision in the target domain.
All models are trained using the Adam optimizer. For the non-adaptive baseline, models are trained with a fixed learning rate (set separately for the ResNet architectures and for ViT-Tiny) and no weight decay. For all domain adaptation methods, a dynamic scheduling scheme is employed for both the learning rate $\eta_p$ and the adversarial weighting parameter $\lambda_p$ to promote stable convergence and mitigate early overfitting of the discriminator. Both hyperparameters are updated according to the relative training progress $p \in [0,1]$, following the standard annealing expressions
\[ \lambda_p = \frac{2}{1 + e^{-\gamma p}} - 1, \qquad \eta_p = \frac{\eta_0}{(1 + \alpha p)^{\beta}}, \]
where the schedule hyperparameters $\gamma$, $\alpha$, and $\beta$ are set empirically (see Figure 9).
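Assuming the standard DANN-style schedules above, a minimal implementation looks as follows; the default constants are illustrative placeholders rather than the tuned values.

```python
import math

def schedules(p: float, eta0: float = 1e-4, gamma: float = 10.0,
              alpha: float = 10.0, beta: float = 0.75) -> tuple[float, float]:
    """Adversarial weight and learning rate as functions of the relative
    training progress p in [0, 1] (standard DANN-style annealing)."""
    lambda_p = 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0  # ramps from 0 toward 1
    eta_p = eta0 / (1.0 + alpha * p) ** beta             # decays smoothly
    return lambda_p, eta_p
```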
In addition to stratified sampling, the batch size is dynamically adjusted based on the size of the training set ($N$) in each domain, following an empirical rule that bounds it between fixed lower and upper limits (discussed below).
The initial learning rate was empirically tuned for each model, method, and dataset. Notably, the first stage of ADDA was trained with a fixed learning rate. Furthermore, to adapt the pretrained ViT-Tiny architecture to the lower-resolution Digits dataset, we applied bicubic interpolation to its positional embeddings. This step was necessary to align the spatial dimensions of the pretrained weights (originally learned for 224 × 224 inputs) with the target image size, enabling effective knowledge transfer.
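The positional-embedding resizing can be done as in the sketch below, which bicubically interpolates the 2D grid of patch position embeddings to a new input size. The tensor layout assumed here (a [CLS] token followed by a 14 × 14 patch grid, as in vit_tiny_patch16_224) is standard for timm ViTs, but the helper itself is our own illustration.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, new_grid: int, old_grid: int = 14) -> torch.Tensor:
    """Bicubically interpolate ViT positional embeddings to a new patch grid.

    pos_embed: tensor of shape (1, 1 + old_grid**2, dim), with the [CLS] embedding first.
    """
    cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.shape[-1]
    # (1, N, dim) -> (1, dim, old_grid, old_grid) for spatial interpolation
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_tok, patch_pos], dim=1)
```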
Next, to maintain class balance during model training and evaluation, an initial partition is performed into training (70%), validation (15%), and test (15%) subsets. This process is conducted independently for both the source and target domains. To ensure representative subsets, stratified sampling is applied within each partition, preserving the internal class distributions of each domain. In particular, the independent construction of the validation sets enables consistent and comparable evaluation conditions across domains, which is essential in domain adaptation scenarios where distributional shifts may introduce evaluation bias.
The lower and upper bounds were established empirically. The lower bound ensures the existence of at least 10 mini-batches per epoch, contributing to optimization stability and preventing prohibitively long training times on small datasets. Conversely, the upper bound avoids excessively large batches that could destabilize learning or exceed GPU memory capacity. This configuration strikes an effective trade-off between gradient stability and computational efficiency, especially when handling domains of different sizes.
It is important to note that, since both dataset partitioning and batch size are determined by the number of available samples in each domain, the number of training instances per epoch is not the same across domains. This asymmetry reflects the inherent scale differences between datasets and allows each domain to contribute proportionally to the learning process without enforcing artificial uniformity.
For all experiments, the kernel bandwidth parameter $\sigma$ used in the estimation of Rényi’s quadratic entropy was adaptively determined for each training batch using the median heuristic. This common practice involves setting $\sigma$ as the square root of the median of all pairwise squared Euclidean distances within the combined source and target feature batch, as follows [83]:
\[ \sigma = \sqrt{\operatorname{median}\left\{\lVert z_i - z_j \rVert^{2} \;:\; z_i, z_j \in \mathcal{Z}_s \cup \mathcal{Z}_t,\; i \neq j\right\}}, \]
where $\mathcal{Z}_s$ and $\mathcal{Z}_t$ denote the source and target feature batches. This data-driven approach automates a critical hyperparameter, ensuring that the kernel’s scale is appropriately tailored to the feature distribution, which enhances the stability and effectiveness of the alignment process across domains.
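A direct implementation of this heuristic is shown below; torch.cdist and torch.median are standard PyTorch calls, and the small epsilon guarding against a zero median is our own safeguard.

```python
import torch

def median_bandwidth(feats_s: torch.Tensor, feats_t: torch.Tensor, eps: float = 1e-8) -> float:
    """Median heuristic: sigma = sqrt(median of pairwise squared distances)
    over the combined source and target feature batch."""
    z = torch.cat([feats_s, feats_t], dim=0)
    d2 = torch.cdist(z, z) ** 2                                        # pairwise squared distances
    off_diag = d2[~torch.eye(len(z), dtype=torch.bool, device=z.device)]  # drop self-distances
    return float(torch.sqrt(torch.median(off_diag) + eps))
```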
Moreover, to qualitatively assess the discriminative capacity of the learned features, we apply dimensionality reduction using UMAP, leveraging the GPU-accelerated cuML implementation. Unless otherwise stated, the default parameters are set as follows: n_components = 2, n_neighbors = 80, and random_state = 42. Prior to projection, features are normalized with MinMaxScaler, which facilitates visual inspection of inter-class and inter-domain separability in the latent space. Also, we employ the GradCAM++ technique via the torchcam library to visualize class-specific attention regions within the input images. Representative samples for each class are selected from both source and target domains, and the last convolutional layer of the feature extractor is designated as the target layer. The resulting attention masks are normalized and overlaid on the corresponding images, offering a qualitative perspective on the spatial focus of the model during classification.
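The following sketch reflects the visualization setup just described, using cuML’s GPU UMAP with the stated parameters and torchcam’s GradCAMpp extractor. The model and layer names, as well as the feature array handling, are placeholders; umap-learn can be substituted for cuML on CPU-only machines.

```python
from sklearn.preprocessing import MinMaxScaler
from cuml.manifold import UMAP            # GPU UMAP; umap-learn works as a CPU fallback
from torchcam.methods import GradCAMpp

def project_features(features):
    """2D UMAP projection of min-max scaled feature vectors for visual inspection."""
    scaled = MinMaxScaler().fit_transform(features)
    reducer = UMAP(n_components=2, n_neighbors=80, random_state=42)
    return reducer.fit_transform(scaled)

def attention_map(model, image_tensor, target_layer="layer4"):
    """Grad-CAM++ attention mask for the predicted class of a single image."""
    cam_extractor = GradCAMpp(model, target_layer=target_layer)
    scores = model(image_tensor.unsqueeze(0))
    mask = cam_extractor(scores.argmax(dim=1).item(), scores)[0]
    return mask  # normalize and overlay on the input image for display
```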
Our experiments were conducted on the Google Colab platform, leveraging a high-performance instance equipped with an NVIDIA (Santa Clara, CA, USA) A100 GPU (40.0 GB of VRAM), 83.5 GB of system RAM, and 235.7 GB of disk storage. For full reproducibility, we set a global random seed of 42 across Python, NumPy 2.0.2, and PyTorch (for both CPU and CUDA) and configured the cuDNN backend to use deterministic algorithms, ensuring consistent results from GPU computations. The development environment was based on Python 3.11.11, using PyTorch 2.1.2 for model training, cuML 25.02.01 for GPU-accelerated UMAP visualization, and torchcam 0.4.0 for GradCAM++. All source code and datasets are publicly available at: https://github.com/Daprosero/Domain_Adaptation (accessed on 4 July 2025).
5. Conclusions
This work introduced a novel domain adaptation framework, termed Conditional Rényi-Entropy Domain Adaptation (CREDA), a deep learning-based strategy that integrates kernel-based conditional alignment derived from a matrix-based formulation of Rényi’s quadratic entropy. CREDA is structured around three key components. First, a deep feature extractor is used to learn domain-invariant representations by leveraging labeled source data and unlabeled target data. Second, an entropy-weighted strategy attenuates the influence of low-confidence pseudo-labels, thereby enhancing robustness in ambiguous regions. Third, a class-conditional alignment loss, expressed as a Rényi divergence, is introduced to promote semantic consistency across domains within the latent representation space. In contrast to supervised or semi-supervised approaches, the proposed method does not require labels in the target domain, making it particularly suitable for scenarios where annotation is costly or unavailable. Moreover, our class-wise alignment is formulated in a non-parametric and differentiable manner by leveraging kernel-based information potentials, enabling the preservation of semantic structure across domains.
Experimental results across diverse visual adaptation scenarios demonstrate that CREDA consistently outperforms conventional methods such as DANN, ADDA, and CDAN+E in terms of predictive accuracy, representational quality, and interpretability. In particular, CREDA achieves the highest average accuracy across all datasets and architectures, with noticeable improvements when using deeper CNNs (ResNet-50) and attention-based models (ViT-Tiny). While most adversarial approaches experience performance degradation in these settings, CREDA remains robust and effective, as evidenced by the results presented in this study. Notably, CREDA maintains class separability even under complex distribution shifts and when the predicted labels in the target domain exhibit low confidence. The integration of UMAP- and GradCAM++-based visualizations offers valuable insights into the learned representations, reinforcing its applicability in real-world settings where traceability and semantic coherence are critical. From an implementation standpoint, CREDA does not require modifications to the classification loss function. Its confidence-aware weighting scheme and class-conditional regularization enhance robustness to pseudo-label noise and class imbalance. Moreover, its modular architecture facilitates seamless integration into existing deep learning pipelines.
As future work, we aim to test CREDA on larger-scale datasets. First, we plan to extend CREDA to multi-source and continual domain adaptation settings, where domain shifts occur either simultaneously or sequentially; attention-based class-conditioned alignment across multiple source domains has been shown to mitigate negative transfer and effectively address class imbalance [85]. Second, we plan to incorporate class-conditional kernel alignment and attention-guided feature disentanglement to improve both interpretability and discriminative alignment, particularly in contexts characterized by subtle inter-class distinctions or limited labeled data. Additionally, exploring temporal or streaming variants of CREDA could prove beneficial in online adaptation scenarios, where data arrives sequentially and models must adapt incrementally; recent advances in attention-aware class-conditioned alignment suggest that these mechanisms yield robust feature representations and highlight relevant discriminative regions in multi-source adaptation [86]. Finally, while CREDA was conceived for the standard unsupervised adaptation setting, its extension to more challenging scenarios, such as few-shot or source-free adaptation, remains uninvestigated [87]. Addressing these limitations would not only enhance the robustness of the proposed framework but also broaden its applicability to more complex transfer learning problems.