Mathematics
  • Article
  • Open Access

10 January 2025

MeTa Learning-Based Optimization of Unsupervised Domain Adaptation Deep Networks

1 Department of Information Management, Chihlee University of Technology, Taipei 220305, Taiwan
2 Department of Computer Science and Information Engineering, Tamkang University, Taipei 251301, Taiwan
3 Department of Applied Mathematics, National Chung Hsing University, Taichung 402202, Taiwan
4 Multidisciplinary Graduate Engineering, College of Engineering, Northeastern University, Boston, MA 02115, USA
This article belongs to the Special Issue Advances in Intelligent Computing, Machine Learning and Pattern Recognition

Abstract

This paper introduces a novel unsupervised domain adaptation (UDA) method, MeTa Discriminative Class-Wise MMD (MCWMMD), which combines meta-learning with a Class-Wise Maximum Mean Discrepancy (MMD) approach to enhance domain adaptation. Traditional MMD methods align overall distributions but struggle with class-wise alignment, reducing feature distinguishability. MCWMMD incorporates a meta-module to dynamically learn a deep kernel for MMD, improving alignment accuracy and model adaptability. This meta-learning technique enhances the model’s ability to generalize across tasks by ensuring domain-invariant and class-discriminative feature representations. Despite the complexity of the method, including the need for meta-module training, it presents a significant advancement in UDA. Future work will explore scalability in diverse real-world scenarios and further optimize the meta-learning framework. MCWMMD offers a promising solution to the persistent challenge of domain adaptation, paving the way for more adaptable and generalizable deep learning models.

1. Introduction

The success of deep learning relies heavily on large annotated datasets. However, annotating a substantial number of images with object content is a time-consuming and labor-intensive task. The advent of Generative Adversarial Networks (GANs) [] has partially alleviated this issue, facilitating advancements in deep learning by enabling the creation of synthetic data. Despite this progress, existing learning algorithms often struggle with limited generalization across different datasets—a challenge known as domain adaptation (DA). Traditional recognition tasks typically assume that training data (source domain) and testing data (target domain) share a common distribution. In practice, this assumption rarely holds, as test data can come from diverse sources and modalities, leading to poor generalization and the phenomenon known as domain shift.
Various methods have been proposed to tackle domain adaptation [,,,,], focusing mainly on aligning feature distributions between domains by measuring and minimizing differences. Another approach in UDA leverages meta-learning to generalize across new, unlabeled domains by learning adaptable representations. For instance, Vettoruzzo et al. [] proposed a meta-learning framework that optimizes model parameters to achieve effective adaptation across domains with minimal labeled data, showing strong adaptability even with limited unlabeled test samples. This method emphasizes efficient domain adaptation, leveraging knowledge from prior domains to improve generalization under distribution shifts. Recent advancements in deep unsupervised domain adaptation (UDA) have introduced more sophisticated strategies. A comprehensive 2022 review [] examined developments such as feature alignment, self-supervision, and representation learning, highlighting current trends and future directions. A 2023 approach employing domain-guided conditional diffusion models [] demonstrated enhanced transfer performance by generating synthetic samples for the target domain, thus bridging domain gaps more effectively. Additionally, cross-domain contrastive learning [] has shown promise in promoting domain-invariant features by minimizing feature distances across domains, and manifold-based techniques like Discriminative Manifold Propagation [] have leveraged probabilistic criteria and metric alignment to achieve both transferability and discriminability.
Domain-Adversarial Neural Networks (DANNs) [] introduced adversarial training with a gradient reversal layer, laying the groundwork for adversarial domain adaptation approaches. ADDA (Adversarial Discriminative Domain Adaptation) [] further improved this framework by incorporating untied weight sharing for flexible feature alignment. Deep Adaptation Networks (DANs) [] employed Maximum Mean Discrepancy (MMD) for kernel-based feature alignment, establishing an influential precedent in UDA. Techniques such as CyCADA [] combined pixel-level and feature-level adaptations to comprehensively mitigate domain shifts, while MCD (Maximum Classifier Discrepancy) [] used classifier-based discrepancy maximization to enhance target domain adaptation.
A significant challenge in domain adaptation lies in effectively measuring these distances [,]. Classical metrics such as Quadratic [], Kullback–Leibler [], and Mahalanobis [] distances often lack flexibility and fail to generalize across models. Maximum Mean Discrepancy (MMD) [], which embeds distribution metrics within a Reproducing Kernel Hilbert Space, has gained traction due to its robust theoretical foundation and application in various settings, such as transfer learning [], kernel Bayesian inference [], approximate Bayesian computation [], and MMD GANs []. Despite its simplicity, selecting the optimal bandwidth for Gaussian kernels in MMD remains challenging. Liu et al. [] addressed this by introducing a parameterized deep kernel, known as Maximum Mean Discrepancy with a Deep Kernel (MMDDK), which adapts kernel parameters for more precise domain alignment.
MMD effectively aligns overall domain distributions but struggles with precise class-wise feature alignment. Long et al. [] addressed this by proposing Class-Wise Maximum Mean Discrepancy (CWMMD), which maps samples from both domains into a shared space and calculates the MMD for each category, summing them to derive the CWMMD. However, these approaches often involve linear transformations, which may not capture complex relationships needed for deeper alignment. Wang et al. [] provided insights into the MMD’s theoretical foundations, highlighting its role in extracting shared semantic features across diverse categories while maximizing intra-class distances between source and target domains. This approach, however, reduced feature discriminativeness and relied on linear transformations with L2 norm estimations, which may not suffice for general, nonlinear relationships [,]. In contrast, deep neural networks, particularly convolutional neural networks (CNNs), excel at learning expressive, nonlinear transformations. Our previous work [] proposed training a CNN architecture to automatically learn task-specific feature representations.
Meta-learning, or “learning to learn”, has gained attention for its ability to rapidly adapt to new tasks [,]. This proposal introduces a novel UDA method that leverages a class-wise, deep kernel-based MMD, optimized through meta-learning. This approach aims to enhance the adaptability and performance of UDA models by incorporating flexible, data-driven kernel learning mechanisms.
The contributions of this paper are summarized as follows: (1) It presents the development of the novel MCWMMD framework, which combines meta-learning with a Class-Wise MMD approach, specifically enhancing class-wise distribution alignment for unsupervised domain adaptation (UDA). (2) It introduces a meta-module that dynamically learns a deep kernel, optimizing domain alignment by adapting to the unique characteristics of each class distribution. (3) It provides a demonstration of improved cross-domain recognition performance, validated through extensive experiments on diverse benchmark datasets, showcasing the framework’s adaptability and effectiveness.

3. The Proposed Method

The proposed unsupervised domain adaptation (UDA) approach primarily utilizes Discriminative Class-Wise Maximum Mean Discrepancy (MMD) to align the class-level data distributions of the source and target domains. This addresses the reduced feature distinguishability that arises when MMD merely minimizes the mean deviation between the two domains, thereby effectively achieving the goal of UDA. The MMD used here, however, is learned with a meta-module MTMMD to obtain an MMD with deep kernels (MMDDK). The framework of the proposed method, called MeTa Discriminative Class-Wise MMD (MCWMMD), is shown in Figure 1. The orange block represents the feature extractor F, which is responsible for extracting domain-invariant features. The green block represents the classifier C, which predicts class labels based on the extracted features. The light red block represents the meta-module MTMMD, referred to as "MMDDK", which measures the distance between the feature distributions of samples from the two domains.
Figure 1. Training framework for MCWMMD.
The training objective of the meta-module MTMMD is to enhance its ability to discriminate between the two domains. This is achieved by updating MMDDK to maximize the feature distance measurement values of samples from the two domains. Conversely, the training objective of the MCWMMD module is to update the feature extractor F so that the feature distance measurement values of samples from the two domains, computed using the current MTMMD, are minimized.
These opposing training objectives result in a process resembling adversarial training, where the two modules iteratively adjust to counteract each other. This alternating training process allows each module to improve its performance while balancing the influence of the other. The remainder of this section will introduce the detailed training processes of these two modules.
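For orientation, the following is a minimal sketch of the three blocks in Figure 1 as they might look in PyTorch; the layer sizes and module internals are our illustrative assumptions, not the authors' exact architecture.

```python
import torch.nn as nn

class FeatureExtractor(nn.Module):            # F (orange block in Figure 1)
    def __init__(self, backbone, feat_dim=256):
        super().__init__()
        self.backbone = backbone              # e.g., a ResNet trunk
        self.proj = nn.LazyLinear(feat_dim)   # projects to domain-invariant features
    def forward(self, x):
        return self.proj(self.backbone(x).flatten(1))

class Classifier(nn.Module):                  # C (green block)
    def __init__(self, feat_dim=256, num_classes=31):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)
    def forward(self, z):
        return self.fc(z)

class MetaModule(nn.Module):                  # MTMMD / "MMDDK" (light red block)
    def __init__(self, in_dim=2 * 256 + 3, hidden=128):
        super().__init__()
        # input: second-order features of both domains plus three kernel scalars
        self.net = nn.Sequential(nn.Linear(in_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 2))
    def forward(self, inp):                   # returns (M_mt, V_mt)
        out = self.net(inp)
        return out[..., 0], out[..., 1]
```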

3.1. Deep Kernel Training Network

According to the Maximum Mean Discrepancy with Deep Kernels (MMDDK) defined by Liu et al. [], as expressed in Equation (7), we construct a training network for the MMDDK, as depicted in Figure 2. The input to the training network is the feature vector $z = F(x)$ extracted by the MCWMMD network, where the two vectors $z_s = F(x_s)$ and $z_t = F(x_t)$ (referred to as first-order features) are used to compute the Gaussian function value $q(z_s, z_t)$ in Equation (9). They are also separately fed into another feature extractor $F_d$ to obtain $\hat{z}_s = F_d(z_s)$ and $\hat{z}_t = F_d(z_t)$ (referred to as second-order features), which are used to compute the Gaussian function value $\kappa(\hat{z}_s, \hat{z}_t)$ in Equation (8). The two Gaussian function values $q(z_s, z_t)$ and $\kappa(\hat{z}_s, \hat{z}_t)$ are then combined by the operator $k_\omega$ defined in Equation (7) to calculate the deep kernel distance $k_\omega(z_s, z_t)$. The bandwidths $\sigma_\phi$ and $\sigma_q$, the weight $\epsilon$, and the network parameters $\theta_d$ of $F_d$ are jointly trained through this deep neural network, denoted $F_\omega$ with $\omega = (\theta_d, \sigma_\phi, \sigma_q, \epsilon)$. Training maximizes the objective function $J_\lambda$ in Equation (10). The network training algorithm is presented in Algorithm 1.
Algorithm 1 Training the MMDDK
Input: $\eta_1$;
Initialize $\omega = (\theta_d, \sigma_\phi, \sigma_q, \epsilon)$;
repeat until convergence
   $(X_s, Y_s) = \{(x_s^1, y_s^1), (x_s^2, y_s^2), \ldots, (x_s^N, y_s^N)\} \leftarrow$ mini-batch from $\Delta_s$;
   $X_t = \{x_t^1, x_t^2, \ldots, x_t^N\} \leftarrow$ mini-batch from $\Delta_t$;
   $Z_s \leftarrow F(X_s)$; $Z_t \leftarrow F(X_t)$;
   $\hat{Z}_s \leftarrow F_d(Z_s)$; $\hat{Z}_t \leftarrow F_d(Z_t)$;
   $M(\omega) \leftarrow \mathrm{MMDDK}_u^2(Z_s, Z_t; k_\omega)$;  # using (11)
   $V(\omega) \leftarrow \sigma^2(Z_s, Z_t; k_\omega)$;  # using (12) with $\lambda = 0$
   $J(\omega) \leftarrow M(\omega) / V(\omega)$;  # using (10)
   # update parameters:
   $\omega \leftarrow \omega + \eta_1 \nabla_\omega J(\omega)$;  # maximizing $J$
end repeat
Figure 2. Training framework for MMDDK.
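To fix ideas, here is a minimal sketch of the deep kernel described above, under our reading of Equations (7)-(9) and of Liu et al.'s formulation: a Gaussian $\kappa$ on the second-order features is blended with a Gaussian $q$ on the first-order features, and an unbiased squared-MMD estimate is formed from the resulting kernel matrices. The helper names are ours, not the paper's.

```python
import torch

def gaussian(a, b, sigma):
    # pairwise Gaussian kernel matrix: exp(-||a_i - b_j||^2 / sigma^2)
    return torch.exp(-torch.cdist(a, b).pow(2) / sigma**2)

def deep_kernel(zs, zt, F_d, sigma_phi, sigma_q, eps):
    # our reading of Eq. (7): combine kappa on second-order features F_d(z)
    # with q on first-order features z, weighted by eps
    kappa = gaussian(F_d(zs), F_d(zt), sigma_phi)   # Eq. (8)
    q = gaussian(zs, zt, sigma_q)                   # Eq. (9)
    return ((1 - eps) * kappa + eps) * q

def mmd2_u(zs, zt, kern):
    # unbiased squared-MMD estimate from kernel matrices (cf. Eq. (11))
    Kss, Ktt, Kst = kern(zs, zs), kern(zt, zt), kern(zs, zt)
    n = zs.size(0)
    return ((Kss.sum() - Kss.diag().sum()) / (n * (n - 1))
            + (Ktt.sum() - Ktt.diag().sum()) / (n * (n - 1))
            - 2 * Kst.mean())
```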

3.2. Meta-Learning of Maximum Mean Discrepancy

In this section, we redefine and compute the Maximum Mean Discrepancy (MMD), originally defined and calculated by Wang et al. [] in linear transformation spaces, in the Reproducing Kernel Hilbert Space (RKHS), using the kernel trick for straightforward MMD computation. Consequently, we also redefine their between-class distance squared ($S^{inter}$), within-class distance squared ($S^{intra}$), and variance ($S^{var}$), and demonstrate that, under the Gaussian kernel-based MMD, the between-class distance-squared MMD between the source and target domains can be decomposed into the sum of the within-class distance-squared MMDs of both domains and their variance difference.
As described in Section 2, minimizing the between-class distance squared $S^{inter}$ is equivalent to maximizing the within-class distance squared $S^{intra}$ for both the source and target domains while simultaneously minimizing their total variance $S^{var}$, which leads to decreased feature discriminability. To address this issue, a balancing parameter $\beta$ ($-1 \le \beta \le 1$) is applied to the within-class distance squared $S^{intra}$ hidden inside $S^{inter}$, yielding the discriminative class-wise loss function $\mathcal{L}_{dcwmmd}$ defined in Equation (34), which can be rewritten as Equation (35).
For convenience, let us define $\mathrm{MMD}_0 = S_{st}^{inter_0} = \mathrm{MMD}$. Hence, Equation (35) can also be rewritten as Equation (36), where the second term represents the sum of the marginal MMD and the conditional MMDs, and the coefficient $\beta' = \beta + 1$ adjusts between 0 and 2, i.e., $0 \le \beta' \le 2$.
$\mathcal{L}_{dcwmmd} = (\beta + 1) \cdot S^{intra} + \left( S^{var} - S^{intra} \right) + \mathrm{MMD}^2(Z_s, Z_t) = (\beta + 1) \cdot S^{intra} + S^{inter} + \mathrm{MMD}^2(Z_s, Z_t)$ (35)
$\mathcal{L}_{dcwmmd} = \beta' \cdot \sum_{c=1}^{C} S_{st}^{intra_c} + \sum_{c=0}^{C} \mathrm{MMD}_c^2$ (36)
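To make the class-wise structure of Equation (36) concrete, the following is a short PyTorch-style sketch. It assumes a generic squared-MMD estimator `mmd2` (such as the `mmd2_u` helper sketched in Section 3.1) and a stand-in `s_intra` for the within-class term $S_{st}^{intra_c}$; `yt_pseudo` denotes target pseudo-labels, and all names are ours rather than the authors'.

```python
def dcwmmd_loss(zs, zt, ys, yt_pseudo, mmd2, s_intra, beta_prime, num_classes):
    # marginal term MMD_0 over the whole batch (the c = 0 term of Eq. (36))
    loss = mmd2(zs, zt)
    for c in range(num_classes):
        ms, mt = ys == c, yt_pseudo == c
        if ms.any() and mt.any():                # skip classes absent from the batch
            loss = loss + mmd2(zs[ms], zt[mt])   # conditional MMD_c
            loss = loss + beta_prime * s_intra(zs[ms], zt[mt])  # weighted within-class term
    return loss
```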
Although we adopt the Discriminative Class-Wise Maximum Mean Discrepancy (DCWMMD), where MMD is computed based on a Gaussian function in a Reproducing Kernel Hilbert Space (RKHS), there is no reliable method to select the appropriate bandwidth value for the Gaussian function. Therefore, in this study, we choose the Maximum Mean Discrepancy with Deep Kernels (MMDDK) proposed by Liu et al. [], where the bandwidth is learned by the network, endowing the MMDDK with stronger discriminative power. To further adapt the MMDDK to the mean discrepancy calculations for different domain pairs, we employ meta-learning to learn this MMDDK, resulting in a method called MeTa Maximum Mean Discrepancy (MTMMD), which is more suitable for efficient optimization using gradient descent [,].
Our proposed MTMMD network architecture, shown in Figure 3, builds on concepts similar to previous meta-learned loss functions []. It parameterizes the Maximum Mean Discrepancy through a neural network $F_\psi$, which receives the second-order features $\hat{Z}_s$ and $\hat{Z}_t$ produced by the MMDDK model $F_\omega$, along with the bandwidths $\sigma_\phi$ and $\sigma_q$ and the weight $\epsilon$. We aim to learn the parameters $\psi$ such that when $\omega = (\theta_d, \sigma_\phi, \sigma_q, \epsilon)$ is updated through $F_\psi$, the final performance is optimal. Learning $\omega$ involves maximizing not only the original objective function $J$ but also the meta-learning objective function $J_{mt}$ output by $F_\psi = (M_\psi, V_\psi)$. The parameters $\omega$ and $\psi$ are alternately updated, as shown in Equations (37) and (38).
Figure 3. MTMMD network training process.
The primary goal of both updates is to maximize the value of the mean discrepancy function; hence, we aim for both $J_{mt}$ and $J$ to be maximized, with the parameter adjustments being positive multiples of the partial derivatives. The MTMMD network training architecture is illustrated in Figure 3. Since the MMDDK and MTMMD are trained together, the gradient of the MMDDK objective function $J$ is also used to update its parameters $\omega$, turning Equation (37) into Equation (39). The MTMMD network training and inference algorithms are presented in Algorithms 2 and 3, respectively. Subsequently, the domain adaptation training uses the two-domain mean discrepancy loss function in which the $\mathrm{MMD}_c^2$ in $\mathcal{L}_{dcwmmd}$ is replaced by the meta-learned $M_{mt}^c$ of Equation (40), resulting in the loss function $\mathcal{L}_{mtdcwmmd}$ in Equation (41).
$\omega^{t+1} \leftarrow \omega^t + \alpha_1 \cdot \frac{\partial J_{mt}\left(F_\psi\left(F_{\omega^t}(Z_s, Z_t; k_{\omega^t})\right)\right)}{\partial \omega^t}$ (37)
$\psi^{t+1} \leftarrow \psi^t + \alpha_2 \cdot \frac{\partial J\left(F_{\omega^{t+1}}(Z_s, Z_t; k_{\omega^{t+1}})\right)}{\partial \psi^t}$ (38)
$\omega^{t+1} \leftarrow \omega^t + \alpha_0 \cdot \frac{\partial J\left(F_{\omega^t}(Z_s, Z_t; k_{\omega^t})\right)}{\partial \omega^t} + \alpha_1 \cdot \frac{\partial J_{mt}\left(F_\psi\left(F_{\omega^t}(Z_s, Z_t; k_{\omega^t})\right)\right)}{\partial \omega^t}$ (39)
$M_{mt}^c = M_\psi\left(F_\omega(Z_s^c, Z_t^c; k_\omega)\right)$ (40)
$\mathcal{L}_{mtdcwmmd} = \beta' \cdot \sum_{c=1}^{C} S_{st}^{intra_c} + \sum_{c=0}^{C} M_{mt}^c$ (41)
The MMDDK model $F_\omega$ passes its predictions $F_{\omega^t}$ to the meta-module MTMMD $F_\psi$, which outputs $F_{\psi^t} = (M_{mt}^t, V_{mt}^t)$, where $M_{mt}^t$ is the mean discrepancy value and $V_{mt}^t$ is the variance. We optimize $\psi$ to ensure that, after optimizing the MMDDK model for $J_{mt}$ with $F_{\omega^{t+1}}(Z_s, Z_t; k_{\omega^{t+1}})$, the updated $\omega^{t+1}$ performs better (i.e., yields a higher value of the MMDDK objective function $J$). To achieve this, we take a gradient step on the meta-module's objective function $J_{mt}$ to update the MMDDK model parameters to $\omega^{t+1}$, and then we update $\psi$ by evaluating $\omega^{t+1}$ with the MMDDK objective function $J$.
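Before the full procedure in Algorithm 2 below, here is a condensed PyTorch sketch of the meta-update pair in Equations (37) and (38), restricted for clarity to the kernel hyperparameters as $\omega$ and reusing the `deep_kernel` and `mmd2_u` helpers sketched in Section 3.1; `variance_est` (an Equation (12)-style estimator), `meta_module` ($F_\psi$), `F_d`, and the optimizer `opt_psi` are assumed to exist, and their signatures are our guesses. The essential point is that the $\omega$ step keeps its graph (`create_graph=True`) so that $J$ evaluated at the updated $\omega$ can be backpropagated into $\psi$.

```python
import torch

# omega restricted to the kernel hyperparameters for this sketch
sigma_phi = torch.tensor(1.0, requires_grad=True)
sigma_q = torch.tensor(1.0, requires_grad=True)
eps = torch.tensor(0.5, requires_grad=True)
omega = [sigma_phi, sigma_q, eps]

def J_of(om, zs, zt):
    # MMDDK objective J = M/V (Eq. (10)) for a given omega
    kern = lambda a, b: deep_kernel(a, b, F_d, om[0], om[1], om[2])
    return mmd2_u(zs, zt, kern) / variance_est(zs, zt, kern)

# meta-objective: F_psi scores the current kernel configuration
M_mt, V_mt = meta_module(F_d(zs), F_d(zt), *omega)
J_mt = M_mt / V_mt

# Eq. (37): ascend J_mt in omega, keeping the graph for the psi step
grads = torch.autograd.grad(J_mt, omega, create_graph=True)
omega_new = [w + alpha1 * g for w, g in zip(omega, grads)]

# Eq. (38): evaluate J at the updated omega and ascend it in psi
opt_psi.zero_grad()
(-J_of(omega_new, zs, zt)).backward()   # negate: PyTorch optimizers minimize
opt_psi.step()
```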
Algorithm 2 Training the MTMMD
Input: $\alpha_0, \alpha_1, \alpha_2$;
Initialize $\omega$ and $\psi$: the parameter sets of the MMDDK model $F_\omega$ and the MTMMD model $F_\psi$; $T \leftarrow 10{,}000$;  # $\omega = (\theta_d, \sigma_\phi, \sigma_q, \epsilon)$
for $t \leftarrow 0$ to $T$ do
   $(X_s, Y_s) = \{(x_s^1, y_s^1), (x_s^2, y_s^2), \ldots, (x_s^N, y_s^N)\} \leftarrow$ mini-batch from $\Delta_s$;
   $X_t = \{x_t^1, x_t^2, \ldots, x_t^N\} \leftarrow$ mini-batch from $\Delta_t$;
   $Z_s \leftarrow F(X_s)$; $Z_t \leftarrow F(X_t)$;
   $(\hat{Z}_s, \hat{Z}_t, \sigma_\phi, \sigma_q, \epsilon) \leftarrow F_\omega(Z_s, Z_t; k_\omega)$;  # $\hat{Z}_s = F_d(Z_s)$; $\hat{Z}_t = F_d(Z_t)$
   # alternately update the parameters $\omega$ and $\psi$:
   $M \leftarrow \mathrm{MMDDK}_u^2(Z_s, Z_t; k_\omega)$;  # using (11)
   $V \leftarrow \sigma^2(Z_s, Z_t; k_\omega)$;  # using (12) with $\lambda = 0$
   $J \leftarrow M / V$;  # using (10)
   if $t$ is even then
      $(M_{mt}, V_{mt}) \leftarrow F_\psi(\hat{Z}_s, \hat{Z}_t, \sigma_\phi, \sigma_q, \epsilon)$;
      $J_{mt} \leftarrow M_{mt} / V_{mt}$;
      $\omega \leftarrow \omega + \alpha_0 \nabla_\omega J + \alpha_1 \nabla_\omega J_{mt}$;  # maximizing $J$ and $J_{mt}$, Equation (39)
   else
      $\psi \leftarrow \psi + \alpha_2 \nabla_\psi J$;  # maximizing $J$, Equation (38)
   end if
end for
Algorithm 3 MTMMD Inference
Input: $\omega$ and $\psi$;
   $(X_s, Y_s) = \{(x_s^1, y_s^1), (x_s^2, y_s^2), \ldots, (x_s^N, y_s^N)\} \leftarrow$ mini-batch from $\Delta_s$;
   $X_t = \{x_t^1, x_t^2, \ldots, x_t^N\} \leftarrow$ mini-batch from $\Delta_t$;
   $Z_s \leftarrow F(X_s)$; $Z_t \leftarrow F(X_t)$;
   $(\hat{Z}_s, \hat{Z}_t, \sigma_\phi, \sigma_q, \epsilon) \leftarrow F_\omega(Z_s, Z_t; k_\omega)$;  # $\hat{Z}_s = F_d(Z_s)$; $\hat{Z}_t = F_d(Z_t)$
   $(M_{mt}, V_{mt}) \leftarrow F_\psi(\hat{Z}_s, \hat{Z}_t, \sigma_\phi, \sigma_q, \epsilon)$;
return $M_{mt}$

3.3. MeTa Discriminative Class-Wise Maximum Mean Discrepancy

The proposed UDA approach, based on MeTa Discriminative Class-Wise Maximum Mean Discrepancy (MCWMMD), includes a feature extractor $F$ that extracts domain-invariant features for the classifier $C$, as shown in Figure 1. Inputs $x_s$ and $x_t$ are fed into the feature extractor $F$, yielding $z_s = F(x_s)$ and $z_t = F(x_t)$. These outputs are then passed to the classifier $C$ for classification, producing the predictions $\hat{l}_s = C(z_s)$ and $\hat{l}_t = C(z_t)$. In practice, the batch size for both the source domain and the target domain is set to $N$, with a total of $C$ category labels. The feature extractor $F$ extracts features from the input samples $X_s = \{x_s^i\}_{i=1}^{N}$ and $X_t = \{x_t^j\}_{j=1}^{N}$ and outputs $Z_s = \{z_s^i\}_{i=1}^{N}$ and $Z_t = \{z_t^j\}_{j=1}^{N}$, respectively. These features, $Z_s$ and $Z_t$, are then input into the classifier $C$ for classification. In the diagram, $F$ and $C$ are each depicted twice to correspond to the data paths of the source and target domains, with a dashed line in between indicating shared parameters. The MCWMMD network is trained by minimizing the total loss function $\mathcal{L}_{total}$ defined in Equation (42), where $\mathcal{L}_{mtdcwmmd}$ is defined in Equation (41) and $\mathcal{L}_{cls\text{-}ls}$ is defined in Equation (44). The latter is a label-smoothed version of the classification cross-entropy in Equation (43), designed to encourage samples to fall into compact, uniform, and well-separated clusters: the original prediction target $y_s^i$ is replaced by $(1-\alpha) y_s^i + (\alpha/C)\mathbf{1}$, where $\mathbf{1}$ is a $C$-dimensional vector of ones and $\alpha$ is the smoothing parameter. Additionally, $\mathcal{L}_{ent}$ represents the predicted label entropy of the target samples, as given in Equation (45). The network training algorithm for the MCWMMD module is presented in Algorithm 4.
$\mathcal{L}_{total} = \mathcal{L}_{mtdcwmmd} + \omega_2 \cdot \mathcal{L}_{cls\text{-}ls} + \omega_3 \cdot \mathcal{L}_{ent}$ (42)
$\mathcal{L}_{cls}(Z_s, Y_s) = -\frac{1}{n_s} \sum_{i=1}^{n_s} \sum_{c=1}^{C} y_s^{ic} \log \hat{l}_s^{ic}$ (43)
$\mathcal{L}_{cls\text{-}ls}(Z_s, Y_s) = -\frac{1}{n_s} \sum_{i=1}^{n_s} \sum_{c=1}^{C} \left[ (1-\alpha) y_s^{ic} + \alpha/C \right] \log \hat{l}_s^{ic}$ (44)
$\mathcal{L}_{ent}(Z_t) = -\frac{1}{N} \sum_{j=1}^{N} \sum_{c=1}^{C} \hat{l}_t^{jc} \log \hat{l}_t^{jc}$ (45)
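As a concrete rendering of the two classifier-side terms, the following is a short PyTorch sketch of Equations (44) and (45), assuming `logits_s` and `logits_t` are the raw classifier outputs for the source and target batches; the helper names are ours.

```python
import torch
import torch.nn.functional as F

def cls_ls_loss(logits_s, ys, alpha, num_classes):
    # label-smoothed cross-entropy, Eq. (44)
    logp = F.log_softmax(logits_s, dim=1)
    y_onehot = F.one_hot(ys, num_classes).float()
    y_smooth = (1 - alpha) * y_onehot + alpha / num_classes
    return -(y_smooth * logp).sum(dim=1).mean()

def ent_loss(logits_t):
    # predicted-label entropy of the target samples, Eq. (45)
    p = F.softmax(logits_t, dim=1)
    return -(p * torch.log(p.clamp_min(1e-8))).sum(dim=1).mean()
```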
Algorithm 4 Training the MCWMMD model
Input: $\Delta_s, \Delta_t, \beta_1, \beta_2, \eta_2$;
Initialize the parameters $\theta_F$ and $\theta_C$;
# train the model parameters $\theta_F$ and $\theta_C$ on $\Delta_s$ and $\Delta_t$;
repeat until convergence
   $(X_s, Y_s) = \{(x_s^1, y_s^1), (x_s^2, y_s^2), \ldots, (x_s^N, y_s^N)\} \leftarrow$ mini-batch from $\Delta_s$;
   $X_t = \{x_t^1, x_t^2, \ldots, x_t^N\} \leftarrow$ mini-batch from $\Delta_t$;
   $Z_s \leftarrow F(X_s)$; $Z_t \leftarrow F(X_t)$;
   # generate pseudo labels:
   $\hat{L}_t = \{\hat{l}_t^1, \hat{l}_t^2, \ldots, \hat{l}_t^N\} \leftarrow C(F(X_t))$;  # classify target samples
   $Y_t = \{y_t^1, y_t^2, \ldots, y_t^N\} \leftarrow \{\mathrm{psd}(\hat{l}_t^1), \mathrm{psd}(\hat{l}_t^2), \ldots, \mathrm{psd}(\hat{l}_t^N)\}$;  # obtain pseudo labels
   # $\mathrm{psd}((v_1, v_2, \ldots, v_C)) = \arg\max_{1 \le c \le C} v_c$;
   # evaluate losses:
   $\mathcal{L}_{mtdcwmmd}(X_s, X_t) \leftarrow \beta' \cdot \sum_{c=1}^{C} S_{st}^{intra_c} + \sum_{c=0}^{C} M_{mt}^c$;  # using (41)
   $\mathcal{L}_{cls\text{-}ls} \leftarrow -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \left[ (1-\alpha) y_s^{ic} + \alpha/C \right] \log \hat{l}_s^{ic}$;  # using (44)
   $\mathcal{L}_{ent} \leftarrow -\frac{1}{N} \sum_{j=1}^{N} \sum_{c=1}^{C} \hat{l}_t^{jc} \log \hat{l}_t^{jc}$;  # using (45)
   $\mathcal{L}_{total} \leftarrow \mathcal{L}_{mtdcwmmd} + \beta_1 \mathcal{L}_{cls\text{-}ls} + \beta_2 \mathcal{L}_{ent}$;  # using (42)
   # update $\theta_F$ and $\theta_C$ to minimize $\mathcal{L}_{total}$:
   $\theta_F \leftarrow \theta_F - \eta_2 \nabla_{\theta_F} \mathcal{L}_{total}$;
   $\theta_C \leftarrow \theta_C - \eta_2 \nabla_{\theta_C} \mathcal{L}_{total}$;
end repeat
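Putting the pieces together, here is a condensed sketch of one Algorithm 4 iteration in PyTorch, reusing the hypothetical helpers sketched above (`dcwmmd_loss` standing in for Equation (41) with the meta-learned $M_{mt}^c$, plus `cls_ls_loss` and `ent_loss`); `Fe` is the feature extractor $F$, `C` is the classifier, `kern` is the learned deep kernel, and `opt_FC` is a joint optimizer over $\theta_F$ and $\theta_C$. All of these names are ours.

```python
# one training iteration (cf. Algorithm 4); all helper names are ours
logits_t = C(Fe(xt))                       # classify target samples
yt_pseudo = logits_t.argmax(dim=1)         # psd(.): arg-max pseudo labels
mmd2 = lambda a, b: mmd2_u(a, b, kern)     # squared MMD under the deep kernel
loss = (dcwmmd_loss(Fe(xs), Fe(xt), ys, yt_pseudo,
                    mmd2, s_intra, beta_prime, num_classes)
        + beta1 * cls_ls_loss(C(Fe(xs)), ys, alpha, num_classes)
        + beta2 * ent_loss(logits_t))      # total loss, Eq. (42)
opt_FC.zero_grad()
loss.backward()                            # gradients for theta_F and theta_C
opt_FC.step()
```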

4. Experimental Results

This section presents a comprehensive evaluation of the proposed MCWMMD approach on standard UDA datasets for image classification tasks. The details of the data preparation process are outlined in Section 4.1, while the experimental setup, including model configurations and parameters, is discussed in Section 4.2. Finally, Section 4.3 provides the experimental results and comparisons with baseline methods to demonstrate the effectiveness of the proposed approach.

4.1. Data Preparation

The proposed approach was evaluated on both digit and office object datasets. The digit datasets used in this study were the MNIST (Modified National Institute of Standards and Technology) database [], USPS (U.S. Postal Service) [], and SVHN (Street View House Numbers) []. MNIST and USPS consist of grayscale images of handwritten digits: MNIST offers 60,000 training samples and 10,000 testing samples, while USPS comprises 9298 images, divided into 7291 training and 2007 testing samples. In contrast, SVHN provides 73,257 color training images and 26,032 testing images depicting digits captured in a street-view context. Figure 4 shows sample images from MNIST, USPS, and SVHN, with training samples highlighted in blue.
Figure 4. Digit data: (a) MNIST, (b) USPS, and (c) SVHN.
For the office object datasets, we used Office-31 [] and Office-Home []. The Office-31 dataset consists of 4652 images across 31 common office object categories, collected from three distinct domains: Amazon (A), which contains images from online merchants; DSLR (D), with high-resolution images taken using a digital SLR camera; and Webcam (W), featuring low-resolution images captured using a web camera. The Office-Home dataset introduces a more complex domain shift, with four distinct domains: Art (Ar), Clipart (Cl), Product (Pr), and Real World (Rw). These span 65 object categories and approximately 15,500 images, each domain offering varied visual styles. Figure 5 and Figure 6 provide sample images from the Office-31 and Office-Home datasets, respectively.
Figure 5. Office-31 data: (a) Webcam, (b) DSLR, and (c) Amazon.
Figure 6. Office-Home data.

4.2. Experimental Setting

An initial learning rate of 0.001 was used for all experiments, decayed by a factor of 0.1 every 10 epochs. The batch size was set to 128 for the digit datasets and 64 for the office object datasets. The Adam optimizer was used with parameters β1 = 0.99 and β2 = 0.999; it was chosen for its ability to handle sparse gradients. Training lasted for 50 epochs on the digit datasets and 100 epochs on the office object datasets to ensure convergence. A regularization coefficient of 0.0005 was applied to prevent overfitting. The Gaussian kernel used in the MMD calculations had an initial bandwidth of 1.0, dynamically optimized through the meta-learning framework. At the beginning of each epoch, pseudo-labels for all target domain training data were generated based on the current classifier parameters. This iterative process helped refine domain alignment while maintaining computational efficiency.
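For reproducibility, the stated settings map onto PyTorch roughly as follows; the wiring is our reconstruction, and only the hyperparameter values come from the text above.

```python
import torch

# joint optimizer over the feature extractor Fe and classifier C (our names)
params = list(Fe.parameters()) + list(C.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3,
                             betas=(0.99, 0.999),  # beta1, beta2 as stated
                             weight_decay=5e-4)    # 0.0005 regularization, assumed weight decay
# decay the learning rate by 0.1 every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
```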
Experiments were conducted on a server equipped with NVIDIA RTX 2080 GPUs (manufactured by NVIDIA Corporation, Santa Clara, CA, USA) and 256 GB of system RAM (provided by ADATA, Taiwan). The implementation was carried out using Python with the PyTorch deep learning library (version 1.8), along with NumPy and SciPy for data preprocessing and statistical computations.

4.3. Results

ResNet-18 and ResNet-50 [] were employed as the network architectures for feature extraction from the digit and office object datasets, respectively. Both models were fine-tuned using pre-trained ImageNet parameters. The performance of the proposed method was evaluated on the above-mentioned datasets: digit datasets, Office-31, and Office-Home. For the digit datasets, we tested domain adaptation between pairs such as MNIST to USPS (M → U), USPS to MNIST (U → M), and SVHN to MNIST (S → M). In the Office-31 dataset, we examined six domain adaptation pairs (e.g., Amazon to DSLR (A → D) and Webcam to DSLR (W → D)). For the Office-Home dataset, we created 12 domain adaptation pairs across four domains (Art, Clipart, Product, and Real-World), including examples like Art to Clipart (Ar → Cl), Product to Real-World (Pr → Rw), and so on.
Table 2 compares our method with several domain adaptation techniques on the digit datasets, including ADDA [], ADR [], CDAN [], CyCADA [], SWD [], SHOT [], and our previous work, DCWMMD []. Table 3 provides a comparison on the Office-31 dataset, including the method by Wang et al. [], DAN [], DANN [], ADDA, MADA [], SHOT, CAN [], MDGE [], DACDM [], CDCL [], DMP [], and DCWMMD []. Table 4 compares results on the Office-Home dataset with the method by Wang et al., DAN, DACDM [], DMP [], and DCWMMD. Please note that the results are directly referenced from the published papers, and the bold numbers in the tables indicate the best-performing accuracy for each source-to-target combination. The "Source-only" row represents a classifier trained solely on source data, while "Target-supervised" denotes a classifier trained and tested on target domain data; these typically serve as the lower and upper bounds for domain adaptation performance.
Table 2. Accuracies (%) of several approaches on some digit datasets.
Table 3. Accuracies (%) for domain adaptation experiments on the Office-31 dataset.
Table 4. Accuracies (%) for domain adaptation experiments on the Office-Home dataset.
In Table 2, our method achieves an average accuracy of 98.60% across digit datasets, outperforming other methods and closely approaching the target-supervised scenario. This highlights the robustness of our approach in aligning domain distributions and achieving class-wise alignment. Table 3 presents the results of the Office-31 dataset, where our method achieved an average accuracy of 91.62%, consistently outperforming other unsupervised adaptation methods and closely matching the target-supervised benchmark. This result underscores the effectiveness of our Class-Wise MMD optimization method in adapting complex, real-world data. In Table 4, our method achieves an average accuracy of 75.43% on the Office-Home dataset, a challenging multi-domain setting with diverse visual characteristics. These results highlight the adaptability and robustness of our approach as it generalizes effectively across multiple domains and significantly closes the gap with the target-supervised benchmark. This performance demonstrates our method’s capability to handle complex domain shifts while maintaining high accuracy across diverse visual domains.
t-SNE (t-distributed Stochastic Neighbor Embedding) [] is a nonlinear dimensionality reduction technique commonly used to visualize high-dimensional data in a lower-dimensional space (typically 2D or 3D). By preserving local structures within the data, t-SNE excels in representing clusters and relationships, making it particularly useful for visualizing stochastic settings and complex data distributions. In this study, we employ t-SNE to visualize the feature representations learned by our model for both the source and target domains, highlighting the effectiveness of the proposed domain adaptation approach.
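For reference, Figure 7-style plots can be reproduced along these lines with scikit-learn, assuming `feats_s` and `feats_t` hold the extracted source and target feature arrays (our names):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# embed the concatenated source and target features into 2D
emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(
    np.concatenate([feats_s, feats_t]))
n = len(feats_s)
plt.scatter(emb[:n, 0], emb[:n, 1], c="red", s=3, label="source")
plt.scatter(emb[n:, 0], emb[n:, 1], c="blue", s=3, label="target")
plt.legend()
plt.show()
```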
Columns (a) and (b) of Figure 7 depict the distributions of the source features and target features, respectively, with different digits represented by distinct colors: the 10 colors correspond to the digits 0 through 9, so that each digit is clearly differentiated in the visualization. Column (c) of Figure 7 provides an integrated view of both distributions to highlight their alignment. As observed, the source and target features are well aligned, demonstrating the improved feature alignment achieved by our method over the baseline.
Figure 7. t-SNE visualization of three tasks on digit datasets: (a) source, (b) target, and (c) source (red color) + target (blue color) (best viewed in color).

5. Discussion and Conclusions

The proposed method, MeTa Discriminative Class-Wise MMD (MCWMMD), represents a significant advancement in unsupervised domain adaptation by integrating meta-learning with a Class-Wise Maximum Mean Discrepancy (MMD) approach. While traditional MMD methods align overall distributions between source and target domains, they often fail to achieve precise class-wise alignment, reducing feature distinguishability and generalization performance. MCWMMD addresses these limitations by introducing dynamic kernel adaptability and a focus on class-wise alignment, resulting in robust and domain-invariant representations.
A key innovation of MCWMMD is its dynamic kernel adaptability, achieved through a meta-module that adjusts kernel parameters based on class-specific features. This enables more precise domain alignment compared to traditional static kernels, significantly enhancing alignment and generalization. The alternating training process between the feature extractor and the meta-module, inspired by adversarial training, further refines the model’s ability to handle complex domain shifts. However, this adaptability introduces computational complexity, which could be a limitation for time-sensitive applications. Future research could explore simplifying the meta-module to reduce overhead while preserving adaptability.
The method’s class-wise alignment approach applies MMD in a class-specific manner, ensuring that each class is individually aligned between source and target domains. This produces compact, domain-invariant, and class-discriminative feature clusters, ultimately improving cross-domain classification performance. However, its reliance on accurate pseudo-labels for class-wise alignment may lead to errors when the pseudo-label quality is low. Developing robust pseudo-labeling strategies is a crucial direction for future research.
MCWMMD is also optimized for scalability through efficient batch processing and streamlined meta-module training, enabling practical application to large datasets without compromising alignment accuracy. However, scaling it to extremely large or dynamically evolving datasets remains a challenge. Future work could investigate distributed or online learning paradigms to extend the method's applicability to these scenarios.
Despite its strengths, MCWMMD involves complex meta-module training and adversarial-like processes, which may pose implementation challenges, particularly for practitioners with limited computational resources. Further validation in real-world scenarios with highly diverse and complex domain shifts is also needed. Promising future directions include simplifying the meta-module for enhanced accessibility, improving pseudo-labeling mechanisms, and extending the method to handle online and incremental domain adaptation for dynamic datasets. Additionally, exploring cross-domain generalization to unseen categories or settings could further enhance the method’s adaptability.
In summary, MCWMMD advances unsupervised domain adaptation by combining meta-learning with a Class-Wise MMD approach, addressing the limitations of traditional techniques. Its dynamic kernel adaptability and focus on class-wise alignment enable robust feature alignment and generalization. While challenges such as computational complexity and reliance on pseudo-labels remain, MCWMMD provides a strong foundation for future innovations, paving the way for more adaptable and generalizable deep learning models.

Author Contributions

Conceptualization, H.-W.L., C.-T.T. and H.-J.L.; Methodology, H.-W.L.; Software, T.-T.H. and C.-T.T.; Validation, H.-W.L. and C.-T.T.; Formal Analysis, H.-J.L.; Investigation, T.-T.H. and H.-W.L.; Resources, T.-T.H.; Data Curation, C.-H.Y.; Writing—Original Draft Preparation, H.-J.L.; Writing—Review and Editing, H.-W.L.; Visualization, C.-H.Y.; Supervision, H.-W.L.; Project Administration, H.-J.L.; Funding Acquisition, H.-J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Science and Technology Council, Taiwan, R.O.C., under grant NSTC 113-2221-E-032-020.

Data Availability Statement

(1) SVHN dataset []: available online: https://www.openml.org/search?type=data&sort=runs&id=41081&status=active (accessed on 1 March 2024). (2) Office-31 dataset []: introduced by Kate Saenko et al. in Adapting Visual Category Models to New Domains. (3) Office-Home dataset []: available online: https://www.hemanthdv.org/officeHomeDataset.html (accessed on 1 March 2024). (4) t-SNE []: available online: http://www.jmlr.org/papers/v9/vandermaaten08a.html (accessed on 1 March 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lin, Y.; Chen, J.; Cao, Y.; Zhou, Y.; Zhang, L.; Tang, Y.Y.; Wang, S. Cross-domain recognition by identifying joint subspaces of source domain and target Domain. IEEE Trans. Cybern. 2017, 47, 1090–1101. [Google Scholar] [CrossRef]
  2. Khan, S.; Guo, Y.; Ye, Y.; Li, C.; Wu, Q. Mini-batch dynamic geometric embedding for unsupervised domain adaptation. Neural Process. Lett. 2023, 55, 2063–2080. [Google Scholar] [CrossRef]
  3. Zhang, W.; Ouyang, W.; Li, W.; Xu, D. Collaborative and adversarial network for unsupervised domain adaptation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
  4. Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; Lempitsky, V. Domain adversarial training of neural networks. J. Mach. Learn. Res. 2016, 17, 1–35. [Google Scholar] [CrossRef]
  5. Tzeng, E.; Hoffman, J.; Saenko, K.; Darrell, T. Adversarial discriminative domain adaptation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2962–2971. [Google Scholar] [CrossRef]
  6. Long, M.; Cao, Y.; Wang, J.; Jordan, M.I. Learning transferable features with deep adaptation networks. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, Lille, France, 6–11 July 2015; Volume 37, pp. 97–105. [Google Scholar]
  7. Vettoruzzo, A.; Bouguelia, M.-R.; Rögnvaldsson, T.S. Meta-learning for efficient unsupervised domain adaptation. Neurocomputing 2024, 574, 127264. [Google Scholar] [CrossRef]
  8. Liu, X.; Yoo, C.; Xing, F.; Oh, H.; El Fakhri, G.; Kang, J.-W.; Woo, J. Deep Unsupervised Domain Adaptation: A Review of Recent Advances and Perspectives. arXiv 2022, arXiv:2208.07422v1. [Google Scholar] [CrossRef]
  9. Zhang, Y.; Chen, S.; Jiang, W.; Zhang, Y.; Lu, J.; Kwok, J.T. Domain-Guided Conditional Diffusion Model for Unsupervised Domain Adaptation. arXiv 2023, arXiv:2309.14360v1. [Google Scholar] [CrossRef] [PubMed]
  10. Wang, R.; Wu, Z.; Weng, Z.; Chen, J.; Qi, G.-J.; Jiang, Y.-G. Cross-Domain Contrastive Learning for Unsupervised Domain Adaptation. arXiv 2022, arXiv:2106.05528v2. [Google Scholar] [CrossRef]
  11. Luo, Y.-W.; Ren, C.-X.; Dai, D.-Q.; Yan, H. Unsupervised Domain Adaptation via Discriminative Manifold Propagation. IEEE Trans. Pattern Anal. Machine Intell. 2022, 44, 1653–1669. [Google Scholar] [CrossRef] [PubMed]
  12. Hoffman, J.; Tzeng, E.; Park, T.; Zhu, J.Y.; Isola, P.; Saenko, K. Cycada: Cycle-consistent adversarial domain adaptation. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1989–1998. [Google Scholar]
  13. Saito, K.; Watanabe, K.; Ushiku, Y.; Harada, T. Maximum Classifier Discrepancy for Unsupervised Domain Adaptation. arXiv 2018, arXiv:1712.02560v4. [Google Scholar] [CrossRef]
  14. Chen, Y.; Li, W.; Sakaridis, C.; Dai, D.; Van Gool, L. Domain adaptive faster R-CNN for object detection in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3339–3348. [Google Scholar]
  15. Si, S.; Tao, D.; Geng, B. Bregman divergence based regularization for transfer subspace learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 929–942. [Google Scholar] [CrossRef]
  16. Blitzer, J.; Crammer, K.; Kulesza, A.; Pereira, F.; Wortman, J. Learning bounds for domain adaptation. In Proceedings of the Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 3–6 December 2007; pp. 129–136. [Google Scholar]
  17. Ding, Z.; Fu, Y. Robust transfer metric learning for image classification. IEEE Trans. Image Process. 2017, 26, 660–670. [Google Scholar] [CrossRef]
  18. Gretton, A.; Borgwardt, K.; Rasch, M.; Schölkopf, B.; Smola, A. A kernel two-sample test. J. Mach. Learn. Res. 2012, 13, 723–773. [Google Scholar]
  19. Pan, S.J.; Tsang, I.W.; Kwok, J.T.; Yang, Q. Domain adaptation via transfer component analysis. IEEE Trans. Neural Netw. 2011, 22, 199–210. [Google Scholar] [CrossRef] [PubMed]
  20. Song, L.; Gretton, A.; Bickson, D.; Low, Y.; Guestrin, C. Kernel belief propagation. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 707–715. [Google Scholar]
  21. Park, M.; Jitkrittum, W.; Sejdinovic, D. K2-ABC: Approximate bayesian computation with kernel embeddings. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, Cadiz, Spain, 9–11 May 2016; pp. 398–407. [Google Scholar]
  22. Li, Y.; Swersky, K.; Zemel, R.S. Generative moment matching networks. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1718–1727. [Google Scholar]
  23. Liu, F.; Xu, W.; Lu, J.; Zhang, G.; Gretton, A.; Sutherland, D. Learning deep kernels for non-parametric two-sample tests. arXiv 2020, arXiv:2002.09116. [Google Scholar]
  24. Long, M.; Wang, J.; Ding, G.; Sun, J.; Yu, P.S. Transfer feature learning with joint distribution adaptation. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; IEEE Computer Society: Sydney, Australia, 2013; pp. 2200–2207. [Google Scholar]
  25. Wang, W.; Li, H.; Ding, Z.; Wang, Z. Rethink maximum mean discrepancy for domain adaptation. arXiv 2020, arXiv:2007.00689. [Google Scholar] [CrossRef]
  26. Devroye, L.; Lugosi, G. Combinatorial Methods in Density Estimation; Springer: New York, NY, USA, 2001. [Google Scholar] [CrossRef]
  27. Baraud, Y.; Birgé, L. Rho-estimators revisited: General theory and applications. Ann. Statist. 2018, 46, 3767–3804. [Google Scholar] [CrossRef]
  28. Lin, H.-W.; Tsai, Y.; Lin, H.J.; Yu, C.-H.; Liu, M.-H. Unsupervised domain adaptation deep network based on discriminative class-wise MMD. AIMS Math. 2024, 9, 6628–6647. [Google Scholar] [CrossRef]
  29. Andrychowicz, M.; Denil, M.; Colmenarejo, S.G.; Hoffman, M.W.; Pfau, D.; Schaul, T.; de Freitas, N. Learning to learn by gradient descent by gradient descent. arXiv 2016, arXiv:1606.04474v2. [Google Scholar] [CrossRef]
  30. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017. [Google Scholar]
  31. Aronszajn, N. Theory of reproducing kernels. Trans. Am. Math. Soc. 1950, 68, 337–404. [Google Scholar] [CrossRef]
  32. Zheng, S.; Ding, C.; Nie, F.; Huang, H. Harmonic mean linear discriminant analysis. IEEE Trans. Knowl. Data Eng. 2019, 31, 1520–1531. [Google Scholar] [CrossRef]
  33. Borgwardt, K.M.; Gretton, A.; Rasch, M.J.; Kriegel, H.-P.; Schölkopf, B.; Smola, A.J. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 2006, 22, e49–e57. [Google Scholar] [CrossRef]
  34. Gao, W.; Shao, M.; Shu, J.; Zhuang, X. Meta-BN Net for few-shot learning. Front. Comput. Sci. 2023, 17, 171302. [Google Scholar] [CrossRef]
  35. Bechtle, S.; Molchanov, A.; Chebotar, Y.; Grefenstette, E.; Righetti, L.; Sukhatme, G.; Meier, F. Meta-learning via learned loss. arXiv 2019, arXiv:1906.05374. [Google Scholar]
  36. Müller, R.; Kornblith, S.; Hinton, G. When does label smoothing help? In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  37. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  38. Hull, J.J. A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Machine Intell. 1994, 16, 550–555. [Google Scholar] [CrossRef]
  39. SVHN Dataset. Available online: https://www.openml.org/search?type=data&sort=runs&id=41081&status=active (accessed on 1 March 2024).
  40. Saenko, K.; Kulis, B.; Fritz, M.; Darrell, T. Adapting visual category models to new domains. In Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2010; Volume 6314, pp. 213–226. [Google Scholar] [CrossRef]
  41. Office-Home Dataset. Available online: https://www.hemanthdv.org/officeHomeDataset.html (accessed on 1 March 2024).
  42. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  43. Saito, K.; Ushiku, Y.; Harada, T.; Saenko, K. Adversarial dropout regularization. arXiv 2018, arXiv:1711.01575. [Google Scholar] [CrossRef]
  44. Long, M.; Cao, Z.; Wang, J.; Jordan, M.I. Conditional adversarial domain adaptation. In Proceedings of the 32nd Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 1647–1657. [Google Scholar]
  45. Lee, C.Y.; Batra, T.; Baig, M.H.; Ulbricht, D. Sliced wasserstein discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10285–10295. [Google Scholar]
  46. Liang, J.; Hu, D.; Feng, J. Do we really need to access the source data? Source hypothesis transfer for unsupervised domain adaptation. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020; Volume 119, pp. 6028–6039. [Google Scholar]
  47. Pei, Z.; Cao, Z.; Long, M.; Wang, J. Multi-adversarial domain adaptation. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar] [CrossRef]
  48. van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. Available online: http://www.jmlr.org/papers/v9/vandermaaten08a.html (accessed on 1 March 2024).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
