Article

MeTa Learning-Based Optimization of Unsupervised Domain Adaptation Deep Networks

by Hsiau-Wen Lin 1,*, Trang-Thi Ho 2, Ching-Ting Tu 3,*, Hwei-Jen Lin 2,* and Chen-Hsiang Yu 4
1 Department of Information Management, Chihlee University of Technology, Taipei 220305, Taiwan
2 Department of Computer Science and Information Engineering, Tamkang University, Taipei 251301, Taiwan
3 Department of Applied Mathematics, National Chung Hsing University, Taichung 402202, Taiwan
4 Multidisciplinary Graduate Engineering, College of Engineering, Northeastern University, Boston, MA 02115, USA
* Authors to whom correspondence should be addressed.
Mathematics 2025, 13(2), 226; https://doi.org/10.3390/math13020226
Submission received: 12 December 2024 / Revised: 5 January 2025 / Accepted: 6 January 2025 / Published: 10 January 2025

Abstract: This paper introduces a novel unsupervised domain adaptation (UDA) method, MeTa Discriminative Class-Wise MMD (MCWMMD), which combines meta-learning with a Class-Wise Maximum Mean Discrepancy (MMD) approach to enhance domain adaptation. Traditional MMD methods align overall distributions but struggle with class-wise alignment, reducing feature distinguishability. MCWMMD incorporates a meta-module to dynamically learn a deep kernel for MMD, improving alignment accuracy and model adaptability. This meta-learning technique enhances the model’s ability to generalize across tasks by ensuring domain-invariant and class-discriminative feature representations. Despite the complexity of the method, including the need for meta-module training, it presents a significant advancement in UDA. Future work will explore scalability in diverse real-world scenarios and further optimize the meta-learning framework. MCWMMD offers a promising solution to the persistent challenge of domain adaptation, paving the way for more adaptable and generalizable deep learning models.

1. Introduction

The success of deep learning relies heavily on large annotated datasets. However, annotating a substantial number of images with object content is a time-consuming and labor-intensive task. The advent of Generative Adversarial Networks (GANs) [1] has partially alleviated this issue, facilitating advancements in deep learning by enabling the creation of synthetic data. Despite this progress, existing learning algorithms often struggle with limited generalization across different datasets—a challenge known as domain adaptation (DA). Traditional recognition tasks typically assume that training data (source domain) and testing data (target domain) share a common distribution. In practice, this assumption rarely holds, as test data can come from diverse sources and modalities, leading to poor generalization and the phenomenon known as domain shift.
Various methods have been proposed to tackle domain adaptation [2,3,4,5,6], focusing mainly on aligning feature distributions between domains by measuring and minimizing differences. Another approach in UDA leverages meta-learning to generalize across new, unlabeled domains by learning adaptable representations. For instance, Vettoruzzo et al. [7] proposed a meta-learning framework that optimizes model parameters to achieve effective adaptation across domains with minimal labeled data, showing strong adaptability even with limited unlabeled test samples. This method emphasizes efficient domain adaptation, leveraging knowledge from prior domains to improve generalization under distribution shifts. Recent advancements in deep unsupervised domain adaptation (UDA) have introduced more sophisticated strategies. For instance, a comprehensive 2022 review [8] examined developments such as feature alignment, self-supervision, and representation learning, highlighting current trends and future directions. A 2023 approach employing domain-guided conditional diffusion models [9] demonstrated enhanced transfer performance by generating synthetic samples for the target domain, thus bridging domain gaps more effectively. Additionally, cross-domain contrastive learning [10] has shown promise in promoting domain-invariant features by minimizing feature distances across domains, and manifold-based techniques like Discriminative Manifold Propagation [11] have leveraged probabilistic criteria and metric alignment to achieve both transferability and discriminability.
Domain-Adversarial Neural Networks (DANNs) [4] introduced adversarial training with a gradient reversal layer, laying the groundwork for adversarial domain adaptation approaches. ADDA (Adversarial Discriminative Domain Adaptation) [5] further improved this framework by incorporating untied weight sharing for flexible feature alignment. Deep Adaptation Networks (DANs) [6] employed Maximum Mean Discrepancy (MMD) for kernel-based feature alignment, establishing an influential precedent in UDA. Techniques such as CyCADA [12] combined pixel-level and feature-level adaptations to comprehensively mitigate domain shifts, while MCD (Maximum Classifier Discrepancy) [13] used classifier-based discrepancy maximization to enhance target domain adaptation.
A significant challenge in domain adaptation lies in effectively measuring these distances [2,14]. Classical metrics such as Quadratic [15], Kullback–Leibler [16], and Mahalanobis [17] distances often lack flexibility and fail to generalize across models. Maximum Mean Discrepancy (MMD) [18], which embeds distribution metrics within a Reproducing Kernel Hilbert Space, has gained traction due to its robust theoretical foundation and application in various settings, such as transfer learning [19], kernel Bayesian inference [20], approximate Bayesian computation [21], and MMD GANs [22]. Despite its simplicity, selecting the optimal bandwidth for Gaussian kernels in MMD remains challenging. Liu et al. [23] addressed this by introducing a parameterized deep kernel, known as Maximum Mean Discrepancy with a Deep Kernel (MMDDK), which adapts kernel parameters for more precise domain alignment.
MMD effectively aligns overall domain distributions but struggles with precise class-wise feature alignment. Long et al. [24] addressed this by proposing Class-Wise Maximum Mean Discrepancy (CWMMD), which maps samples from both domains into a shared space and calculates the MMD for each category, summing them to derive the CWMMD. However, these approaches often involve linear transformations, which may not capture complex relationships needed for deeper alignment. Wang et al. [25] provided insights into the MMD’s theoretical foundations, highlighting its role in extracting shared semantic features across diverse categories while maximizing intra-class distances between source and target domains. This approach, however, reduced feature discriminativeness and relied on linear transformations with L2 norm estimations, which may not suffice for general, nonlinear relationships [26,27]. In contrast, deep neural networks, particularly convolutional neural networks (CNNs), excel at learning expressive, nonlinear transformations. Our previous work [28] proposed training a CNN architecture to automatically learn task-specific feature representations.
Meta-learning, or “learning to learn”, has gained attention for its ability to rapidly adapt to new tasks [29,30]. This proposal introduces a novel UDA method that leverages a class-wise, deep kernel-based MMD, optimized through meta-learning. This approach aims to enhance the adaptability and performance of UDA models by incorporating flexible, data-driven kernel learning mechanisms.
The contributions of this paper are summarized as follows: (1) It presents the development of the novel MCWMMD framework, which combines meta-learning with a Class-Wise MMD approach, specifically enhancing class-wise distribution alignment for unsupervised domain adaptation (UDA). (2) It introduces a meta-module that dynamically learns a deep kernel, optimizing domain alignment by adapting to the unique characteristics of each class distribution. (3) It provides a demonstration of improved cross-domain recognition performance, validated through extensive experiments on diverse benchmark datasets, showcasing the framework’s adaptability and effectiveness.

2. Related Work and Key Concepts

This section delves into the foundations and advancements of the Maximum Mean Discrepancy (MMD) metric, a widely used method for measuring the difference between distributions in domain adaptation tasks. We review the evolution of MMD, discussing its theoretical underpinnings, variations, and applications across different models. Additionally, we explore how recent research has extended the MMD to address more complex distributional challenges, including conditional and joint distributions, and we highlight the limitations that these methods seek to overcome. This study considers only two domains for domain adaptation, one source domain and one target domain. $X_s$ and $X_t$ denote the sample sets from the source domain and the target domain, respectively, and $X$ (or $X_{st}$) denotes the union of the sample sets from both domains, i.e., $X = X_{st} = X_s \cup X_t$. More symbols and notations are presented in a nomenclature table provided in Table 1.

2.1. Domain Adaptation

In machine learning, domain adaptation (DA) is a subfield of transfer learning that focuses on the scenario where there is a significant difference between the data distribution of the training set (source domain) and the test set (target domain). The goal of domain adaptation is to adapt a model trained on the source domain so that it performs well on the target domain despite the differences in data distributions.
The source domain $\Delta_s$ is the domain from which we have access to labeled data. Let $X_s = \{(x_s^i, y_s^i)\}_{i=1}^{m}$ denote the set of $m$ labeled data points from the source domain $\Delta_s$, where $x_s^i$ represents the $i$-th data point, and $y_s^i$ is the corresponding label indicating the class to which $x_s^i$ belongs. The label $y_s^i$ belongs to a set of predefined class labels $C = \{1, \dots, C\}$. The target domain $\Delta_t$ is the domain to which we want to apply the learned model, but where we only have access to unlabeled data. Let $X_t = \{x_t^j\}_{j=1}^{n}$ denote the set of $n$ unlabeled data points from the target domain $\Delta_t$. Each data point $x_t^j$ belongs to one of the classes in $C$, but its corresponding label $y_t^j$ is not observed during training. The source and target domains share a common set of class labels $C = \{1, \dots, C\}$. This implies that, theoretically, the same classes exist in both domains, but the way these classes are represented (i.e., the data distribution) may differ. For instance, the source domain might consist of high-resolution images, while the target domain could consist of lower-resolution images or images taken under different lighting conditions. This distributional difference between the domains poses significant challenges for traditional machine learning models, which typically assume that the training and test data are drawn from the same distribution. To address this challenge, domain adaptation techniques often involve aligning the data distributions between the source and target domains by transforming the feature space or modifying the learning algorithm. One effective method for this is Maximum Mean Discrepancy (MMD), which minimizes the distance between the distributions of the source and target domains in a common latent space. By reducing this distribution shift, MMD helps improve the model’s generalization ability on the target domain, making it a crucial technique for successful domain adaptation.

2.2. RKHS, Kernels, and the Kernel Trick

A Reproducing Kernel Hilbert Space (RKHS) [31] is a powerful mathematical framework widely used in kernel-based learning algorithms. In an RKHS, every function $f$ can be represented as an inner product involving a kernel function $k$, which serves as a measure of similarity between data points. Specifically, for any function $f$ in the RKHS and any point $x$, the value of $f$ at $x$ can be represented as shown in Equation (1), where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ denotes the inner product in the RKHS, and $k(x, \cdot)$ is the kernel function centered at $x$.
$f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}}$  (1)
The kernel function k x , y implicitly maps data into a high-dimensional feature space, enabling the capture of complex relationships that may not be apparent in the original lower-dimensional space. A widely used kernel is the Gaussian (or RBF) kernel, defined as shown in Equation (2), where σ is a parameter that controls the width of the kernel. The Gaussian kernel measures the similarity between two points, x and y , based on their distance.
$k(x, y) = \exp\left(-\dfrac{\|x - y\|_2^2}{2\sigma^2}\right)$  (2)
The kernel trick is a crucial technique that enables efficient computation in high-dimensional spaces without explicitly performing the mapping. This trick leverages the kernel function to compute the inner product between two points in the feature space $\mathcal{H}$ directly in the input space $X$, without needing to know the explicit form of the mapping $\varphi(\cdot)$. For example, let $\varphi(x)$ and $\varphi(y)$ be the mappings of data points $x$ and $y$ into the feature space. The inner product of $\varphi(x)$ and $\varphi(y)$ can be evaluated directly as shown in Equation (3), where $k(x, y)$ is the kernel function. This means that the inner product of two elements in the high-dimensional feature space can be evaluated directly in the original input space using the kernel function, such as the Gaussian kernel given in Equation (2).
$\langle \varphi(x), \varphi(y) \rangle_{\mathcal{H}} = k(x, y)$  (3)
By utilizing the kernel trick, algorithms can efficiently handle nonlinear patterns in the data, making RKHS, kernels, and the kernel trick fundamental components of modern machine learning. This approach simplifies the learning process and reduces computational complexity, enabling operations that would typically require high-dimensional computations to be performed directly in the original input space.
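As a concrete illustration of this point, the following minimal NumPy sketch evaluates the Gaussian kernel of Equation (2) directly in the input space; the sample values are arbitrary.

import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # Equation (2): similarity based on the squared Euclidean distance.
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

x = np.array([0.3, -1.2, 0.7])
y = np.array([0.1, -0.9, 1.0])
# Kernel trick (Equation (3)): the inner product of the feature maps phi(x) and
# phi(y) is obtained directly in the input space, without forming phi explicitly.
print(gaussian_kernel(x, y))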

2.3. Maximum Mean Discrepancy (MMD)

Assume that the random samples $X = \{x_1, \dots, x_m\}$ and $Y = \{y_1, \dots, y_n\}$ come from two probability distributions $P$ and $Q$, respectively. The kernel mean embeddings for these distributions are given by $\mu_P = \mathbb{E}_{x \sim P}[\phi(x)]$ and $\mu_Q = \mathbb{E}_{y \sim Q}[\phi(y)]$, where the function $\phi$ maps the samples into a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$. The Maximum Mean Discrepancy (MMD) [18] between $X$ and $Y$ is defined as the difference between these means in the RKHS, as shown in Equation (4), where $\mathcal{F}$ is the set of functions in the unit ball of the universal RKHS. By squaring the MMD, we can use the kernel trick to compute it directly on the samples with a kernel function $k$ without needing the explicit form of $\phi$, as illustrated in Equation (5). The Gaussian kernel shown in Equation (2) is usually used as the kernel function. In practice, for samples $X$ and $Y$, the MMD formula can be adjusted to yield an unbiased estimate, as described in Equation (6).
$\mathrm{MMD}(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{H}}$  (4)
$\mathrm{MMD}^2(P, Q) = \langle \mu_P - \mu_Q,\, \mu_P - \mu_Q \rangle_{\mathcal{H}} = \langle \mu_P, \mu_P \rangle_{\mathcal{H}} + \langle \mu_Q, \mu_Q \rangle_{\mathcal{H}} - 2\langle \mu_P, \mu_Q \rangle_{\mathcal{H}} = \mathbb{E}_{x, x' \sim P}[k(x, x')] + \mathbb{E}_{y, y' \sim Q}[k(y, y')] - 2\,\mathbb{E}_{x \sim P,\, y \sim Q}[k(x, y)]$  (5)
$\mathrm{MMD}_u^2(X, Y) = \dfrac{1}{m(m-1)}\sum_{i \neq j}^{m} k(x_i, x_j) + \dfrac{1}{n(n-1)}\sum_{i \neq j}^{n} k(y_i, y_j) - \dfrac{2}{mn}\sum_{i, j}^{m, n} k(x_i, y_j)$  (6)
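The unbiased estimate in Equation (6) can be computed directly from the kernel (Gram) matrices. The following NumPy sketch illustrates this with a Gaussian kernel; the data, dimensionality, and bandwidth are arbitrary choices for illustration.

import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    # Pairwise Gaussian kernel values k(a_i, b_j) for the rows of A and B.
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    # Unbiased estimate of MMD^2, Equation (6).
    m, n = len(X), len(Y)
    Kxx = gaussian_gram(X, X, sigma)
    Kyy = gaussian_gram(Y, Y, sigma)
    Kxy = gaussian_gram(X, Y, sigma)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 5))   # samples from P
Y = rng.normal(0.5, 1.0, size=(200, 5))   # samples from Q (shifted mean)
print(mmd2_unbiased(X, Y))                # noticeably larger than when P = Q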

2.4. The Maximum Mean Discrepancy with a Deep Kernel

While the Maximum Mean Discrepancy (MMD) defined in a Reproducing Kernel Hilbert Space (RKHS) is a powerful tool for measuring the mean difference between two samples, one of the significant challenges lies in the selection of the bandwidth σ for the Gaussian kernel used in the computation. The choice of σ is crucial as it directly impacts the sensitivity of the MMD to differences in distributions. However, there is no definitive method for optimally selecting this bandwidth, which can limit the effectiveness of the MMD in practice. To address the issue of bandwidth selection, Liu et al. [23] introduced the Maximum Mean Discrepancy with a Deep Kernel (MMDDK), as described in Equation (7). In this approach, F d represents a deep neural network that is employed to extract features from the input data. Within this learned feature space, an inner kernel κ is applied, typically a Gaussian function with bandwidth σ ϕ , as shown in Equation (8). Additionally, an inner kernel q is applied directly in the input space, also using a Gaussian function but with bandwidth σ q , as depicted in Equation (9).
The MMDDK framework innovatively combines these kernels by defining a composite kernel function k ω x , y that integrates both the feature space kernel and the input space kernel. The bandwidth parameters σ ϕ and σ q , the weight ϵ , and the deep network parameters θ d are all jointly optimized through a deep learning approach. This joint optimization allows for adaptive bandwidth selection and improved alignment between the source and target distributions.
The entire MMDDK framework is denoted by $F_\omega$, where $\omega = (\theta_d, \sigma_\phi, \sigma_q, \epsilon)$, encapsulating all the parameters involved in the model. The training process aims to maximize an objective function $J_\lambda$, as shown in Equation (10), which balances the MMD-based discrepancy measure $\mathrm{MMDDK}_u^2$ and the variance $\hat{\sigma}_{1,\lambda}^2$, defined in Equation (11) and Equation (12), respectively. Here, the subscript 1 refers to the alternative hypothesis in a two-sample test, $P \neq Q$, and $\lambda$ is a regularization constant that ensures stability in the optimization process. The function $H_{i,j}$, as defined in Equation (13), calculates the contribution of pairs of samples from both domains, integrating the kernel evaluations across different sample pairs to compute the overall discrepancy. This MMDDK approach addresses the limitations of traditional MMD by allowing for more flexible and adaptive kernel learning, improving the effectiveness of domain adaptation in scenarios where the optimal bandwidth is difficult to determine. The assumption of equal sample sizes in both domains (i.e., $m = n$) simplifies the computations and ensures that the statistical properties of the test remain robust.
$k_\omega(x, y) = (1 - \epsilon)\,\kappa(F_d(x), F_d(y)) + \epsilon\, q(x, y)$  (7)
$\kappa(a, b) = \exp\left(-\dfrac{\|a - b\|_2^2}{2\sigma_\phi^2}\right)$  (8)
$q(a, b) = \exp\left(-\dfrac{\|a - b\|_2^2}{2\sigma_q^2}\right)$  (9)
$J_\lambda(X, Y; k_\omega) = \dfrac{\mathrm{MMDDK}_u^2(X, Y; k_\omega)}{\hat{\sigma}_{1,\lambda}(X, Y; k_\omega)}$  (10)
$\mathrm{MMDDK}_u^2(X, Y; k_\omega) = \dfrac{1}{n(n-1)}\sum_{i \neq j} H_{i,j}$  (11)
$\hat{\sigma}_{1,\lambda}^2 = \dfrac{4}{n^3}\sum_{i=1}^{n}\left(\sum_{j=1}^{n} H_{i,j}\right)^2 - \dfrac{4}{n^4}\left(\sum_{i=1}^{n}\sum_{j=1}^{n} H_{i,j}\right)^2 + \lambda$  (12)
$H_{i,j} = k_\omega(x_i, x_j) + k_\omega(y_i, y_j) - k_\omega(x_i, y_j) - k_\omega(y_i, x_j)$  (13)
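The sketch below illustrates, in PyTorch, how a deep kernel of the form in Equations (7)–(9) and the criterion of Equations (10)–(13) could be assembled. The architecture of F_d, the parameterization of the bandwidths and of ε, and the assumption of equal batch sizes are illustrative choices, not the exact configuration of Liu et al. [23].

import torch
import torch.nn as nn

class DeepKernel(nn.Module):
    # Sketch of the deep kernel of Equations (7)-(9); the network F_d, the
    # bandwidth parameterization and the weight eps are illustrative.
    def __init__(self, in_dim, feat_dim=32):
        super().__init__()
        self.F_d = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, feat_dim))
        self.log_sigma_phi = nn.Parameter(torch.zeros(1))
        self.log_sigma_q = nn.Parameter(torch.zeros(1))
        self.eps_raw = nn.Parameter(torch.zeros(1))

    @staticmethod
    def _gauss(A, B, sigma):
        d2 = torch.cdist(A, B) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))

    def forward(self, X, Y):
        eps = torch.sigmoid(self.eps_raw)            # keep eps in (0, 1)
        kappa = self._gauss(self.F_d(X), self.F_d(Y), self.log_sigma_phi.exp())
        q = self._gauss(X, Y, self.log_sigma_q.exp())
        return (1 - eps) * kappa + eps * q           # Equation (7)

def mmddk_objective(kernel, X, Y, lam=1e-8):
    # Test criterion of Equations (10)-(13), assuming equal sample sizes m = n.
    H = kernel(X, X) + kernel(Y, Y) - kernel(X, Y) - kernel(Y, X)   # Eq. (13)
    n = X.shape[0]
    mmd2 = (H.sum() - H.diag().sum()) / (n * (n - 1))               # Eq. (11)
    var = 4 / n**3 * (H.sum(1) ** 2).sum() - 4 / n**4 * H.sum() ** 2 + lam  # Eq. (12)
    return mmd2 / var.sqrt()                                        # Eq. (10)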

2.5. Class-Wise Maximum Mean Discrepancy

In domain adaptation, the key challenge arises from the differences between the source and target domains in both marginal and conditional distributions. The marginal distribution captures the overall sample distribution within a domain, while the conditional distribution refers to the distribution of samples within specific classes. Although the Maximum Mean Discrepancy (MMD) is a powerful tool for measuring distributional differences, its common application focuses solely on aligning marginal distributions, often neglecting the alignment of samples with the same labels across domains. This can result in suboptimal performance, particularly when the conditional distributions between the source and target domains differ significantly. To address this issue, Long et al. [24] proposed Joint Distribution Adaptation (JDA), which extends the use of MMD to align both marginal and conditional distributions within a shared linear transformation space. JDA aims to generate feature representations that not only bridge the domain gap but are also robust to significant distributional differences.
In JDA, the source domain samples, $X_s \in \mathbb{R}^{d \times n_s}$, and the target domain samples, $X_t \in \mathbb{R}^{d \times n_t}$, are mapped onto a common feature space through a linear orthogonal transformation. Here, $n_s$ and $n_t$ denote the number of samples in the source and target domains, respectively, and $d$ is the dimension of the samples. The transformation matrix $A$, which is of size $d \times K$, maps the original data points into a $K$-dimensional feature space. The transformed data points for the source domain are given by $A^T x_i$, and similarly, for the target domain by $A^T x_j$. The primary objective of this transformation is to minimize the discrepancy between the means of the transformed samples from the source and target domains in this new feature space. The discrepancy is formalized in Equation (14), which represents the MMD they define, where the terms $\frac{1}{n_s}\sum_{x_i \in X_s} A^T x_i$ and $\frac{1}{n_t}\sum_{x_j \in X_t} A^T x_j$ represent the mean vectors of the transformed data points from the source and target domains, respectively. Their goal is to minimize the Euclidean distance between these two mean vectors, which effectively aligns the marginal distributions of the two domains in the new feature space. On the right side of Equation (14), the trace operation $\mathrm{tr}(A^T X_{st} M_0 X_{st}^T A)$ is used to express the squared Euclidean distance between the means in matrix form; the matrix $X_{st} = [X_s \,|\, X_t]$ is the concatenated data matrix containing both the source and target samples, resulting in a matrix of size $d \times (n_s + n_t)$; and the matrix $M_0 \in \mathbb{R}^{n_{st} \times n_{st}}$, defined in Equation (15), is constructed to measure the pairwise relationships between samples in the source and target domains. The elements $(M_0)_{ij}$ define how the relationship between pairs of samples is weighted during the optimization process. When both samples $x_i$ and $x_j$ belong to the source domain ($x_i, x_j \in X_s$), the element $(M_0)_{ij}$ is assigned a positive weight $\frac{1}{n_s n_s}$. Similarly, when both samples belong to the target domain ($x_i, x_j \in X_t$), the weight is $\frac{1}{n_t n_t}$. These positive weights contribute to aligning the means of the samples within each domain. For pairs where one sample is from the source domain and the other from the target domain, $(M_0)_{ij}$ is assigned a negative weight $-\frac{1}{n_s n_t}$. These negative weights are crucial for minimizing the discrepancy between the source and target domain means by penalizing large differences between them. The matrix $M_0$ plays a pivotal role in the optimization objective by guiding the linear transformation $A$ to map the source and target samples into a common feature space where their distributions are aligned. The trace operation in Equation (14) sums the weighted differences across all pairs of samples, driving the minimization process towards the optimal alignment of both marginal and conditional distributions. The JDA method, through the use of the linear transformation matrix $A$ and the carefully constructed matrix $M_0$, effectively addresses the limitations of traditional MMD by jointly aligning both marginal and conditional distributions. This joint alignment is crucial for improving the performance of domain adaptation tasks, particularly in scenarios where the source and target domains exhibit significant distributional differences. The mathematical framework provided by Equations (14) and (15) ensures that the adaptation process considers the complex relationships between the source and target domains, leading to more robust and generalizable models.
$\mathrm{MMD}^2 = \left\| \dfrac{1}{n_s}\sum_{x_i \in X_s} A^T x_i - \dfrac{1}{n_t}\sum_{x_j \in X_t} A^T x_j \right\|_2^2 = \mathrm{tr}(A^T X_{st} M_0 X_{st}^T A)$  (14)
$(M_0)_{ij} = \begin{cases} \dfrac{1}{n_s n_s}, & x_i, x_j \in X_s \\ \dfrac{1}{n_t n_t}, & x_i, x_j \in X_t \\ -\dfrac{1}{n_s n_t}, & \text{otherwise} \end{cases}$  (15)
The challenge of matching conditional distributions (i.e., distributions conditioned on class labels) arises from the difficulty of doing so without labeled data in the target domain. To address this, Long et al. [24] proposed using pseudo labels for the target samples. These pseudo labels can be inferred by applying classifiers trained on the labeled source data to the unlabeled target data. This allows for the approximation of class-conditional distributions in the target domain, enabling the calculation of the discrepancy between class-conditional distributions in the source and target domains. To quantify this discrepancy, Long et al. introduced the Class-Wise Maximum Mean Discrepancy (CWMMD), which modifies the standard MMD to focus on class-conditional distributions. The formulation of the CWMMD is given in Equation (16), where $X_s^c$ and $X_t^c$ represent the data samples belonging to the $c$-th class from the source and target domains, respectively; $n_s^c$ and $n_t^c$ are the numbers of samples in the $c$-th class for the source and target domains, respectively; and $A$ is a projection matrix that maps the data into a common subspace where the distributions are compared. The term on the left-hand side of Equation (16) represents the squared Euclidean distance between the class-conditional distributions of the source and target domains after projection by $A$. The right-hand side expresses this same discrepancy in its matrix trace form, where $X_{st} = [X_s, X_t]$ denotes the combined source and target data and $M_c$ is a class-specific matrix that encodes the relationships between pairs of samples from the source and target domains within the same class, as defined in Equation (17).
$\mathrm{CWMMD}^2 = \sum_{c=1}^{C}\left\| \dfrac{1}{n_s^c}\sum_{x_i \in X_s^c} A^T x_i - \dfrac{1}{n_t^c}\sum_{x_j \in X_t^c} A^T x_j \right\|_2^2 = \sum_{c=1}^{C}\mathrm{tr}(A^T X_{st} M_c X_{st}^T A)$  (16)
$(M_c)_{ij} = \begin{cases} \dfrac{1}{n_s^c n_s^c}, & x_i, x_j \in X_s^c \\ \dfrac{1}{n_t^c n_t^c}, & x_i, x_j \in X_t^c \\ -\dfrac{1}{n_s^c n_t^c}, & x_i \in X_s^c, x_j \in X_t^c \ \text{or}\ x_j \in X_s^c, x_i \in X_t^c \\ 0, & \text{otherwise} \end{cases}$  (17)
To achieve effective transfer learning, Long et al. proposed the Joint Distribution Adaptation (JDA) framework, which combines the marginal MMD (addressed in Equation (14)) with the class-conditional CWMMD (addressed in Equation (16)). The resulting optimization problem is shown in Equation (18), where $\sum_{c=0}^{C}\mathrm{tr}(A^T X_{st} M_c X_{st}^T A)$ combines the marginal and conditional discrepancies into a single objective (with $c = 0$ corresponding to the marginal distribution), $\alpha \|A\|_F^2$ is a regularization term that controls the scale of the projection matrix $A$, ensuring the problem is well posed and preventing overfitting, and the constraint $A^T X_{st} H_{st} X_{st}^T A = I_{K \times K}$ restricts the total variation in the projected data to a fixed value, preserving important statistical information. Here, $H_{st} = I_{n_{st} \times n_{st}} - \frac{1}{n_{st}}\mathbf{1}_{n_{st} \times n_{st}}$ is a centering matrix that ensures the projected data are centered, with $n_{st} = n_s + n_t$ representing the total number of samples. This optimization problem is designed to find the optimal projection matrix $A$ that aligns both marginal and conditional distributions across domains, thereby enabling effective domain adaptation even when the target domain lacks labeled data.
$\min_A\; \sum_{c=0}^{C}\mathrm{tr}(A^T X_{st} M_c X_{st}^T A) + \alpha \|A\|_F^2 \quad \text{s.t.}\quad A^T X_{st} H_{st} X_{st}^T A = I_{K \times K}$  (18)
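For reference, the following NumPy sketch constructs the matrices M_0 and M_c of Equations (15) and (17) for samples stored in the concatenated order [X_s | X_t]; the helper name and mask-based interface are illustrative.

import numpy as np

def mmd_matrix(ns, nt, src_mask_c=None, tgt_mask_c=None):
    # Builds M_0 (Equation (15)) when no class masks are given, or M_c
    # (Equation (17)) for one class c, given boolean masks over the
    # concatenated sample order [X_s | X_t].  Classes are assumed non-empty.
    n = ns + nt
    M = np.zeros((n, n))
    if src_mask_c is None:
        s, t = np.arange(ns), np.arange(ns, n)
    else:
        s, t = np.where(src_mask_c)[0], np.where(tgt_mask_c)[0]
    nsc, ntc = len(s), len(t)
    M[np.ix_(s, s)] = 1.0 / (nsc * nsc)
    M[np.ix_(t, t)] = 1.0 / (ntc * ntc)
    M[np.ix_(s, t)] = -1.0 / (nsc * ntc)
    M[np.ix_(t, s)] = -1.0 / (nsc * ntc)
    return M

# tr(A^T X_st M_0 X_st^T A) then reproduces the squared mean difference of Equation (14).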

2.6. Discriminative Class-Wise MMD Based on Euclidean Distance

The use of MMD aims to extract shared common features between the source and target domains by minimizing the mean difference for each pair of classes, even when their distributions are distinct. How is this achieved in practice? Wang et al. [25] provided valuable insights, illustrating that the principles of MMD closely mirror human transferable learning behaviors. Their approach treats each category as a distinct group, analyzing and adjusting the means of specific categories in both the source and target domains. For example, in the case of a specific class shared by the source and target domains, the category means are progressively aligned by minimizing the mean difference between the pairs, maximizing their intra-class distances. As the domain adaptation (DA) process progresses, the means of these classes from the two domains converge, reducing joint variance and improving feature alignment. This process reflects how humans naturally extract shared features from underlying semantics, capturing broad patterns while forgoing some finer details. The progressive alignment of category means exemplifies how MMD enhances feature generalization across domains, facilitating robust domain adaptation.
Wang et al. [25] presented Lemmas 1–3 as follows, where Lemmas 2 and 3 were both proven by them, and Lemma 1 follows from the identity for the inter-class (or between-class) distance given in [32]:
Lemma 1.
The inter-class scatter S b is defined as the squared inter-class distance and can be expressed as
$S_b = \mathrm{tr}(A^T \mathbf{S}_b A) = \dfrac{1}{n}\sum_{c=1}^{C}\sum_{k=c+1}^{C} n_c n_k \,\mathrm{tr}(A^T D_{ck} A),$  (19)
where $D_{ij} = (m_i - m_j)(m_i - m_j)^T$, $\mathbf{S}_b = \sum_{i=1}^{C} n_i (m_i - m)(m_i - m)^T$ is the inter-class scatter matrix, $n_i$ is the number of data instances in the $i$-th category, $m_i$ represents the mean of the data samples from the $i$-th category, and $m$ represents the mean of all the data samples. For brevity, we omit the proofs.
Lemma 2.
The squared inter-class distance equals the data variance minus the squared intra-class distance:
$S_b = S_v - S_w,$  (20)
where $S_v = \mathrm{tr}(A^T \mathbf{S}_v A)$ is the variance, $S_w = \mathrm{tr}(A^T \mathbf{S}_w A)$ is the squared intra-class (or within-class) distance, $\mathbf{S}_v = \sum_{i=1}^{n}(x_i - m)(x_i - m)^T$, and $\mathbf{S}_w = \sum_{c=1}^{C}\sum_{x_j \in X_c}(x_j - m_c)(x_j - m_c)^T$.
Lemma 3.
The following identity describes the Class-Wise Maximum Mean Discrepancy (CWMMD):
$\mathrm{CWMMD} = \sum_{c=1}^{C}\mathrm{tr}(A^T X M_c X^T A) = \sum_{c=1}^{C}\dfrac{n_s^c + n_t^c}{n_s^c n_t^c}\,\mathrm{tr}(A^T S_{st,b}^{c} A) = \sum_{c=1}^{C}\dfrac{n_s^c + n_t^c}{n_s^c n_t^c}\,\mathrm{tr}(A^T S_{st,v}^{c} A) - \sum_{c=1}^{C}\dfrac{n_s^c + n_t^c}{n_s^c n_t^c}\,\mathrm{tr}(A^T S_{st,w}^{c} A),$  (21)
where
$S_{st,b}^{c} = \sum_{i \in \{s, t\}} n_i^c\, (m_i^c - m_{st}^c)(m_i^c - m_{st}^c)^T,$  (22)
$S_{st,v}^{c} = \sum_{i=1}^{n_{st}^c} (x_i - m_{st}^c)(x_i - m_{st}^c)^T,$  (23)
and
$S_{st,w}^{c} = \sum_{i \in \{s, t\}} \sum_{j=1}^{n_i^c} (x_j - m_i^c)(x_j - m_i^c)^T.$  (24)
Here, $n_i^c$ denotes the number of data instances in the $c$-th category from domain $i$ (where $i$ can be either source $s$ or target $t$), and $n_{st}^c = n_s^c + n_t^c$. Additionally, $m_i^c$ represents the mean of the data in the $c$-th category from domain $i$, while $m_{st}^c$ denotes the mean of the data in the $c$-th category from both the source and target domains combined. The subscript $st$ of $S_{st}$ signifies that both the source and target domains are considered together.
Notably, in this paper, we correct a statement proposed by Wang et al. The original statement, “The inter-class distance equals the data variance minus the intra-class distance”, should be revised to “The squared inter-class distance equals the data variance minus the squared intra-class distance”.
Let $S_b^c = \mathrm{tr}(A^T X_{st} M_c X_{st}^T A)$ be the squared inter-class distance in the transformed space, based on the transformation matrix $A$, for class $c$ between the source and target domains. Then, let $S_b = \sum_{c=1}^{C} S_b^c$, so that Equation (18) can be written as Equation (25). According to the identity $S_b = S_v - S_w$ derived by Wang et al. [25], Equation (25) can be written as Equation (26), where $S_w$ is the squared intra-class distance between the source and target domains, and $S_v$ is their variance. Therefore, minimizing the squared inter-class distance $S_b$ is equivalent to maximizing the squared intra-class distance $S_w$ while simultaneously minimizing their variance $S_v$, thereby reducing feature distinguishability. To address this, a balance parameter $\beta$ ($-1 \le \beta \le 1$) is applied directly to the hidden squared intra-class distance $S_w$ to regulate its variation, as shown in Equation (27).
$\min_A\; S_b + \mathrm{MMD}^2 + \alpha \|A\|_F^2 \quad \text{s.t.}\quad A^T X_{st} H_{st} X_{st}^T A = I_{K \times K}$  (25)
$\min_A\; S_v - S_w + \mathrm{MMD}^2 + \alpha \|A\|_F^2 \quad \text{s.t.}\quad A^T X_{st} H_{st} X_{st}^T A = I_{K \times K}$  (26)
$\min_A\; S_v + \beta \cdot S_w + \mathrm{MMD}^2 + \alpha \|A\|_F^2 \quad \text{s.t.}\quad A^T X_{st} H_{st} X_{st}^T A = I_{K \times K}$  (27)

2.7. Discriminative Class-Wise MMD Based on Gaussian Kernels

Wang et al. [25] extended the work of Long et al. [24] by introducing a discriminative Class-Wise MMD, which retains the use of linear transformations to project samples into the feature space and employs the Euclidean distance to measure the mean difference between the distributions of samples from two domains. However, linear transformations are generally less effective and efficient compared to nonlinear transformations, such as those applied in the Reproducing Kernel Hilbert Space (RKHS), where more complex patterns and relationships between domains can be captured.
In our previous research [28], we redefined the MMD proposed by Wang et al. by incorporating a Gaussian kernel within the RKHS framework, as an RKHS based on the Gaussian kernel is universal [33]. This modification enables the MMD to be computed more efficiently and flexibly using the kernel trick, enhancing its applicability to a broader range of scenarios. First, we redefined the squared inter-class distance $S_b$, the squared intra-class distance $S_w$, and the variance $S_v$ as $S_{inter}$, $S_{intra}$, and $S_{var}$, respectively, in Definitions 1 through 3. We then proved that, under the Gaussian kernel MMD, the MMD representing the inter-class distance between the source and target domains can be decomposed into the intra-class distance and the variance within the source and target domains.
Definition 1.
The squared inter-class distance between the source and target domains is defined as $S_{inter} = \sum_{c=1}^{C} S_{st,inter}^{c}$, where $S_{st,inter}^{c}$ is the squared inter-class distance for class $c$ between the source and target domains, as shown in Equation (28).
$S_{st,inter}^{c} = \dfrac{1}{(n_s^c)^2}\sum_{x_i, x_j \in X_s^c}\langle x_i, x_j\rangle_{\mathcal{H}} + \dfrac{1}{(n_t^c)^2}\sum_{x_i, x_j \in X_t^c}\langle x_i, x_j\rangle_{\mathcal{H}} - \dfrac{2}{n_s^c n_t^c}\sum_{x_i \in X_s^c,\, x_j \in X_t^c}\langle x_i, x_j\rangle_{\mathcal{H}}$  (28)
Definition 2.
The squared intra-class distance between the source and target domains is defined as $S_{intra} = \sum_{c=1}^{C} S_{st,intra}^{c}$, where $S_{st,intra}^{c}$ is the squared intra-class distance for class $c$ between the source and target domains, as shown in Equation (29).
$S_{st,intra}^{c} = \dfrac{n_s^c + n_t^c}{n_s^c n_t^c}\sum_{x_i \in X_{st}^c}\langle x_i, x_i\rangle_{\mathcal{H}} - \dfrac{1}{n_s^c}\sum_{x_i, x_j \in X_s^c}\langle x_i, x_j\rangle_{\mathcal{H}} - \dfrac{1}{n_t^c}\sum_{x_i, x_j \in X_t^c}\langle x_i, x_j\rangle_{\mathcal{H}}$  (29)
Definition 3.
The variance between the source and target domains is defined as $S_{var} = \sum_{c=1}^{C} S_{st,var}^{c}$, where $S_{st,var}^{c}$ is the total variance for class $c$ between the source and target domains, as shown in Equation (30).
$S_{st,var}^{c} = \dfrac{n_s^c + n_t^c}{n_s^c n_t^c}\sum_{x_j \in X_{st}^c}\langle x_j, x_j\rangle_{\mathcal{H}} - \dfrac{1}{n_s^c n_t^c}\sum_{x_i, x_j \in X_s^c}\langle x_i, x_j\rangle_{\mathcal{H}} - \dfrac{1}{n_s^c n_t^c}\sum_{x_i, x_j \in X_t^c}\langle x_i, x_j\rangle_{\mathcal{H}} - \dfrac{2}{n_s^c n_t^c}\sum_{x_i \in X_s^c,\, x_j \in X_t^c}\langle x_i, x_j\rangle_{\mathcal{H}}$  (30)
Theorem 1.
$S_{inter} = S_{var} - S_{intra}$.
Our previous work [28] established Theorem 1 and provided its proof. The traditional MMD, without categorization, is the inter-class distance of samples from the two domains, referred to as the marginal MMD, defined in Equation (31). The Class-Wise MMD refers to the inter-class distance of samples from a specific category in the two domains, termed the conditional MMD. For example, $S_{st,inter}^{c}$ is the squared MMD, or the squared inter-class distance, for class $c$ between the two domains and can be written as $\mathrm{MMD}_c^2 = S_{st,inter}^{c}$. As a result, the loss function $L_{cwmmd}$ based on the Class-Wise Maximum Mean Discrepancy is defined as the sum of the squared inter-class distance $S_{inter}$ and the squared marginal MMD, as shown in Equation (32), which can also be written as Equation (33) according to Theorem 1. To address the reduction in feature discriminability, we adopt the strategy proposed by Wang et al. [25], introducing a balance parameter $\beta$ ($-1 \le \beta \le 1$) on the hidden squared intra-class distance within the squared inter-class distance $S_{inter}$. This modification adjusts the loss function, resulting in $L_{dcwmmd}$, as shown in Equation (34).
$\mathrm{MMD}^2(X_s, X_t) = \dfrac{1}{n_s^2}\sum_{x_i \in X_s}\sum_{x_j \in X_s}\langle x_i, x_j\rangle_{\mathcal{H}} + \dfrac{1}{n_t^2}\sum_{x_i \in X_t}\sum_{x_j \in X_t}\langle x_i, x_j\rangle_{\mathcal{H}} - \dfrac{2}{n_s n_t}\sum_{x_i \in X_s}\sum_{x_j \in X_t}\langle x_i, x_j\rangle_{\mathcal{H}}$  (31)
$L_{cwmmd} = S_{inter} + \mathrm{MMD}^2(X_s, X_t)$  (32)
$L_{cwmmd} = S_{var} - S_{intra} + \mathrm{MMD}^2(X_s, X_t)$  (33)
$L_{dcwmmd} = S_{var} + \beta \cdot S_{intra} + \mathrm{MMD}^2(X_s, X_t)$  (34)
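A minimal NumPy sketch of the class-wise MMD terms appearing in Equation (32) (the marginal term plus one conditional term per class) is given below, assuming a fixed Gaussian bandwidth and pseudo-labels for the target batch; the discriminative intra-class term of Equation (34) is omitted for brevity, and all values are synthetic.

import numpy as np

def k_gauss(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2(Xa, Xb, sigma=1.0):
    # Biased kernel estimate of the squared MMD, as in Equations (28) and (31).
    return (k_gauss(Xa, Xa, sigma).mean() + k_gauss(Xb, Xb, sigma).mean()
            - 2 * k_gauss(Xa, Xb, sigma).mean())

def class_wise_mmd(Zs, ys, Zt, yt_pseudo, C, sigma=1.0):
    # Marginal term (c = 0) plus one conditional term per class, i.e. the sum
    # over MMD_c^2 that appears in Equations (32) and (36).
    total = mmd2(Zs, Zt, sigma)
    for c in range(C):
        Zsc, Ztc = Zs[ys == c], Zt[yt_pseudo == c]
        if len(Zsc) and len(Ztc):          # skip classes missing from a batch
            total += mmd2(Zsc, Ztc, sigma)
    return total

rng = np.random.default_rng(0)
Zs = rng.normal(0.0, 1, (64, 16)); ys = rng.integers(0, 3, 64)
Zt = rng.normal(0.4, 1, (64, 16)); yt = rng.integers(0, 3, 64)   # pseudo-labels
print(class_wise_mmd(Zs, ys, Zt, yt, C=3))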

3. The Proposed Method

The proposed unsupervised domain adaptation (UDA) approach primarily utilizes Discriminative Class-Wise Maximum Mean Discrepancy (MMD) to align the class-level data distributions of the source and target domains, addressing the reduction in feature distinguishability that arises when MMD minimizes the mean deviation between the two domains, and thereby effectively achieving the goal of UDA. However, the MMD used here is learned with a meta-module, MTMMD, to obtain an MMD with deep kernels (MMDDK). The framework of the proposed method, called MeTa Discriminative Class-Wise MMD (MCWMMD), is shown in Figure 1. The orange block represents the feature extractor F, which is responsible for extracting domain-invariant features. The green block represents the classifier C, which predicts class labels based on the extracted features. The light red block represents the meta-module MTMMD, labeled “MMDDK”, which is designed to measure the distance between the feature distributions of samples from the two domains.
The training objective of the meta-module MTMMD is to enhance its ability to discriminate between the two domains. This is achieved by updating MMDDK to maximize the feature distance measurement values of samples from the two domains. Conversely, the training objective of the MCWMMD module is to update the feature extractor F so that the feature distance measurement values of samples from the two domains, computed using the current MTMMD, are minimized.
These opposing training objectives result in a process resembling adversarial training, where the two modules iteratively adjust to counteract each other. This alternating training process allows each module to improve its performance while balancing the influence of the other. The remainder of this section will introduce the detailed training processes of these two modules.

3.1. Deep Kernel Training Network

According to the Maximum Mean Discrepancy with Deep Kernels (MMDDK) defined by Liu et al. [23], as explained in Equation (7), we construct a training network for the MMDDK, as depicted in Figure 2. The input to the training network is the feature vector $z = F(x)$ extracted from the MCWMMD network, where two vectors, $z_s = F(x_s)$ and $z_t = F(x_t)$ (referred to as first-order features), are used to compute the Gaussian function value $q(z_s, z_t)$ in Equation (9). Additionally, they are separately input into another feature extractor $F_d$ to obtain $\hat{z}_s = F_d(z_s)$ and $\hat{z}_t = F_d(z_t)$ (referred to as second-order features), which are used to compute the Gaussian function value $\kappa(\hat{z}_s, \hat{z}_t)$ in Equation (8). Subsequently, the two Gaussian function values $q(z_s, z_t)$ and $\kappa(\hat{z}_s, \hat{z}_t)$ are combined using the operator $k_\omega$ defined in Equation (7) to calculate the deep kernel distance $k_\omega(z_s, z_t)$. The parameters $\sigma_\phi$, $\sigma_q$, the weight $\epsilon$, and the network parameters $\theta_d$ of $F_d$ are jointly trained using this deep neural network, denoted as $F_\omega$, where $\omega = (\theta_d, \sigma_\phi, \sigma_q, \epsilon)$. Its training is based on maximizing the objective function $J_\lambda$ in Equation (10). The network training algorithm is presented in Algorithm 1.
Algorithm 1 Training the MMDDK
Input: η₁;
Initialize ω = (θ_d, σ_φ, σ_q, ε);
repeat until convergence
   (X_s, Y_s) = {(x_s^1, y_s^1), (x_s^2, y_s^2), …, (x_s^N, y_s^N)} ← mini-batch from Δ_s;
   X_t = {x_t^1, x_t^2, …, x_t^N} ← mini-batch from Δ_t;
   Z_s ← F(X_s); Z_t ← F(X_t);
   Ẑ_s ← F_d(Z_s); Ẑ_t ← F_d(Z_t);
   M(ω) ← MMDDK_u²(Z_s, Z_t; k_ω);   # using (11)
   V(ω) ← σ̂₁²(Z_s, Z_t; k_ω);       # using (12) with λ = 0
   J(ω) ← M(ω) / V(ω);              # using (10)
   # update parameters:
   ω ← ω + η₁ ∇_ω J(ω);             # maximizing J
end repeat
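In a PyTorch implementation, the update step of Algorithm 1 amounts to gradient ascent on J. The sketch below reuses the DeepKernel and mmddk_objective sketches introduced in Section 2.4; the synthetic feature batches, feature dimension, and learning rate are assumptions.

import torch

kernel = DeepKernel(in_dim=256)
opt = torch.optim.Adam(kernel.parameters(), lr=1e-3)     # plays the role of eta_1

for _ in range(100):                                     # "repeat until convergence"
    Zs = torch.randn(64, 256)                            # stand-in for F(X_s)
    Zt = torch.randn(64, 256) + 0.3                      # stand-in for F(X_t)
    J = mmddk_objective(kernel, Zs, Zt)
    opt.zero_grad()
    (-J).backward()                                      # ascend on J (Equation (10))
    opt.step()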

3.2. Meta-Learning of Maximum Mean Discrepancy

In this section, we redefine and compute the Maximum Mean Discrepancy (MMD) originally defined and calculated by Wang et al. [25] in the context of linear transformation spaces, but now in the Reproducing Kernel Hilbert Space (RKHS), using the kernel trick for straightforward MMD computation. Consequently, we also redefine their squared between-class distance ($S_{inter}$), squared within-class distance ($S_{intra}$), and variance ($S_{var}$), and demonstrate that, under the Gaussian kernel-based MMD, the squared between-class distance between the source and target domains can be decomposed into the sum of the squared within-class distances from both domains and their variance difference.
As described in Section 2, minimizing the squared between-class distance $S_{inter}$ is equivalent to maximizing the squared within-class distance $S_{intra}$ for both the source and target domains while simultaneously minimizing their total variance $S_{var}$, which leads to decreased feature discriminability. To address this issue, a balancing parameter $\beta$ ($-1 \le \beta \le 1$) is applied to the hidden squared within-class distance $S_{intra}$ within $S_{inter}$, yielding the discriminative class-wise loss function $L_{dcwmmd}$ defined in Equation (34), which can be rewritten as Equation (35).
For convenience, let us define $\mathrm{MMD}_0^2 = S_{st,inter}^{0} = \mathrm{MMD}^2$, i.e., the marginal MMD is treated as the class-0 term. Hence, Equation (35) can also be rewritten as Equation (36), where the second term represents the sum of the marginal MMD and the conditional MMDs, and the coefficient $\beta' = \beta + 1$ ranges between 0 and 2, i.e., $0 \le \beta' \le 2$.
$L_{dcwmmd} = (\beta + 1)\cdot S_{intra} + (S_{var} - S_{intra}) + \mathrm{MMD}^2(Z_s, Z_t) = (\beta + 1)\cdot S_{intra} + S_{inter} + \mathrm{MMD}^2(Z_s, Z_t)$  (35)
$L_{dcwmmd} = \beta' \cdot \sum_{c=1}^{C} S_{st,intra}^{c} + \sum_{c=0}^{C} \mathrm{MMD}_c^2$  (36)
Although we adopt the Discriminative Class-Wise Maximum Mean Discrepancy (DCWMMD), where MMD is computed based on a Gaussian function in a Reproducing Kernel Hilbert Space (RKHS), there is no reliable method to select the appropriate bandwidth value for the Gaussian function. Therefore, in this study, we choose the Maximum Mean Discrepancy with Deep Kernels (MMDDK) proposed by Liu et al. [23], where the bandwidth is learned by the network, endowing the MMDDK with stronger discriminative power. To further adapt the MMDDK to the mean discrepancy calculations for different domain pairs, we employ meta-learning to learn this MMDDK, resulting in a method called MeTa Maximum Mean Discrepancy (MTMMD), which is more suitable for efficient optimization using gradient descent [34,35].
Our proposed MTMMD network architecture, as shown in Figure 3, is based on concepts similar to previous meta-learning loss functions [36]. It parameterizes the Maximum Mean Discrepancy through a neural network $F_\psi$, which receives the second-order features $\hat{Z}_s$ and $\hat{Z}_t$ predicted by the MMDDK model $F_\omega$, along with the bandwidths $\sigma_\phi$ and $\sigma_q$ and the weight $\epsilon$. We aim to learn the parameters $\psi$ such that when $\omega = (\theta_d, \sigma_\phi, \sigma_q, \epsilon)$ is updated through $F_\psi$, the final performance is optimal. The learning of the parameters $\omega = (\theta_d, \sigma_\phi, \sigma_q, \epsilon)$ involves maximizing not only the original objective function $J$ but also the meta-learning objective function $J_{mt}$ output by $F_\psi = (M_\psi, V_\psi)$. The parameters $\omega$ and $\psi$ are alternately updated, as shown in Equations (37) and (38).
The primary goal of both updates is to maximize the value of the mean discrepancy function; hence, we aim for both $J_{mt}$ and $J$ to be maximized, with the parameter adjustments being positive multiples of the partial derivatives. The MTMMD network training architecture is illustrated in Figure 3. Since the MMDDK and MTMMD are trained together, the gradient of the MMDDK objective function $J$ is also used to update the parameters $\omega$, modifying Equation (37) into Equation (39). The MTMMD network training and inference algorithms are presented in Algorithms 2 and 3, respectively. Subsequently, the domain adaptation training uses the two-domain mean discrepancy loss function in which the $\mathrm{MMD}_c^2$ terms in $L_{dcwmmd}$ are replaced by the meta-learned $M_{mt}^{c}$ in Equation (40), resulting in the loss function $L_{mtdcwmmd}$ in Equation (41).
$\omega_{t+1} \leftarrow \omega_t + \alpha_1 \cdot \dfrac{\partial J_{mt}\big(F_\psi(F_{\omega_t}(Z_s, Z_t; k_{\omega_t}))\big)}{\partial \omega_t}$  (37)
$\psi_{t+1} \leftarrow \psi_t + \alpha_2 \cdot \dfrac{\partial J\big(F_{\omega_{t+1}}(Z_s, Z_t; k_{\omega_{t+1}})\big)}{\partial \psi_t}$  (38)
$\omega_{t+1} \leftarrow \omega_t + \alpha_0 \cdot \dfrac{\partial J\big(F_{\omega_t}(Z_s, Z_t; k_{\omega_t})\big)}{\partial \omega_t} + \alpha_1 \cdot \dfrac{\partial J_{mt}\big(F_\psi(F_{\omega_t}(Z_s, Z_t; k_{\omega_t}))\big)}{\partial \omega_t}$  (39)
$M_{mt}^{c} = M_\psi\big(F_\omega(Z_s^c, Z_t^c; k_\omega)\big)$  (40)
$L_{mtdcwmmd} = \beta' \cdot \sum_{c=1}^{C} S_{st,intra}^{c} + \sum_{c=0}^{C} M_{mt}^{c}$  (41)
The MMDDK model $F_\omega$ passes its predictions $F_{\omega_t}$ to the meta-module MTMMD $F_\psi$, which outputs $F_{\psi}^{t} = (M_{mt}^{t}, V_{mt}^{t})$, where $M_{mt}^{t}$ is the mean discrepancy value and $V_{mt}^{t}$ is the variance. We optimize $\psi$ to ensure that when the MMDDK model is optimized for $J_{mt}$, the updated $\omega_{t+1}$ performs better, i.e., yields a higher value of the MMDDK objective function $J$ evaluated with $F_{\omega_{t+1}}(Z_s, Z_t; k_{\omega_{t+1}})$. To achieve this, we take a gradient step on the meta-module's objective function $J_{mt}$ to update the MMDDK model parameters to $\omega_{t+1}$, and then we update $\psi$ by evaluating $\omega_{t+1}$ using the MMDDK objective function $J$.
Algorithm 2 Training MTMMD
Input: α₀, α₁, α₂;
Initialize ω and ψ: the parameter sets of the MMDDK model F_ω and the MTMMD model F_ψ; T ← 10,000;   # ω = (θ_d, σ_φ, σ_q, ε)
for t ← 0 to T do
   (X_s, Y_s) = {(x_s^1, y_s^1), (x_s^2, y_s^2), …, (x_s^N, y_s^N)} ← mini-batch from Δ_s;
   X_t = {x_t^1, x_t^2, …, x_t^N} ← mini-batch from Δ_t;
   Z_s ← F(X_s); Z_t ← F(X_t);
   (Ẑ_s, Ẑ_t, σ_φ, σ_q, ε) ← F_ω(Z_s, Z_t; k_ω);   # Ẑ_s = F_d(Z_s); Ẑ_t = F_d(Z_t)
   # alternately update the parameters ω and ψ:
   M ← MMDDK_u²(Z_s, Z_t; k_ω);   # using (11)
   V ← σ̂₁²(Z_s, Z_t; k_ω);       # using (12) with λ = 0
   J ← M / V;                     # using (10)
   if t is even then
      (M_mt, V_mt) ← F_ψ(Ẑ_s, Ẑ_t, σ_φ, σ_q, ε);
      J_mt ← M_mt / V_mt;
      ω ← ω + α₀ ∇_ω J + α₁ ∇_ω J_mt;   # maximizing J and J_mt, as in (39)
   else
      ψ ← ψ + α₂ ∇_ψ J;                 # maximizing J, as in (38)
   end if
end for
Algorithm 3 MTMMD Inferencing
Input: ω and ψ;
   (X_s, Y_s) = {(x_s^1, y_s^1), (x_s^2, y_s^2), …, (x_s^N, y_s^N)} ← mini-batch from Δ_s;
   X_t = {x_t^1, x_t^2, …, x_t^N} ← mini-batch from Δ_t;
   Z_s ← F(X_s); Z_t ← F(X_t);
   (Ẑ_s, Ẑ_t, σ_φ, σ_q, ε) ← F_ω(Z_s, Z_t; k_ω);   # Ẑ_s = F_d(Z_s); Ẑ_t = F_d(Z_t)
   (M_mt, V_mt) ← F_ψ(Ẑ_s, Ẑ_t, σ_φ, σ_q, ε);
return M_mt
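The ψ-update of Equation (38) requires differentiating through the ω-step driven by J_mt. The following PyTorch sketch shows one possible way to express this with torch.func.functional_call, reusing the DeepKernel sketch from Section 2.4; the MetaModule architecture, the step sizes, and the way second-order features are pooled are illustrative assumptions, not the authors' exact design.

import torch
import torch.nn as nn
from torch.func import functional_call

class MetaModule(nn.Module):
    # F_psi: maps pooled second-order features to a discrepancy value M_mt and
    # a positive variance value V_mt.  The pooling and layer sizes are guesses.
    def __init__(self, feat_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * feat_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 2))

    def forward(self, zhat_s, zhat_t):
        out = self.net(torch.cat([zhat_s.mean(0), zhat_t.mean(0)]))
        return out[0], torch.nn.functional.softplus(out[1]) + 1e-6

def J_with(params, Zs, Zt, lam=1e-8):
    # Equation (10) evaluated for the deep kernel with an explicit parameter set.
    k = lambda A, B: functional_call(kernel, params, (A, B))
    H = k(Zs, Zs) + k(Zt, Zt) - k(Zs, Zt) - k(Zt, Zs)     # Equation (13)
    n = Zs.shape[0]
    mmd2 = (H.sum() - H.diag().sum()) / (n * (n - 1))     # Equation (11)
    var = 4 / n**3 * (H.sum(1) ** 2).sum() - 4 / n**4 * H.sum() ** 2 + lam
    return mmd2 / var.sqrt()

kernel = DeepKernel(in_dim=256)                 # from the earlier sketch
meta_net = MetaModule(feat_dim=32)
opt_psi = torch.optim.Adam(meta_net.parameters(), lr=1e-3)   # alpha_2
alpha_1 = 1e-3

def psi_step(Zs, Zt):
    params = dict(kernel.named_parameters())
    M_mt, V_mt = meta_net(kernel.F_d(Zs), kernel.F_d(Zt))
    J_mt = M_mt / V_mt
    grads = torch.autograd.grad(J_mt, list(params.values()),
                                create_graph=True, allow_unused=True)
    # one differentiable ascent step on J_mt gives omega_{t+1} (Equation (37))
    new_params = {name: (p if g is None else p + alpha_1 * g)
                  for (name, p), g in zip(params.items(), grads)}
    J_new = J_with(new_params, Zs, Zt)          # J at omega_{t+1}
    opt_psi.zero_grad()
    (-J_new).backward()                         # maximize J w.r.t. psi (Equation (38))
    opt_psi.step()

psi_step(torch.randn(64, 256), torch.randn(64, 256) + 0.3)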

3.3. MeTa Discriminative Class-Wise Maximum Mean Discrepancy

The proposed UDA approach, based on MeTa Discriminative Class-Wise Maximum Mean Discrepancy (MCWMMD), includes a feature extractor $F$ for extracting domain-invariant features for the classifier $C$, as shown in Figure 1. Inputs $x_s$ and $x_t$ are fed into the feature extractor $F$, resulting in outputs $z_s = F(x_s)$ and $z_t = F(x_t)$. These outputs are then input into the classifier $C$ for classification predictions, producing $\hat{l}_s = C(z_s)$ and $\hat{l}_t = C(z_t)$. In practice, the batch size for both the source domain and the target domain is set to $N$, with a total of $C$ category labels. The feature extractor $F$ extracts features from the input samples $X_s = \{x_s^i\}_{i=1}^{N}$ and $X_t = \{x_t^j\}_{j=1}^{N}$ and outputs $Z_s = \{z_s^i\}_{i=1}^{N}$ and $Z_t = \{z_t^j\}_{j=1}^{N}$, respectively. These features, $Z_s$ and $Z_t$, are then input into the classifier $C$ for classification. In the diagram, $F$ and $C$ are depicted twice to correspond to the data paths of the source and target domains, with a dashed line in between to indicate shared parameters. The MCWMMD network is trained by minimizing the total loss function $L_{total}$, as defined in Equation (42), where $L_{mtdcwmmd}$ is defined in Equation (41) and $L_{cls\_ls}$ is defined in Equation (44). The latter is a label-smoothed version of the classification cross-entropy in Equation (43), designed to encourage samples to fall into compact, uniform, and well-separated clusters. The original label $y_s^i$ is replaced by $(1 - \alpha)\, y_s^i + (\alpha/C)\,\mathbf{1}$, where $\mathbf{1}$ is a vector of ones with $C$ dimensions and $\alpha$ is the smoothing parameter. Additionally, $L_{ent}$ represents the predicted label entropy of the target samples, as shown in Equation (45). The network training algorithm for this MCWMMD module is presented in Algorithm 4.
$L_{total} = L_{mtdcwmmd} + \omega_2 \cdot L_{cls\_ls} + \omega_3 \cdot L_{ent}$  (42)
$L_{cls}(Z_s, Y_s) = -\dfrac{1}{n_s}\sum_{i=1}^{n_s}\sum_{c=1}^{C} y_s^{i,c} \log \hat{l}_s^{i,c}$  (43)
$L_{cls\_ls}(Z_s, Y_s) = -\dfrac{1}{n_s}\sum_{i=1}^{n_s}\sum_{c=1}^{C} \big((1 - \alpha)\, y_s^{i,c} + \alpha/C\big) \log \hat{l}_s^{i,c}$  (44)
$L_{ent}(Z_t) = -\dfrac{1}{N}\sum_{j=1}^{N}\sum_{c=1}^{C} \hat{l}_t^{j,c} \log \hat{l}_t^{j,c}$  (45)
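The classification and entropy terms of Equations (44) and (45) correspond to standard label-smoothed cross-entropy and prediction-entropy losses. A possible PyTorch rendering is sketched below; the smoothing parameter and the way the terms are combined are placeholders.

import torch
import torch.nn.functional as nnF

def label_smoothed_ce(logits_s, ys, alpha=0.1):
    # Equation (44): cross-entropy against (1 - alpha) * one_hot(y) + alpha / C.
    C = logits_s.shape[1]
    log_p = nnF.log_softmax(logits_s, dim=1)
    smooth = (1 - alpha) * nnF.one_hot(ys, C).float() + alpha / C
    return -(smooth * log_p).sum(dim=1).mean()

def target_entropy(logits_t):
    # Equation (45): mean entropy of the predicted label distribution.
    p = nnF.softmax(logits_t, dim=1)
    return -(p * torch.log(p + 1e-8)).sum(dim=1).mean()

# Combined as in Equation (42); the MMD term and the weights w2, w3 are assumed
# to be computed elsewhere:
# L_total = L_mtdcwmmd + w2 * label_smoothed_ce(logits_s, ys) + w3 * target_entropy(logits_t)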
Algorithm 4 Training the MCWMMD model
Input: Δ_s, Δ_t, β₁, β₂, η₂;
Initialize the parameters θ_F and θ_C;
# train the model parameters θ_F and θ_C on Δ_s and Δ_t;
repeat until convergence
   (X_s, Y_s) = {(x_s^1, y_s^1), (x_s^2, y_s^2), …, (x_s^N, y_s^N)} ← mini-batch from Δ_s;
   X_t = {x_t^1, x_t^2, …, x_t^N} ← mini-batch from Δ_t;
   Z_s ← F(X_s); Z_t ← F(X_t);
   # generate pseudo labels:
   L̂_t = {l̂_t^1, l̂_t^2, …, l̂_t^N} ← C(F(X_t));   # classify the target samples
   Y_t = {y_t^1, y_t^2, …, y_t^N} ← {psd(l̂_t^1), psd(l̂_t^2), …, psd(l̂_t^N)};   # obtain pseudo labels
   # psd((v_1, v_2, …, v_C)) = argmax_{1≤c≤C} v_c;
   # evaluate the losses:
   L_mtdcwmmd(X_s, X_t) ← β′ · Σ_{c=1}^{C} S_{st,intra}^c + Σ_{c=0}^{C} M_mt^c;   # using (41)
   L_cls_ls ← −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} ((1 − α) y_s^{i,c} + α/C) log l̂_s^{i,c};   # using (44)
   L_ent ← −(1/N) Σ_{j=1}^{N} Σ_{c=1}^{C} l̂_t^{j,c} log l̂_t^{j,c};   # using (45)
   L_total ← L_mtdcwmmd + β₁ L_cls_ls + β₂ L_ent;   # using (42)
   # update θ_F and θ_C to minimize L_total:
   θ_F ← θ_F − η₂ ∇_{θ_F} L_total;
   θ_C ← θ_C − η₂ ∇_{θ_C} L_total;
end repeat
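The pseudo-labeling step psd(·) in Algorithm 4 is simply an argmax over the classifier outputs on the target mini-batch. A minimal PyTorch sketch, with stand-in modules and data, is given below.

import torch

# Stand-ins for the feature extractor F and the classifier C of Algorithm 4.
feature_extractor = torch.nn.Linear(32, 16)
classifier = torch.nn.Linear(16, 10)              # 10 classes
Xt = torch.randn(128, 32)                         # target mini-batch

with torch.no_grad():
    logits_t = classifier(feature_extractor(Xt))  # \hat{L}_t = C(F(X_t))
    Yt_pseudo = logits_t.argmax(dim=1)            # y_t = psd(\hat{l}_t)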

4. Experimental Results

This section presents a comprehensive evaluation of the proposed MCWMMD approach on standard UDA datasets for image classification tasks. The details of the data preparation process are outlined in Section 4.1, while the experimental setup, including model configurations and parameters, is discussed in Section 4.2. Finally, Section 4.3 provides the experimental results and comparisons with baseline methods to demonstrate the effectiveness of the proposed approach.

4.1. Data Preparation

The proposed approach was evaluated on both digit and office object datasets. The digit datasets used in this study included the MNIST (Modified National Institute of Standards and Technology) database [37], USPS (U.S. Postal Service) [38], and SVHN (Street View House Numbers) [39]. The MNIST and USPS consist of grayscale images of handwritten digits, with the MNIST offering 60,000 training samples and 10,000 testing samples and USPS comprising 9298 images, divided into 7291 training and 2007 testing samples. In contrast, SVHN provides 73,257 color training images and 26,032 testing images, depicting digits captured in a street-view context. Figure 4 shows sample images from the MNIST, USPS, and SVHN, with training samples highlighted in blue.
For the office object datasets, we used Office-31 [40] and Office-Home [41]. The Office-31 dataset consists of 4652 images of 31 common office object categories collected from three distinct domains: Amazon (A), which contains images from online merchants; DSLR (D), with high-resolution images taken using a digital SLR camera; and Webcam (W), featuring low-resolution images captured using a web camera. The Office-Home dataset introduces a more complex domain shift, with four distinct domains—Art (Ar), Clipart (Cl), Product (Pr), and Real World (Rw)—spanning 65 object categories and approximately 15,500 images, each offering varied visual styles. Figure 5 and Figure 6 provide sample images from the Office-31 and Office-Home datasets, respectively.

4.2. Experimental Setting

An initial learning rate of 0.001 was used for all experiments, decayed by a factor of 0.1 every 10 epochs. The batch size was set to 128 for the digit datasets and 64 for the office object datasets. The Adam optimizer was used with parameters β1= 0.99 and β2 = 0.999, and it was chosen for its ability to handle sparse gradients. Training lasted for 50 epochs on the digit datasets and 100 epochs on the office object datasets to ensure convergence. A regularization term of 0.0005 was applied to prevent overfitting. The Gaussian kernel used in the MMD calculations had an initial bandwidth of 1.0, dynamically optimized through the meta-learning framework. At the beginning of each epoch, pseudo-labels for all target domain training data were generated based on the current classifier parameters. This iterative process helped refine domain alignment while maintaining computational efficiency.
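A possible PyTorch rendering of this optimization setting is sketched below; the network is a placeholder, and mapping the 0.0005 regularization term to Adam's weight decay is an assumption.

import torch

model = torch.nn.Linear(10, 10)   # placeholder for the feature extractor + classifier
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.99, 0.999), weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(50):           # 50 epochs (digit datasets); 100 for the office datasets
    # regenerate pseudo-labels for the target data, then run the training steps
    scheduler.step()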
Experiments were conducted on a server equipped with NVIDIA RTX 2080 GPUs (manufactured by NVIDIA Corporation, Santa Clara, CA, USA) and 256 GB of system RAM (provided by ADATA, Taiwan). The implementation was carried out using Python with the PyTorch deep learning library (version 1.8), along with NumPy and SciPy for data preprocessing and statistical computations.

4.3. Results

ResNet-18 and ResNet-50 [42] were employed as the network architectures for feature extraction from the digit and office object datasets, respectively. Both models were fine-tuned using pre-trained ImageNet parameters. The performance of the proposed method was evaluated on the above-mentioned datasets: digit datasets, Office-31, and Office-Home. For the digit datasets, we tested domain adaptation between pairs such as MNIST to USPS (M → U), USPS to MNIST (U → M), and SVHN to MNIST (S → M). In the Office-31 dataset, we examined six domain adaptation pairs (e.g., Amazon to DSLR (A → D) and Webcam to DSLR (W → D)). For the Office-Home dataset, we created 12 domain adaptation pairs across four domains (Art, Clipart, Product, and Real-World), including examples like Art to Clipart (Ar → Cl), Product to Real-World (Pr → Rw), and so on.
Table 2 compares our method with several domain adaptation techniques on the digit datasets, including ADDA [5], ADR [43], CDAN [44], CyCADA [12], SWD [45], SHOT [46], and our previous work, DCWMMD [28]. Table 3 provides a comparison on the Office-31 dataset, including methods such as that of Wang et al. [25], DAN [6], DANN [4], ADDA, MADA [47], SHOT, CAN [3], MDGE [2], DACDM [9], CDCL [10], DMP [11], and DCWMMD [28]. Table 4 compares results on the Office-Home dataset with methods such as that of Wang et al., DAN, DACDM [9], DMP [11], and DCWMMD. Please note that the results are directly referenced from the published papers. The bold numbers in the tables indicate the best-performing accuracy for each source-to-target combination. The "Source-only" category represents a classifier trained solely on source data, while "Target-supervised" denotes a classifier trained and tested on target domain data; these typically represent the lower and upper bounds for domain adaptation performance.
In Table 2, our method achieves an average accuracy of 98.60% across digit datasets, outperforming other methods and closely approaching the target-supervised scenario. This highlights the robustness of our approach in aligning domain distributions and achieving class-wise alignment. Table 3 presents the results of the Office-31 dataset, where our method achieved an average accuracy of 91.62%, consistently outperforming other unsupervised adaptation methods and closely matching the target-supervised benchmark. This result underscores the effectiveness of our Class-Wise MMD optimization method in adapting complex, real-world data. In Table 4, our method achieves an average accuracy of 75.43% on the Office-Home dataset, a challenging multi-domain setting with diverse visual characteristics. These results highlight the adaptability and robustness of our approach as it generalizes effectively across multiple domains and significantly closes the gap with the target-supervised benchmark. This performance demonstrates our method’s capability to handle complex domain shifts while maintaining high accuracy across diverse visual domains.
t-SNE (t-distributed Stochastic Neighbor Embedding) [48] is a nonlinear dimensionality reduction technique commonly used to visualize high-dimensional data in a lower-dimensional space (typically 2D or 3D). By preserving local structures within the data, t-SNE excels in representing clusters and relationships, making it particularly useful for visualizing stochastic settings and complex data distributions. In this study, we employ t-SNE to visualize the feature representations learned by our model for both the source and target domains, highlighting the effectiveness of the proposed domain adaptation approach.
Columns (a) and (b) of Figure 7 depict the distributions of the source features and target features, respectively, with the 10 digit classes (0 through 9) represented by distinct colors. Column (c) of Figure 7 provides an integrated view of both distributions to highlight their alignment. As observed, the source and target features are well aligned, demonstrating the effectiveness of our approach. The t-SNE visualization highlights the alignment between the source and target feature distributions, reflecting the improved feature alignment achieved by our method compared to the baseline.
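Such a visualization can be produced with scikit-learn's TSNE; the sketch below uses random stand-in features in place of the representations produced by the trained extractor F, and the perplexity and layout choices are illustrative.

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
feats_s, labels_s = rng.normal(0.0, 1, (500, 64)), rng.integers(0, 10, 500)
feats_t, labels_t = rng.normal(0.2, 1, (500, 64)), rng.integers(0, 10, 500)

emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(np.vstack([feats_s, feats_t]))
ns = len(feats_s)
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
axes[0].scatter(*emb[:ns].T, c=labels_s, cmap="tab10", s=5)        # (a) source
axes[1].scatter(*emb[ns:].T, c=labels_t, cmap="tab10", s=5)        # (b) target
axes[2].scatter(*emb[:ns].T, c="tab:blue", s=5, label="source")    # (c) overlay
axes[2].scatter(*emb[ns:].T, c="tab:orange", s=5, label="target")
axes[2].legend()
plt.show()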

5. Discussion and Conclusions

The proposed method, MeTa Discriminative Class-Wise MMD (MCWMMD), represents a significant advancement in unsupervised domain adaptation by integrating meta-learning with a Class-Wise Maximum Mean Discrepancy (MMD) approach. While traditional MMD methods align overall distributions between source and target domains, they often fail to achieve precise class-wise alignment, reducing feature distinguishability and generalization performance. MCWMMD addresses these limitations by introducing dynamic kernel adaptability and a focus on class-wise alignment, resulting in robust and domain-invariant representations.
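To make the class-wise alignment idea concrete, the following PyTorch sketch computes an MMD term per class and averages it over the classes present in a mini-batch. It is a minimal illustration under a plain Gaussian kernel, not the paper's exact objective; the target-side labels are assumed to be pseudo-labels produced by the current classifier, and the function and argument names are placeholders.

import torch

def gaussian_mmd(x, y, sigma=1.0):
    # Squared MMD between two feature sets under an RBF kernel.
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def class_wise_mmd(z_src, y_src, z_tgt, y_tgt_pseudo, num_classes, sigma=1.0):
    # Average the per-class MMD over classes that appear in both domains.
    losses = []
    for c in range(num_classes):
        zs_c = z_src[y_src == c]
        zt_c = z_tgt[y_tgt_pseudo == c]
        if len(zs_c) > 1 and len(zt_c) > 1:  # skip classes missing from the batch
            losses.append(gaussian_mmd(zs_c, zt_c, sigma))
    return torch.stack(losses).mean() if losses else z_src.new_zeros(())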
A key innovation of MCWMMD is its dynamic kernel adaptability, achieved through a meta-module that adjusts kernel parameters based on class-specific features. This enables more precise domain alignment compared to traditional static kernels, significantly enhancing alignment and generalization. The alternating training process between the feature extractor and the meta-module, inspired by adversarial training, further refines the model’s ability to handle complex domain shifts. However, this adaptability introduces computational complexity, which could be a limitation for time-sensitive applications. Future research could explore simplifying the meta-module to reduce overhead while preserving adaptability.
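A hedged sketch of this alternating scheme is given below: the meta-module (a small network acting as a learned deep kernel) is first updated to make the class-wise discrepancy more discriminative, and the feature extractor and classifier are then updated to minimize it together with the source classification loss. The sketch reuses class_wise_mmd from the previous listing; the module names (feature_net, classifier, meta_kernel), the two optimizers (opt_feat over the feature extractor and classifier, opt_meta over the meta-module), and the weight lam are assumptions, not the paper's exact implementation.

import torch.nn.functional as F

def train_step(feature_net, classifier, meta_kernel, opt_feat, opt_meta,
               xs, ys, xt, num_classes, lam=0.1):
    # Step 1: update the meta-module so the learned kernel sharpens the
    # class-wise discrepancy (adversarial-like maximization).
    zs, zt = feature_net(xs).detach(), feature_net(xt).detach()
    pseudo = classifier(zt).argmax(dim=1)
    mmd = class_wise_mmd(meta_kernel(zs), ys, meta_kernel(zt), pseudo, num_classes)
    opt_meta.zero_grad()
    (-mmd).backward()
    opt_meta.step()

    # Step 2: update the feature extractor and classifier to minimize the
    # class-wise MMD plus the supervised source loss.
    zs, zt = feature_net(xs), feature_net(xt)
    pseudo = classifier(zt).argmax(dim=1).detach()
    mmd = class_wise_mmd(meta_kernel(zs), ys, meta_kernel(zt), pseudo, num_classes)
    cls_loss = F.cross_entropy(classifier(zs), ys)
    opt_feat.zero_grad()
    (cls_loss + lam * mmd).backward()
    opt_feat.step()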
The method’s class-wise alignment approach applies MMD in a class-specific manner, ensuring that each class is individually aligned between source and target domains. This produces compact, domain-invariant, and class-discriminative feature clusters, ultimately improving cross-domain classification performance. However, its reliance on accurate pseudo-labels for class-wise alignment may lead to errors when the pseudo-label quality is low. Developing robust pseudo-labeling strategies is a crucial direction for future research.
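One common way to mitigate this issue is to keep only high-confidence pseudo-labels in the class-wise term, as in the short sketch below. This is a generic strategy offered for illustration, not necessarily the scheme used in our experiments; the threshold value and variable names are assumptions.

import torch

def confident_pseudo_labels(logits, threshold=0.9):
    # Return predicted labels and a mask selecting confident target samples.
    probs = torch.softmax(logits, dim=1)
    conf, labels = probs.max(dim=1)
    return labels, conf >= threshold

# Usage (illustrative): restrict the class-wise MMD to confident target samples.
# labels, mask = confident_pseudo_labels(classifier(feature_net(xt)))
# zt_conf, labels_conf = feature_net(xt)[mask], labels[mask]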
MCWMMD is also optimized for scalability through efficient batch processing and streamlined meta-module training, enabling practical application to large datasets without compromising alignment accuracy. However, scaling it to extremely large or dynamically evolving datasets remains a challenge. Future work could investigate distributed or online learning paradigms to extend the method’s applicability to these scenarios.
Despite its strengths, MCWMMD involves complex meta-module training and adversarial-like processes, which may pose implementation challenges, particularly for practitioners with limited computational resources. Further validation in real-world scenarios with highly diverse and complex domain shifts is also needed. Promising future directions include simplifying the meta-module for enhanced accessibility, improving pseudo-labeling mechanisms, and extending the method to handle online and incremental domain adaptation for dynamic datasets. Additionally, exploring cross-domain generalization to unseen categories or settings could further enhance the method’s adaptability.
In summary, MCWMMD advances unsupervised domain adaptation by combining meta-learning with a Class-Wise MMD approach, addressing the limitations of traditional techniques. Its dynamic kernel adaptability and focus on class-wise alignment enable robust feature alignment and generalization. While challenges such as computational complexity and reliance on pseudo-labels remain, MCWMMD provides a strong foundation for future innovations, paving the way for more adaptable and generalizable deep learning models.

Author Contributions

Conceptualization, H.-W.L., C.-T.T. and H.-J.L.; Methodology, H.-W.L.; Software, T.-T.H. and C.-T.T.; Validation, H.-W.L. and C.-T.T.; Formal Analysis, H.-J.L.; Investigation, T.-T.H. and H.-W.L.; Resources, T.-T.H.; Data Curation, C.-H.Y.; Writing—Original Draft Preparation, H.-J.L.; Writing—Review and Editing, H.-W.L.; Visualization, C.-H.Y.; Supervision, H.-W.L.; Project Administration, H.-J.L.; Funding Acquisition, H.-J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Science and Technology Council, Taiwan, R.O.C., under grant NSTC 113-2221-E-032-020.

Data Availability Statement

(1) SVHN dataset [39]: Available online: https://www.openml.org/search?type=data&sort=runs&id=41081&status=active (accessed on 1 March 2024). (2) Office-31 dataset [40]: Introduced by Kate Saenko et al. in “Adapting Visual Category Models to New Domains”. (3) Office-Home dataset [41]: Available online: https://www.hemanthdv.org/officeHomeDataset.html (accessed on 1 March 2024). (4) t-SNE [48]: Available online: http://www.jmlr.org/papers/v9/vandermaaten08a.html (accessed on 1 March 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lin, Y.; Chen, J.; Cao, Y.; Zhou, Y.; Zhang, L.; Tang, Y.Y.; Wang, S. Cross-domain recognition by identifying joint subspaces of source domain and target Domain. IEEE Trans. Cybern. 2017, 47, 1090–1101. [Google Scholar] [CrossRef]
  2. Khan, S.; Guo, Y.; Ye, Y.; Li, C.; Wu, Q. Mini-batch dynamic geometric embedding for unsupervised domain adaptation. Neural Process. Lett. 2023, 55, 2063–2080. [Google Scholar] [CrossRef]
  3. Zhang, W.; Ouyang, W.; Li, W.; Xu, D. Collaborative and adversarial network for unsupervised domain adaptation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
  4. Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; Lempitsky, V. Domain adversarial training of neural networks. J. Mach. Learn. Res. 2016, 17, 1–35. [Google Scholar] [CrossRef]
  5. Tzeng, E.; Hoffman, J.; Saenko, K.; Darrell, T. Adversarial discriminative domain adaptation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2962–2971. [Google Scholar] [CrossRef]
  6. Long, M.; Cao, Y.; Wang, J.; Jordan, M.I. Learning transferable features with deep adaptation networks. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, Lille, France, 6–11 July 2015; Volume 37, pp. 97–105. [Google Scholar]
  7. Vettoruzzo, A.; Bouguelia, M.-R.; Rögnvaldsson, T.S. Meta-learning for efficient unsupervised domain adaptation. Neurocomputing 2024, 574, 127264. [Google Scholar] [CrossRef]
  8. Liu, X.; Yoo, C.; Xing, F.; Oh, H.; El Fakhri, G.; Kang, J.-W.; Woo, J. Deep Unsupervised Domain Adaptation: A Review of Recent Advances and Perspectives. arXiv 2022, arXiv:2208.07422v1. [Google Scholar] [CrossRef]
  9. Zhang, Y.; Chen, S.; Jiang, W.; Zhang, Y.; Lu, J.; Kwok, J.T. Domain-Guided Conditional Diffusion Model for Unsupervised Domain Adaptation. arXiv 2023, arXiv:2309.14360v1. [Google Scholar] [CrossRef] [PubMed]
  10. Wang, R.; Wu, Z.; Weng, Z.; Chen, J.; Qi, G.-J.; Jiang, Y.-G. Cross-Domain Contrastive Learning for Unsupervised Domain Adaptation. arXiv 2022, arXiv:2106.05528v2. [Google Scholar] [CrossRef]
  11. Luo, Y.-W.; Ren, C.-X.; Dai, D.-Q.; Yan, H. Unsupervised Domain Adaptation via Discriminative Manifold Propagation. IEEE Trans. Pattern Anal. Machine Intell. 2022, 44, 1653–1669. [Google Scholar] [CrossRef] [PubMed]
  12. Hoffman, J.; Tzeng, E.; Park, T.; Zhu, J.Y.; Isola, P.; Saenko, K. Cycada: Cycle-consistent adversarial domain adaptation. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1989–1998. [Google Scholar]
  13. Saito, K.; Watanabe, K.; Ushiku, Y.; Harada, T. Maximum Classifier Discrepancy for Unsupervised Domain Adaptation. arXiv 2018, arXiv:1712.02560v4. [Google Scholar] [CrossRef]
  14. Chen, Y.; Li, W.; Sakaridis, C.; Dai, D.; Van Gool, L. Domain adaptive faster R-CNN for object detection in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3339–3348. [Google Scholar]
  15. Si, S.; Tao, D.; Geng, B. Bregman divergence based regularization for transfer subspace learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 929–942. [Google Scholar] [CrossRef]
  16. Blitzer, J.; Crammer, K.; Kulesza, A.; Pereira, F.; Wortman, J. Learning bounds for domain adaptation. In Proceedings of the Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 3–6 December 2007; pp. 129–136. [Google Scholar]
  17. Ding, Z.; Fu, Y. Robust transfer metric learning for image classification. IEEE Trans. Image Process. 2017, 26, 660–670. [Google Scholar] [CrossRef]
  18. Gretton, A.; Borgwardt, K.; Rasch, M.; Schölkopf, B.; Smola, A. A kernel two-sample test. J. Mach. Learn. Res. 2012, 13, 723–773. [Google Scholar]
  19. Pan, S.J.; Tsang, I.W.; Kwok, J.T.; Yang, Q. Domain adaptation via transfer component analysis. IEEE Trans. Neural Netw. 2011, 22, 199–210. [Google Scholar] [CrossRef] [PubMed]
  20. Song, L.; Gretton, A.; Bickson, D.; Low, Y.; Guestrin, C. Kernel belief propagation. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 707–715. [Google Scholar]
  21. Park, M.; Jitkrittum, W.; Sejdinovic, D. K2-ABC: Approximate bayesian computation with kernel embeddings. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, Cadiz, Spain, 9–11 May 2016; pp. 398–407. [Google Scholar]
  22. Li, Y.; Swersky, K.; Zemel, R.S. Generative moment matching networks. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1718–1727. [Google Scholar]
  23. Liu, F.; Xu, W.; Lu, J.; Zhang, G.; Gretton, A.; Sutherland, D. Learning deep kernels for non-parametric two-sample tests. arXiv 2020, arXiv:2002.09116. [Google Scholar]
  24. Long, M.; Wang, J.; Ding, G.; Sun, J.; Yu, P.S. Transfer feature learning with joint distribution adaptation. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; IEEE Computer Society: Sydney, Australia, 2013; pp. 2200–2207. [Google Scholar]
  25. Wang, W.; Li, H.; Ding, Z.; Wang, Z. Rethink maximum mean discrepancy for domain adaptation. arXiv 2020, arXiv:2007.00689. [Google Scholar] [CrossRef]
  26. Devroye, L.; Lugosi, G. Combinatorial Methods in Density Estimation; Springer: New York, NY, USA, 2001. [Google Scholar] [CrossRef]
  27. Baraud, Y.; Birgé, L. Rho-estimators revisited: General theory and applications. Ann. Statist. 2018, 46, 3767–3804. [Google Scholar] [CrossRef]
  28. Lin, H.-W.; Tsai, Y.; Lin, H.J.; Yu, C.-H.; Liu, M.-H. Unsupervised domain adaptation deep network based on discriminative class-wise MMD. AIMS Math. 2024, 9, 6628–6647. [Google Scholar] [CrossRef]
  29. Andrychowicz, M.; Denil, M.; Colmenarejo, S.G.; Hoffman, M.W.; Pfau, D.; Schaul, T.; de Freitas, N. Learning to learn by gradient descent by gradient descent. arXiv 2016, arXiv:1606.04474v2. [Google Scholar] [CrossRef]
  30. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017. [Google Scholar]
  31. Aronszajn, N. Theory of reproducing kernels. Trans. Am. Math. Soc. 1950, 68, 337–404. [Google Scholar] [CrossRef]
  32. Zheng, S.; Ding, C.; Nie, F.; Huang, H. Harmonic mean linear discriminant analysis. IEEE Trans. Knowl. Data Eng. 2019, 31, 1520–1531. [Google Scholar] [CrossRef]
  33. Borgwardt, K.M.; Gretton, A.; Rasch, M.J.; Kriegel, H.-P.; Schölkopf, B.; Smola, A.J. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 2006, 22, e49–e57. [Google Scholar] [CrossRef]
  34. Gao, W.; Shao, M.; Shu, J.; Zhuang, X. Meta-BN Net for few-shot learning. Front. Comput. Sci. 2023, 17, 171302. [Google Scholar] [CrossRef]
  35. Bechtle, S.; Molchanov, A.; Chebotar, Y.; Grefenstette, E.; Righetti, L.; Sukhatme, G.; Meier, F. Meta-learning via learned loss. arXiv 2019, arXiv:1906.05374. [Google Scholar]
  36. Müller, R.; Kornblith, S.; Hinton, G. When does label smoothing help? In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  37. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  38. Hull, J.J. A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Machine Intell. 1994, 16, 550–555. [Google Scholar] [CrossRef]
  39. SVHN Dataset. Available online: https://www.openml.org/search?type=data&sort=runs&id=41081&status=active (accessed on 1 March 2024).
  40. Saenko, K.; Kulis, B.; Fritz, M.; Darrell, T. Adapting visual category models to new domains. In Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2010; Volume 6314, pp. 213–226. [Google Scholar] [CrossRef]
  41. Office-Home Dataset. Available online: https://www.hemanthdv.org/officeHomeDataset.html (accessed on 1 March 2024).
  42. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  43. Saito, K.; Ushiku, Y.; Harada, T.; Saenko, K. Adversarial dropout regularization. arXiv 2018, arXiv:1711.01575. [Google Scholar] [CrossRef]
  44. Long, M.; Cao, Z.; Wang, J.; Jordan, M.I. Conditional adversarial domain adaptation. In Proceedings of the 32nd Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 1647–1657. [Google Scholar]
  45. Lee, C.Y.; Batra, T.; Baig, M.H.; Ulbricht, D. Sliced wasserstein discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10285–10295. [Google Scholar]
  46. Liang, J.; Hu, D.; Feng, J. Do we really need to access the source data? Source hypothesis transfer for unsupervised domain adaptation. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020; Volume 119, pp. 6028–6039. [Google Scholar]
  47. Pei, Z.; Cao, Z.; Long, M.; Wang, J. Multi-adversarial domain adaptation. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar] [CrossRef]
  48. van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. Available online: http://www.jmlr.org/papers/v9/vandermaaten08a.html (accessed on 1 March 2024).
Figure 1. Training framework for MCWMMD.
Figure 2. Training framework for MMDDK.
Figure 3. MTMMD network training process.
Figure 4. Digit data: (a) MNIST, (b) USPS, and (c) SVHN.
Figure 5. Office-31 data: (a) Webcam, (b) DSLR, and (c) Amazon.
Figure 6. Office-Home data.
Figure 7. t-SNE visualization of three tasks on digit datasets: (a) source, (b) target, and (c) source (red color) + target (blue color) (best viewed in color).
Table 1. Parameters and variables.

Symbol            Meaning
X_s               Set of samples from the source domain.
X_t               Set of samples from the target domain.
X_s^c             Set of samples of class c from the source domain.
X_t^c             Set of samples of class c from the target domain.
X^c               Union of source and target samples for class c.
X                 Union of all samples from source and target domains.
x_s               A single sample from the source domain.
x_t               A single sample from the target domain.
Z_s               Set of feature vectors of samples from the source domain.
Z_t               Set of feature vectors of samples from the target domain.
z_s               Feature vector of sample x_s from the source domain.
z_t               Feature vector of sample x_t from the target domain.
n_s, n_t          Number of samples in the source and target domains, respectively.
n_s^c, n_t^c      Number of samples of class c in the source and target domains, respectively.
m_s, m_t          Mean of samples in the source and target domains, respectively.
m_s^c, m_t^c      Mean of samples of class c in the source and target domains, respectively.
h(x)              Deep kernel function mapping features into latent space.
Θ                 Set of parameters of the feature extractor network.
γ, λ              Hyperparameters for balancing loss components.
η                 Learning rate for optimization.
Table 2. Accuracies (%) of several approaches on some digit datasets.

Methods             M→U     U→M     S→M     Average
Source-only         69.6    82.2    67.1    73.0
ADDA [5]            90.1    89.4    76.0    85.2
ADR [43]            93.1    93.2    95.0    93.8
CDAN [44]           98.0    95.6    89.2    94.3
CyCADA [12]         96.5    95.6    90.4    94.2
SWD [45]            97.1    98.1    98.9    98.0
SHOT [46]           97.8    97.6    99.0    98.1
DCWMMD [28]         98.0    98.2    98.8    98.3
MCWMMD              98.5    98.3    98.9    98.6
Target-supervised   98.9    99.4    99.4    99.2
Table 3. Accuracies (%) for domain adaptation experiments on the Office-31 dataset.

Methods             A→D     A→W     D→A     D→W     W→A     W→D     Average
Source-only         68.90   68.40   62.50   96.70   60.70   99.30   76.10
Wang et al. [25]    90.80   88.90   75.48   98.50   75.20   99.80   88.10
DAN [6]             78.60   80.50   63.60   97.10   62.80   99.60   80.40
DANN [4]            79.70   82.00   68.20   96.90   67.40   99.10   82.20
ADDA [5]            77.80   86.20   69.50   96.20   68.90   98.40   82.90
MADA [47]           87.80   90.00   70.30   97.40   66.40   99.60   85.20
SHOT [46]           93.90   90.10   75.30   98.70   75.00   99.90   88.80
CAN [3]             95.00   94.50   78.00   99.10   77.00   99.80   90.60
MDGE [2]            90.60   89.40   69.50   98.90   68.40   99.80   86.10
DACDM [9]           95.31   95.51   78.26   98.58   78.43   99.93   91.01
CDCL [10]           96.00   96.00   77.20   99.20   75.50   100     90.60
DMP [11]            91.00   93.00   71.40   99.00   70.20   100     87.40
DCWMMD [28]         96.30   94.90   77.90   99.50   76.50   99.60   90.80
MCWMMD              96.70   96.60   78.40   99.60   78.60   99.83   91.62
Target-supervised   98.00   98.70   86.00   98.70   86.00   98.00   94.30
Table 4. Accuracies (%) for domain adaptation experiments on the Office-Home dataset.

Methods             Ar→Cl   Ar→Pr   Ar→Rw   Cl→Ar   Cl→Pr   Cl→Rw   Pr→Ar   Pr→Cl   Pr→Rw   Rw→Ar   Rw→Cl   Rw→Pr   Average
Source-only         28.07   38.30   42.05   26.15   40.57   39.14   25.94   28.40   46.61   27.10   30.12   55.35   35.65
Wang et al. [25]    58.44   77.79   79.32   61.60   72.81   73.03   62.71   55.33   78.91   70.42   60.09   83.24   69.47
DAN [6]             43.60   57.00   67.90   45.80   56.50   60.40   44.00   43.60   67.70   63.10   51.50   74.30   56.28
DANN [4]            45.60   59.30   70.10   47.00   58.50   60.90   46.10   43.70   68.50   63.20   51.80   76.80   57.63
DACDM [9]           60.94   79.27   83.34   69.67   81.53   80.40   65.06   58.97   83.45   75.90   65.61   85.99   74.18
DMP [11]            59.00   81.20   86.30   68.10   72.80   78.80   71.20   57.60   84.90   77.30   61.50   82.90   73.50
DCWMMD [28]         59.69   80.23   81.31   70.24   79.45   82.65   69.20   57.87   85.12   74.80   64.20   83.14   73.99
MCWMMD              62.21   81.46   83.92   71.45   79.98   83.42   71.08   59.12   85.64   76.10   65.32   85.48   75.43
Target-supervised   93.24   92.35   92.08   91.16   92.35   92.08   91.16   93.24   92.08   91.16   93.24   92.35   92.21
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
