Fuzzy Graph Learning Regularized Sparse Filtering for Visual Domain Adaptation

Abstract: Distribution mismatch is common in multi-sensor systems, where it may be caused by different shooting angles, weather conditions and so on. Domain adaptation aims to build robust classifiers using the knowledge from a well-labeled source domain and to apply them to a related but different target domain. Pseudo labeling is a prevalent technique for class-wise distribution alignment, and numerous efforts have therefore been devoted to alleviating the issue of mislabeling. In this paper, unlike existing selective hard-labeling works, we propose a fuzzy-labeling-based graph learning framework for matching conditional distributions. Specifically, we construct the cross-domain affinity graph from the fuzzy label matrix of the target samples. In order to solve the problem of representation shrinkage, the paradigm of sparse filtering is introduced. Finally, a unified optimization method based on gradient descent is proposed. Extensive experiments show that our method achieves comparable or superior performance when compared to state-of-the-art works.


Introduction
Visual recognition is a challenging subject that has attracted increasing attention from researchers around the world. Despite the successes achieved, the performance of leading models may degrade in cross-domain scenarios, including but not limited to image classification [1], semantic segmentation [2] and person re-identification [3]. Such degradation stems from distribution mismatch, which is very common in real life. For images drawn from different sources, many factors lead to mismatches in distribution, including background, shooting angle, illumination and so on. Domain adaptation (DA) [4] aims at leveraging the rich knowledge (e.g., labels) in a source domain to build classifiers that are applied in a different but related target domain (with few or no labels), and has achieved dramatic success in computer vision [5]. Relevant research can be divided into two main categories according to whether labels exist in the target domain: unsupervised domain adaptation (UDA), in which the target domain has no labels at all, and semi-supervised domain adaptation (SSDA), in which part of the target labels are visible. In this paper, we focus on UDA problems. It is worth emphasizing that UDA follows the setting of transductive learning: data from both domains are available during training.
Mainstream UDA works focus on learning domain-invariant representations: they map the original input into a feature space with the goal of reducing the domain discrepancy. Moment-based statistics are often employed to measure how similar the two domains (or distributions) are, such as Maximum Mean Discrepancy (MMD) [6] and CORrelation ALignment (CORAL) [7]. As a practical example, Transfer Component Analysis (TCA) [8] attempted to minimize MMD by projecting features into a reproducing kernel Hilbert space (RKHS). CORAL matched the second-order statistics of the two domains' distributions by means of a linear transformation. Furthermore, the combination of moment-based statistics and deep neural networks also shows excellent performance. Another kind of practical DA solution is the pseudo-label-based method, which assigns pseudo labels to target samples so that class-wise adaptation becomes possible; this line of work has also made many advances [9]. The last one is adversarial training, in which a feature extractor and a domain discriminator are constructed and then updated asynchronously during training. Adversarial training plays the same role as moment-based statistics; the principal difference is that a learning-based indicator, rather than a handcrafted one, is employed to measure the domain discrepancy [10,11].
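As a concrete illustration of the moment-based measures mentioned above, here is a minimal sketch (ours, not from the paper) of the simplest MMD estimate. With a linear kernel, the squared MMD reduces to the squared distance between the empirical feature means of the two domains; TCA's kernelized RKHS version is more elaborate.

```python
import numpy as np

def linear_mmd2(Xs, Xt):
    """Squared MMD with a linear kernel: the squared Euclidean distance
    between the empirical feature means of the source and target samples."""
    diff = Xs.mean(axis=0) - Xt.mean(axis=0)
    return float(diff @ diff)

# Identical samples give exactly zero discrepancy; a mean shift gives a
# positive value that grows with the size of the shift.
rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=(200, 3))
Xt = rng.normal(1.0, 1.0, size=(200, 3))
print(linear_mmd2(Xs, Xs))  # exactly 0.0
print(linear_mmd2(Xs, Xt))  # positive
```

Minimizing such a statistic over a learned feature transformation is the common thread of the moment-based methods discussed here.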
Though previous DA approaches have shown superior performance in some respects, there are two major drawbacks. Firstly, moment-based and adversarial-training-based DA methods are able to reduce the distribution discrepancy between domains, but they can hardly guarantee that samples belonging to the same category in the two domains are similar. For instance, studies indicated that class weight bias leads to significant performance reduction [12]. Secondly, for pseudo-label-based DA methods, wrongly chosen labels are inevitable and will ruin further analysis. Some efforts have been devoted to the selection of labels, adopting an incremental framework. The underlying philosophy is that the classifier is weak in the beginning, so fewer labels should be picked; as training goes on, the classifier grows stronger and more pseudo labels can be regarded as high-confidence. However, this has not fundamentally solved the problem of wrongly chosen labels. In this paper, we attempt to handle this situation from another view: reducing the negative effects of mislabeling. Specifically, we utilize soft labeling, rather than hard labeling, to construct the affinity graph. In order to keep the discriminative power of the learned representation, we regard the fuzzy graph objective as a regularization term and integrate it into the paradigm of sparse filtering [13]. Finally, a unified optimization framework based on gradient descent is proposed. The contributions of this paper are summarized as follows.

•
We propose a novel solution for UDA problems, which is able to learn discriminative and domain-shared representations simultaneously. Extensive experiments on several real-world datasets demonstrate its superiority over existing works.
•
Different from previous hard-labeling methods, we design a fuzzy graph regularization based on soft labeling. Specifically, it describes the cross-domain affinity by means of a probability matrix.
•
In order to deal with the problem of representation shrinkage, we combine the proposed fuzzy graph regularization with unsupervised sparse filtering, so that the two objectives are antagonistic and converge to a compromise.
The rest of this paper is organized as follows. Section 2 gives background knowledge on existing DA works, especially the highly correlated pseudo-label methods, and then introduces sparse filtering. In Section 3, we describe our method in detail. Sections 4 and 5 present the experiments and some empirical analysis. Finally, Section 6 concludes the paper and provides some ideas for future research.

Related Works
In this section, we first give a formal definition of domain adaptation and then review existing pseudo-label-based DA methods. For the sake of better understanding, we subsequently introduce sparse filtering.

Domain Adaptation
Domain adaptation is a branch of transfer learning that deals with the scenario where training and testing data have different distributions; some literature calls this dataset bias or covariate shift. Matching the marginal distributions of the source and target domains cannot guarantee the separability of target samples, so pseudo-label-based DA methods have become more popular. Overall, they can be divided into two categories according to whether the method applies a selection strategy to the pseudo labels.

Pseudo Label without Selection
During training, such methods assign pseudo labels to the whole target domain without considering confidence. Class-wise distribution alignment then becomes feasible by using the pseudo-labeled target samples together with the labeled source samples. Through iterative learning, the pseudo labels are expected to become progressively more accurate. Obviously, this expectation does not always hold, especially when the initial prediction is weak. As an extension of TCA, Joint Distribution Adaptation (JDA) [14] aligned the conditional and marginal distributions simultaneously using pseudo labels, employing an iterative training scheme to obtain more accurate labels. Inspired by this, a series of studies modified classical feature extraction methods (e.g., linear discriminant analysis and locality preserving projections [15,16]) by means of pseudo labels.

Pseudo Label with Selection
To alleviate the negative effects caused by wrongly chosen labels, such methods select a subset of target samples with their corresponding pseudo labels. Specifically, both the labels and their confidences need to be considered, and only the selected target samples are combined with the source samples for training. An easy-to-hard training scheme, which adopts an increasing number of target samples, is widely used for DA problems. CAPLS is a classical selective pseudo-labeling method that uses the softmax output to denote the classification confidence [17]. Further, SPL considers structural information to obtain more reliable labels [18].
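The easy-to-hard selection scheme described above can be sketched as follows. This is a simplified illustration in numpy; the function name and the linear schedule are our assumptions, not the actual implementations of CAPLS or SPL.

```python
import numpy as np

def select_confident(probs, keep_frac):
    """Pick the top keep_frac fraction of target samples by softmax
    confidence (the maximum class probability). Returns the selected
    indices and their hard pseudo labels."""
    conf = probs.max(axis=1)            # confidence of each sample
    labels = probs.argmax(axis=1)       # hard pseudo label of each sample
    n_keep = max(1, int(round(keep_frac * len(conf))))
    idx = np.argsort(-conf)[:n_keep]    # most confident first
    return idx, labels[idx]

# easy-to-hard: grow keep_frac over iterations, e.g. keep_frac = t / T
probs = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])
idx, labels = select_confident(probs, keep_frac=2/3)
print(idx, labels)  # → [0 2] [0 1]
```

Only the selected target samples (here, samples 0 and 2) would be combined with the labeled source samples at this iteration; the ambiguous sample 1 is deferred to a later round.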

Sparse Filtering
Sparse filtering is an unsupervised feature extraction algorithm. Unlike auto-encoders and sparse coding, it does not seek to model the data explicitly, but attempts to obtain ideal features from the perspective of what makes features good. It first gives three requirements in answer to the question of what good features are.

•
Population Sparsity. Each example should be represented by only a few active features. Ideally, a sample should be a vector containing many zeros and few non-zero values.
•
Lifetime Sparsity. Good features should be distinguishable; therefore, a feature is only allowed to be activated in a few samples. Conversely, if a feature is activated by all samples, we cannot classify samples according to this feature, so it is not a good feature.
•
High Dispersal. Each feature should have similar statistical properties across all samples. Although this may seem useless, the authors indicate that it prevents feature degradation, such as similar features across samples.
Ngiam et al. pointed out that we can obtain features that meet all the requirements by optimizing only population sparsity and high dispersal [13]. Suppose we have n samples, each with m-dimensional features, collected as x ∈ R^(n×m). The learning paradigm of sparse filtering is as follows:
(1) Linear feature extraction and non-linear activation. Let f = xW with a learnable W ∈ R^(m×k); an activation function such as the soft-absolute function, f ← sqrt(f^2 + ε), then makes the features more expressive.
(2) Ensuring high dispersal. Each feature (column) is divided by the l2-norm of that feature over all samples, f̃_j = f_j / ||f_j||_2. Recall that high dispersal requires the statistical properties of each feature to be similar; this step roughly forces the sum of squares of every feature to be 1.
(3) Ensuring population sparsity. Each sample (row) is divided by its own l2-norm, f̂_i = f̃_i / ||f̃_i||_2. The objective is then the sum of the l1-norms of the normalized samples, L_SF = Σ_i ||f̂_i||_1.
This objective function can be easily understood through an example. Imagine that we have two different representations of a sample, (0.6, 0.8)^T and (1, 0)^T. We say (1, 0)^T is the sparser representation since it has more zeros; both lie on the unit l2-ball, but its l1-norm, and hence its loss, is smaller (1.0 versus 1.4).
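The example above can be checked numerically. This small sketch (function name ours) computes the l1-norm of an l2-normalized sample, which is exactly the per-sample quantity that the sparse filtering objective sums:

```python
import numpy as np

def row_sparsity_loss(x):
    """l1-norm of an l2-normalized sample: smaller means sparser."""
    x = np.asarray(x, dtype=float)
    return float(np.abs(x / np.linalg.norm(x)).sum())

print(row_sparsity_loss([0.6, 0.8]))  # ≈ 1.4
print(row_sparsity_loss([1.0, 0.0]))  # ≈ 1.0
```

The sparser representation (1, 0)^T indeed attains the smaller loss, matching the intuition in the text.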

Methodology
In this section, we give the mathematical definition of DA problem, describe each component of the proposed method in detail and analyze the complexity.

Problem Definition and Notations
Given a labeled source domain D_s = {(x_s^1, y_s^1), · · · , (x_s^{n_s}, y_s^{n_s})} and an unlabeled target domain D_t = {x_t^1, · · · , x_t^{n_t}}, the goal is to build a classifier and make predictions on the target data. Specifically, we study homogeneous DA, which means that the source and target data have the same dimension, x_s, x_t ∈ R^m. The target labels y_t are only available for evaluating the algorithm. Table 1 introduces the necessary notations and descriptions used in this paper.
Domain: a domain D consists of two parts, the data X and the corresponding distribution p(X), so D = {X, p(X)}. Generally, D_s and D_t represent the labeled source and unlabeled target domain, respectively. In domain adaptation, p(X_s) ≠ p(X_t). It is worth emphasizing that obtaining the explicit distribution p(X) is often very hard, especially for high-dimensional data, so an alternative solution is to use statistics, e.g., the mean or variance, to estimate the distribution.
Task: a task T is the specific application of domain adaptation, which contains two parts, the label space Y and the mapping function f(·), so T = {Y, f(·)}. Naturally, given data X, we have Y = f(X). For standard domain adaptation problems the tasks are the same, T_s = T_t.

Notation    Description
W           the transformation matrix to be solved
L_FGL       the objective function of fuzzy graph learning
L_SF        the objective function of sparse filtering
λ           the balance factor between the two objectives

Fuzzy Graph Learning for Domain Alignment
How to reduce the domain discrepancy is a key problem for DA. Moment-based and adversarial-training-based strategies cannot guarantee the separability of target samples, because they do not exploit the structural information across different classes. Therefore, we propose fuzzy graph learning for class-wise alignment. The underlying idea is that samples belonging to the same class (no matter which domain they come from) should be close to each other. Since there are no labels for target samples, we must adopt pseudo labels as a substitute. Obviously, the quality of the pseudo labels greatly influences the learned classifier/representation, and may cause a catastrophic accumulation of errors if not handled properly.
For our method, we propose fuzzy pseudo labels, which employ a probability matrix to update the classifier. In Figure 1, we give a graphical illustration of how soft labels work. The class centroids are calculated from the labeled source samples, and the decision boundary is obtained by means of the distance to each centroid. For an unreliable training sample located at the intersection of decision boundaries (equivalently, one that has similar distances to each centroid), the probabilities of it belonging to the categories are extremely close, e.g., 24%, 25%, 26%. In such a situation, assigning the label with the biggest probability is risky: as the middle figure shows, the updated centroid would strongly influence the decision boundaries. When we adopt soft labels instead, as the last figure shows, each centroid moves toward the training sample by a distance proportional to the corresponding probability value. Intuitively, soft labels have a significantly smaller effect on decision boundaries than hard labels.
Specifically, suppose we have the learned representations x̂_s = k(x_s) ∈ R^(k×n_s) and x̂_t = k(x_t) ∈ R^(k×n_t) (we will introduce the transformation k(·) later), along with the source labels y_s ∈ {1, 2, · · · , c}. We use x̂ = [x̂_s, x̂_t] ∈ R^(k×(n_s+n_t)) to denote the combination of the two domains' features. The objective of fuzzy graph learning can be written as L_FGL = (1/2) Σ_{i,j} W_ij ||x̂_i − x̂_j||², where W ∈ R^((n_s+n_t)×(n_s+n_t)) is the affinity graph and W_ij ∈ [0, 1] denotes the similarity between x̂_i and x̂_j; W can be partitioned into the within-domain blocks W_ss, W_tt and the cross-domain blocks W_st, W_ts. Fuzzy connections only exist in the cross-domain cases, so we describe how each block of the affinity graph is constructed separately.
Construction of W_ss: For the labeled source domain, we adopt the supervised graph, that is, edges exist only between samples with the same label. Ideally, this helps obtain discriminative representations.
Construction of W_tt: For the unlabeled target domain, we encourage target samples to be close to their neighboring samples, so a cut-off is necessary. In this paper, we exploit the top-k (k = 10) nearest neighbors within the target domain,
where N_10(x̂_i) denotes the set of x̂_i's top-10 nearest neighbors in the target domain.
Construction of W_st and W_ts: To build the fuzzy graph that models the cross-domain similarity, we first compute the class centroids of the source samples; then the probability p(y | x̂_t^i) that target sample x̂_t^i belongs to class y can be computed from the Euclidean distance to each centroid. W_st indicates the similarity from source to target samples: for a source sample x̂_s^i and a target sample x̂_t^j, if they belong to the same category they should have a strong connection, and vice versa. Since we utilize soft labels here, this is interpreted as follows: if target sample x̂_t^j has a large probability of possessing the same label as the source sample (note that the source sample x̂_s^i and its label y_s^i are both available), they should have a strong connection. The pseudo code of the cross-domain affinity graph construction is given in Algorithm 1, and we obtain W_ts directly by W_ts = W_st^T.
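A minimal sketch of the cross-domain block construction follows (our illustration of the idea behind Algorithm 1, not its verbatim pseudo code). The paper computes p(y | x̂_t) from Euclidean distances to the source class centroids; the exact form is not specified here, so the softmax over negative distances below is an assumption of ours.

```python
import numpy as np

def fuzzy_cross_graph(Xs, ys, Xt, n_classes):
    """Build W_st: entry (i, j) is the probability that target sample j
    shares source sample i's label, derived from target-to-centroid
    distances. Returns W_st and the fuzzy label matrix p."""
    centroids = np.stack([Xs[ys == c].mean(axis=0) for c in range(n_classes)])
    d = np.linalg.norm(Xt[:, None, :] - centroids[None, :, :], axis=2)  # (n_t, c)
    p = np.exp(-d)
    p /= p.sum(axis=1, keepdims=True)   # fuzzy label matrix, rows sum to 1
    W_st = p[:, ys].T                   # (n_s, n_t): prob. target j has label y_s^i
    return W_st, p

# tiny example: one target sample much closer to the class-0 centroid
Xs = np.array([[0.0, 0.0], [1.0, 1.0]])
ys = np.array([0, 1])
Xt = np.array([[0.1, 0.0]])
W_st, p = fuzzy_cross_graph(Xs, ys, Xt, n_classes=2)
print(p)  # class-0 probability dominates, but class 1 keeps a soft share
```

Note how no hard decision is made: the ambiguous mass stays spread over the classes, which is exactly what keeps a single unreliable sample from jerking the decision boundaries around.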

Sparse Filtering for Discriminative Feature Learning
Fuzzy graph learning is able to reduce the domain discrepancy, but it is an ill-posed problem in nature. For instance, if we choose two constant functions, k_1(x) = a and k_2(x) = b with a ≠ b, both of them minimize the objective L_FGL. From the intuitive example shown in Figure 1, we can see that unreliable samples lead to representation shrinkage. Consequently, the learned representation loses its discriminability, so an additional penalty is indispensable. In this paper, we adopt the objective of sparse filtering for two reasons. Firstly, its population sparsity step projects each sample onto the l2-ball, which effectively alleviates the feature contraction. Secondly, it allows us to exploit the potential discriminant structures in the unlabeled target domain.
Corresponding to the optimization of sparse filtering, the transformation function k(·) of our method can be summarized as follows. Given the original inputs x_s ∈ R^(n_s×m) and x_t ∈ R^(n_t×m), the goal is to find an optimal W ∈ R^(m×k) and then obtain the new representations x̂_s ∈ R^(n_s×k) and x̂_t ∈ R^(n_t×k) for downstream applications.
1. Linear transformation and non-linear activation. We use the soft-absolute function as the activation function in this paper, x̂_1 = sqrt((xW)^2 + ε), where ε denotes a small number, such as 1e-5.
2. L2 column normalization. In the current x̂_1, each row represents a sample and each column represents a feature. This step divides each column by its l2-norm over all samples, enforcing the L2 norm of each feature to be 1 so that high dispersal is achieved; in matrix form it can be written with the Hadamard product ∘, which simplifies the gradient derivation. Note that we perform within-domain normalization, since it is proven to be powerful for DA problems.
3. L2 row normalization. Analogously to the previous step, each sample (row) is divided by its own l2-norm, enforcing the L2 norm of each sample to be 1 and projecting the data onto the l2-ball.
For the sake of simplification, we integrate the mapping from x_s, x_t to x̂_s3, x̂_t3 into a single step k(·), so the final representations are x̂_s = k(x_s) = x̂_s3 and x̂_t = k(x_t) = x̂_t3. Correspondingly, the objective of sparse filtering can be represented as L_SF = Σ_i ||x̂_i||_1. The total objective of our method can be written as L = L_SF + λ L_FGL, where λ is a hyper-parameter balancing the two objectives.
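The three-step mapping k(·) can be sketched compactly as follows (our illustration; the function is applied to each domain separately, which realizes the within-domain normalization):

```python
import numpy as np

def k_transform(X, W, eps=1e-5):
    """Sketch of k(.): linear projection, soft-absolute activation,
    L2 column normalization (high dispersal), then L2 row
    normalization (population sparsity)."""
    Z = np.sqrt((X @ W) ** 2 + eps)                   # soft-absolute activation
    Z = Z / np.linalg.norm(Z, axis=0, keepdims=True)  # column (feature) norm
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # row (sample) norm
    return Z

rng = np.random.default_rng(0)
Xs = rng.normal(size=(8, 4))
W = rng.normal(size=(4, 5))
Zs = k_transform(Xs, W)
# after the final step, every sample lies on the unit l2-ball
print(np.linalg.norm(Zs, axis=1))
```

Because the soft-absolute activation is strictly positive, neither normalization can divide by zero, and the row normalization guarantees that all samples end up on the unit l2-ball, which is the property that counteracts representation shrinkage.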

Optimization
Our method is formally defined as a weighted sum of two objectives, so we analyze the two parts separately.

Gradient of L_FGL
Considering the combination of the two domains' features, x̂ = [x̂_s, x̂_t], the objective can be reformulated as L_FGL = (1/2) Σ_{i,j} W_ij ||x̂_i − x̂_j||² = tr(x̂ L x̂^T), where D_ii = Σ_{j=1}^{n_s+n_t} W_ij denotes the degree of x̂_i and L = D − W is the graph Laplacian. Naturally, since W is symmetric, we can compute the gradient by ∂L_FGL/∂x̂ = 2 x̂ L.
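The trace form of the objective and its gradient can be verified numerically. In this sketch (names ours) we use L_FGL = tr(x̂ L x̂^T) with a symmetric affinity W, for which the analytic gradient is 2 x̂ L:

```python
import numpy as np

def l_fgl(Xhat, W):
    """Graph objective tr(Xhat L Xhat^T) with L = D - W; Xhat is (k, n)
    with samples as columns. For symmetric W this equals
    0.5 * sum_ij W_ij * ||x_i - x_j||^2."""
    L = np.diag(W.sum(axis=1)) - W
    return float(np.trace(Xhat @ L @ Xhat.T))

def l_fgl_grad(Xhat, W):
    """Analytic gradient 2 * Xhat @ L (valid for symmetric W)."""
    L = np.diag(W.sum(axis=1)) - W
    return 2.0 * Xhat @ L

# finite-difference check of the gradient on a tiny random instance
rng = np.random.default_rng(1)
X = rng.normal(size=(2, 4))
W = rng.random((4, 4)); W = (W + W.T) / 2   # symmetrize
g = l_fgl_grad(X, W)
eps = 1e-6
Xp = X.copy(); Xp[1, 2] += eps
num = (l_fgl(Xp, W) - l_fgl(X, W)) / eps
print(g[1, 2], num)  # the two estimates agree closely
```

Such a finite-difference check is a cheap way to validate hand-derived gradients before plugging them into the unified optimization.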

Gradient of L_SF
Note that x̂ ≥ 0 because of the soft-absolute activation, so L_SF reduces to the sum of all entries of x̂ and its gradient is simply a matrix full of ones. Here we use 1 to represent the all-ones matrix of the appropriate dimension, e.g., ∂L_SF/∂x̂_s = 1 ∈ R^(n_s×k) and ∂L_SF/∂x̂_t = 1 ∈ R^(n_t×k).

Gradient of k(·)
To propagate the gradient back to the input layer, we need ∂x̂/∂x to apply the chain rule. There are some differences between the two domains since we adopt within-domain normalization. To avoid misunderstanding, we give the specific form for each domain.

Unified Optimization Based on Gradient Descent
Given the above gradients, we can update W by gradient descent from a random initialization. The training process is similar to training a neural network, and tricks that help convergence can also be applied here, such as initializing with a normal distribution and using a decreasing step size. The specific training scheme is shown in Algorithm 2.

Algorithm 2:
Optimization of the proposed method.
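A skeleton of this training scheme is given below (our sketch, not the paper's exact Algorithm 2; `grad_fn` is a hypothetical placeholder for the combined gradient of L_SF + λ L_FGL propagated back to W via the chain rule):

```python
import numpy as np

def train(Xs, Xt, grad_fn, n_epochs=50, lr0=0.1, k=10, seed=0):
    """Skeleton of the unified gradient-descent optimization: W is
    initialized from a normal distribution and updated with a
    decreasing step size. grad_fn(W, Xs, Xt) stands in for the
    combined gradient with respect to W."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(Xs.shape[1], k))  # normal init
    for t in range(n_epochs):
        lr = lr0 / (1.0 + t)            # decreasing step aids convergence
        W = W - lr * grad_fn(W, Xs, Xt)
    return W

# toy check: with a quadratic surrogate gradient 2W, the iterates shrink
Xs = np.zeros((4, 3)); Xt = np.zeros((5, 3))
W = train(Xs, Xt, grad_fn=lambda W, Xs, Xt: 2.0 * W)
print(W.shape)  # (3, 10)
```

The decaying schedule lr0 / (1 + t) is one common choice; any diminishing step rule would serve the same purpose here.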

Experiments
In this section, we introduce two widely-used DA datasets and then compare the proposed method with state-of-the-art works. In addition, we provide a parameter sensitivity analysis to show the robustness of the proposed method.

Datasets
ImageCLEF: ImageCLEF contains images drawn from three domains: Caltech (C), ImageNet (I) and Pascal (P); six cross-domain tasks, e.g., C → I and P → C, are constructed.
Office-Caltech10 [22]: Office-Caltech10 contains ten object categories drawn from four image domains: Amazon (A), Webcam (W), DSLR (D), and Caltech256 (C). There are 8-151 samples per category per domain, and 2533 images in total. In our experiments, we adopt the DeCAF6 features (input dimension m = 4096) of the images as input for the tested algorithms; twelve cross-domain tasks, e.g., C → A, C → W and C → D, are constructed.

Experimental Setting
Since there are no deep architectures in the proposed method, we compare our approach to several state-of-the-art shallow methods to evaluate its effectiveness.

•
Nearest Neighbor (NN) [23]. NN serves as a baseline model to check whether the learned representations really work for DA problems.
Following the experimental settings of JDA and MASF, we set the subspace dimension k = 100. For JDA, we set the regularization coefficient λ = 1 and the number of iterations T = 10. For CAPLS, we set the number of iterations T = 10. For MASF, we set the regularization coefficient α = 1e−3. For SPL, we set the number of iterations T = 11. It is worth emphasizing that the input features are extracted by deep networks without fine-tuning, and no pre-processing strategy is applied in the experiments.

Results
In this section, we report the accuracy of the proposed method (abbreviated as 'FGLSF', for Fuzzy Graph Learning regularized Sparse Filtering) and other state-of-the-art works; the results are shown in Table 2. Another interesting finding is that the proposed method works well on ImageCLEF, achieving the best performance on all subtasks, whereas on Office-Caltech10 SPL works better. The biggest difference between these two datasets is the class weights: ImageCLEF is an absolutely class-balanced dataset, meaning that each object has the same number of pictures, but Office-Caltech10 has class weight bias, which makes it a more complicated problem. When we infer the pseudo labels, a class with more examples is expected to be dominant, and vice versa. The proposed method only uses source labels to obtain pseudo labels, so it may be easily influenced by class weight bias. SPL considers both source labels and manifold structures, and thus achieves better performance on Office-Caltech10.

Parameter Sensitivity Analysis
For our method, there is one hyper-parameter to be adjusted: λ, the weight of fuzzy graph learning. We run FGLSF with values of λ sampled from {0, 0.001, 0.01, 0.1, 0.2} on three tasks, i.e., C → W, W → C and P → C. If λ is too small, the model fails to align the distributions of the two domains; as it gets larger, the problem of representation shrinkage emerges. The experimental results shown in Figure 3 are consistent with this analysis: the accuracy first rises and then falls as λ grows.

Discussion
To prove the effectiveness of the proposed method, we conduct two sets of comparative experiments: 1. How does the accuracy change during training? 2. How does fuzzy labeling compare with selective hard labeling?

Does Iterative Learning Help Improve the Model?
To answer the question "Does iterative learning help improve the model?", we report the performance on 18 tasks during training; the results are shown in Figure 4. Specifically, we divide the full training process into 10 parts and record the current accuracy. Obviously, the model becomes more precise as training goes on. On the other hand, we can also see some fluctuations, which suggests that our approach cannot completely eliminate the effects of mislabeling.

How about Using Selective Hard Labeling?
Figure 5 shows the performance of the two labeling strategies. Intuitively, the proposed fuzzy labeling and state-of-the-art selective hard labeling have similar overall performance; e.g., selective hard labeling performs better on tasks {7, 8} while fuzzy labeling is superior on tasks {5, 15}. This finding confirms that the proposed fuzzy labeling is able to alleviate the issue of wrong labels.

Conclusions
In this paper, we propose a novel DA solution based on fuzzy graph learning, which learns discriminative and domain-shared representations simultaneously. Compared to data augmentation methods, it considers how to learn a supervised model under dataset bias, aiming to minimize the performance degradation brought by domain discrepancy. The main contribution is that we offer another view of the problem of mis-selected pseudo labels: from selecting labels to minimizing the negative effects of mislabeling. Compared with existing methods, the proposed method adopts a fuzzy labeling framework, which uses a label matrix rather than a single selected label for each target sample. To avoid trivial solutions, we combine sparse filtering with fuzzy graph learning to solve the problem of representation shrinkage. The experimental results verify that our method outperforms many state-of-the-art shallow methods; the increases in mean accuracy are 0.3-6.6%. In the future, we plan to study how to adjust the hyper-parameter λ according to task similarity.

Conflicts of Interest:
The authors declare no conflict of interest.