Discriminative Sparse Filtering for Multi-Source Image Classification

Distribution mismatch caused by varying resolutions, backgrounds, etc., is common in multi-sensor systems. Domain adaptation attempts to reduce such domain discrepancy by means of different measurements, e.g., maximum mean discrepancy (MMD). Despite their success, such methods often fail to guarantee the separability of the learned representations. To tackle this issue, we put forward a novel approach that jointly learns domain-shared and discriminative representations. Specifically, we model feature discrimination explicitly for both domains. Alternating discriminant optimization is proposed to obtain discriminative features with an l2 constraint in the labeled source domain, and sparse filtering is introduced to capture the intrinsic structures existing in the unlabeled target domain. Finally, they are integrated into a unified framework along with MMD to align the domains. Extensive experiments and comparisons with state-of-the-art methods verify the effectiveness of our method on cross-domain tasks.


Introduction
A basic assumption of many machine learning algorithms is that the training and testing data share the same distribution. Today, data sources are more diverse due to the lower cost of data acquisition. For visual images, small changes such as lighting conditions, acquisition devices, or backgrounds can cause inconsistent distributions, and it is expensive to label each source's data. Domain adaptation aims to train a robust classifier on a labeled source domain to predict on an unlabeled target domain [1], and has achieved significant progress in image classification [2,3], speech recognition [4,5], person re-identification [6], and many other areas.
An intuitive idea for domain adaptation is to re-weight the training samples and reduce the distance between the source and target domains at the instance level [7]. Another popular way is to reduce the discrepancy between domains at the feature level, which attempts to learn domain-shared representations. Ben-David et al. pointed out that transferable features can be obtained by minimizing the distance between domains and maximizing the source margin simultaneously [8]. Based on this theory, many feature-driven domain adaptation methods have been proposed. Pan et al. mapped the data from both domains to a high-dimensional Hilbert space and then minimized the domain discrepancy [9]. The measurement employed in [9] is maximum mean discrepancy (MMD) [10], which is capable of characterizing the distance between two sets of samples. Long et al. adapted the marginal and conditional probabilities of the two domains simultaneously by assigning pseudo-labels to the target domain, and achieved more accurate results in an iterative manner [11]. Gong et al. integrated an infinite number of subspaces and characterized domain changes in terms of geometric and statistical properties [12].
The main contributions of this paper are summarized as follows:
• We propose a novel unsupervised domain adaptation solution that reduces domain discrepancy and extracts discriminative features simultaneously. Compared to existing works, the proposed method models feature distinctiveness with an explicit constraint. Comparisons with state-of-the-art methods show that our method performs well in both accuracy and efficiency.
• Alternating discriminant optimization is proposed to obtain discriminative features in the labeled source domain, using an l2 objective to measure feature distinctiveness. We use a toy example to demonstrate how it works.
• We combine sparse filtering and maximum mean discrepancy into an integrated framework, and propose a unified optimization method with full-batch and mini-batch gradient descent.
The rest of the paper is organized as follows. Section 2 formalizes the domain adaptation problem, reviews related works, and introduces sparse filtering and maximum mean discrepancy. Our method is presented in Section 3, and the experimental evaluation in Section 4. Finally, we summarize the paper and discuss future work in Section 5.

Related Works
In this section, we give a definition of transfer learning and explain its relationship with domain adaptation. Depending on whether labeled samples in the target domain are available, the problem can be divided into semi-supervised and unsupervised domain adaptation. In this paper, we focus on unsupervised domain adaptation, which means that the target domain does not have any labeled samples. Following that, we introduce sparse filtering and maximum mean discrepancy.

Transfer Learning and Domain Adaptation
There are two important concepts in transfer learning: domain and task. A domain, D = {X, P(X)}, can be thought of as a set of data with a feature space X and a marginal probability distribution P(X). A task also has two components, T = {Y, f(·)}, where Y is the label space and f(·) is the mapping function. Traditional machine learning methods assume the same domain and task for training and testing. When the domains or tasks differ, we call it transfer learning (TL). According to the similarity of domains and tasks, TL can be divided into inductive TL and transductive TL. In this paper, we focus on transductive TL, where the domains are different but related and the tasks are the same.
Domain adaptation can be seen as a kind of transductive TL. Given labeled source data (X_s, Y_s) and unlabeled target data X_t, where the two domains have different distributions, domain adaptation (DA) aims to find the labels Y_t of the target data. When the target set is completely unlabeled, it is called unsupervised domain adaptation, which is the focus of this paper. The mathematical form is defined as follows [11]:

P(X_s) ≠ P(X_t).

Existing methods try to align features by means of a variety of transformations (e.g., kernels [29], deep neural networks [17]). One crucial issue is how to measure the discrepancy between domains. There are two widely used approaches: (a) alignment of moments, whether the first-order moment (maximum mean discrepancy [10]) or the second-order (CORAL [30]); and (b) adversarial training, whose main idea is to establish a feature extractor and a domain discriminator simultaneously and train them as generative adversarial nets [31].

Sparse Filtering
Sparse filtering is an effective and simple unsupervised feature extraction method proposed in [27]. It only requires one input parameter: the number of features. Unlike other feature extraction methods, it does not attempt to model the raw data; instead, it starts from what makes features good and directly constrains the extracted features. As a major contribution, the authors gave three principles of such good features. Furthermore, they pointed out that ideal representations can be obtained by jointly optimizing population sparsity and high dispersal, so there is no need to optimize lifetime sparsity explicitly; interested readers can refer to the original paper for more details. Suppose we have n samples, each with m-dimensional features, written as X = [x_1, x_2, ..., x_n], x_i ∈ R^m. The optimization of sparse filtering proceeds as follows: (1) Linear feature extraction. Let f_j^(i) = w_j^T x_i represent the jth feature of the ith sample. We can then apply an activation function to make the features more expressive, such as the soft absolute function.
(2) Solving high dispersal. Each feature (column) is divided by the l2-norm of that feature over all samples. Recall that high dispersal requires the statistical properties of each feature to be similar; this step forces the sum of squares of every feature over all samples to be 1, so that all features are roughly equally active.
(3) Solving population sparsity. Each sample (row) is divided by its own l2-norm. The objective is then to minimize the l1-norm of the normalized features, L = Σ_i ||f̂^(i)||_1.
An advantage of using l2 normalization is that it introduces a competition mechanism: while some components become larger, others must become smaller. The result of this competition is that the representation becomes sparse.
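The three steps above can be sketched in a few lines of NumPy (a minimal illustration with names of our choosing; the paper's implementation is in MATLAB):

```python
import numpy as np

def soft_abs(z, eps=1e-8):
    """Smooth absolute-value activation used by sparse filtering."""
    return np.sqrt(z ** 2 + eps)

def sparse_filtering_objective(W, X):
    """Sparse filtering objective for data X (n samples x m dims)
    and weights W (m dims x k features).
    (1) linear features + soft-absolute activation,
    (2) l2-normalize each feature column across samples (high dispersal),
    (3) l2-normalize each sample row (population sparsity),
    then sum the entries (l1 penalty on the normalized features)."""
    F = soft_abs(X @ W)                                  # step (1)
    F = F / np.linalg.norm(F, axis=0, keepdims=True)     # step (2): column norm
    F = F / np.linalg.norm(F, axis=1, keepdims=True)     # step (3): row norm
    return F.sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
W = rng.standard_normal((20, 5))
loss = sparse_filtering_objective(W, X)
```

Minimizing this objective over W (e.g., by gradient descent) yields the sparse representation; since every row is unit-normalized, the loss is bounded by n·sqrt(k).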

Maximum Mean Discrepancy
Maximum mean discrepancy (MMD) is widely used to measure the difference between distributions [10]. For domain adaptation problems, researchers have pointed out that marginal distribution adaptation can be achieved by minimizing the MMD, which computes the distance between the sample means in a k-dimensional embedding [9,11]:
|| (1/n_s) Σ_{x_i ∈ X_s} A^T x_i − (1/n_t) Σ_{x_j ∈ X_t} A^T x_j ||_2^2 = tr(A^T X^T M_0 X A), (5)

where M_0 is the MMD matrix that can be computed as

(M_0)_{ij} = 1/n_s^2 if x_i, x_j ∈ X_s; 1/n_t^2 if x_i, x_j ∈ X_t; −1/(n_s n_t) otherwise.

Intuitively, the source and target data are stacked together as X ∈ R^{(n_s+n_t)×m}, where m denotes the feature dimension of the original data and n_s/n_t denote the numbers of source/target samples. The first n_s rows are instances from the source domain, followed by n_t rows from the target domain. A ∈ R^{m×k} is the adaptation matrix that maps each original sample to a k-dimensional space. As shown on the left of Equation (5), the MMD computes the mean vectors for the source and target domains first, then takes the squared l2-norm of the difference between the two vectors.
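A small NumPy check (variable names ours) of the identity in Equation (5): the trace form with M_0 equals the squared distance between the two embedded sample means.

```python
import numpy as np

def mmd_matrix(ns, nt):
    """MMD coefficient matrix M0 for ns source and nt target samples:
    (M0)_ij = 1/ns^2 if both samples are source, 1/nt^2 if both are
    target, and -1/(ns*nt) otherwise. Built as an outer product of the
    indicator vector e = [1/ns, ..., -1/nt, ...]."""
    e = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])
    return np.outer(e, e)

rng = np.random.default_rng(0)
ns, nt, k = 30, 40, 5
Z = rng.standard_normal((ns + nt, k))   # embedded source+target samples (A^T x)
M0 = mmd_matrix(ns, nt)

# tr(Z^T M0 Z) equals the squared distance between the two sample means
mmd_trace = np.trace(Z.T @ M0 @ Z)
mmd_direct = np.sum((Z[:ns].mean(axis=0) - Z[ns:].mean(axis=0)) ** 2)
```

Since M_0 = e e^T, tr(Z^T M_0 Z) = ||Z^T e||^2, which is exactly the squared norm of the mean difference.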

Methodology
In this section, we describe the proposed method in detail. First, the framework of our approach is introduced, followed by a detailed description of the proposed alternating discriminant optimization (ADO) and of how MMD is used in our method. Finally, we summarize the overall optimization problem.

Framework of Discriminative Sparse Filtering
In this paper, we aim to learn features that are both discriminative and domain-shared. Our model consists of two parts: feature transformation and loss function construction. Using the notations defined in Table 1, the feature transformation can be described as follows.
Step 1. Linear feature extraction. Let f_act denote the selected activation function. We use the soft absolute function in this paper, f_1 = f_act(XW) = sqrt((XW)^2 + ε), where ε denotes a small number, such as 1e-5.
Step 2. Solving high dispersal. In f_1, each row represents a sample and each column represents a feature, so this step performs an l2 column normalization, f_2 = f_1 • M_c, where M_c collects the reciprocal column norms and • denotes the Hadamard product. It is worth noting that we use within-domain normalization instead of cross-domain normalization (i.e., normalizing with all samples from both domains together). The idea is to force each feature to have similar statistical properties in the two domains by setting its l2-norm to 1 in each domain. As a consequence, a given feature should (a) have similar statistical properties in the different domains and (b) be distinguishable over samples within the same domain.
Step 3. Solving population sparsity. Just like the previous step, this step performs an l2 row normalization, f_3 = f_2 • M_r, where M_r collects the reciprocal row norms.

Table 1. Notations and descriptions used in this paper.

Notation: Description
W: the transformation matrix to be solved
W_lr: weight matrix for alternating discriminant optimization
L_target: objective function for sparsity in the target domain
L_source: objective function for alternating discriminant optimization in the source domain
L_mmd: objective function for domain discrepancy
α, β: the balance factors among the three objectives

Notice that steps 2 and 3 do not change the dimension of the samples; we can regard them as a specific activation. Based on these descriptions, the transformation from the initial data X to X̂ is summarized as X̂ = f(X). The loss function can be described as

L = α L_target + β L_source + L_mmd,

where L_target represents the sparse loss on the target domain, L_source represents the discriminative loss on the source domain, and L_mmd denotes the MMD loss between the source and target features. α and β are parameters that balance the three objectives. Obviously, L_source and L_mmd correspond to the two goals presented in [8]. Furthermore, we require the target-domain features to be discriminative. A graphical illustration of the framework is shown in Figure 1. Given raw pictures from the source and target domains, we first extract their vectorized features with pre-trained deep models, e.g., AlexNet and ResNet; it is worth noting that we do not employ any fine-tuning. They are then reduced in dimension by a linear transformation matrix W and passed through steps 1-3. The objective constructed on the learned representation is divided into three parts: (1) target-domain sparsity, (2) source-domain discriminability, and (3) domain discrepancy, detailed in the following subsections.
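The feature transformation of steps 1-3, with the within-domain column normalization emphasized in step 2, can be sketched as follows (an illustrative NumPy reading of the text; the function name and ε default are ours):

```python
import numpy as np

def transform(W, Xs, Xt, eps=1e-5):
    """Feature transformation sketch: linear map, soft-absolute
    activation, then l2 column normalization computed WITHIN each
    domain (step 2), followed by per-sample row normalization (step 3)."""
    def one_domain(X):
        F = np.sqrt((X @ W) ** 2 + eps)                   # step 1
        F = F / np.linalg.norm(F, axis=0, keepdims=True)  # step 2, per domain
        F = F / np.linalg.norm(F, axis=1, keepdims=True)  # step 3
        return F
    return one_domain(Xs), one_domain(Xt)

rng = np.random.default_rng(0)
W = rng.standard_normal((20, 5))
Xs = rng.standard_normal((40, 20))
Xt = rng.standard_normal((60, 20)) + 0.5   # shifted target domain
Fs, Ft = transform(W, Xs, Xt)
```

Normalizing each domain separately (rather than stacking Xs and Xt first) is what forces every feature to have a unit l2-norm in both domains, as stated in step 2.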

Target Domain Sparsity: Sparse Filtering
In order to obtain discriminative features, we first need an indicator to evaluate the impact of the current features on classification. According to the theory in [32], classification error is the most effective evaluation criterion for feature selection. The specific process is to establish a classifier using the existing features and labels, and then take the classifier error as the discriminant index of the current features. However, there are no labels in the target domain in unsupervised domain adaptation, which makes it difficult to extract discriminative structures. Since sparse filtering has achieved remarkable results in many areas, we introduce it here for the target domain.

Source Domain Discriminability: Alternating Discriminant Optimization
For the labeled source domain, we can establish a classifier using the transformed features f(X_s) and the labels Y_s directly. Unlike heuristic feature selection, we hope to solve for the optimal transformation matrix jointly with the sparsity of the target-domain data, which requires an indicator that can be optimized using gradient information. Classification models whose parameters are solved in an iterative manner (e.g., neural networks and SVMs) are therefore no longer applicable. In this paper, we use the mean squared error and obtain discriminative features by alternately optimizing the two parameters.
Suppose that we have source features F_s = f(X_s) ∈ R^{n_s×k} and labels Y_s. To measure the discriminability of the features, we need W_lr ∈ R^{k×1} to map the features to the label space; here, we use a linear mapping because it can be solved without multiple iterations. The objective can be described in mathematical form as

L_source = ||F_s W_lr − Y_s||_2^2,

i.e., a linear regressor maps the data to the label space by means of a transformation matrix W_lr, so we actually perform linear regression in the feature space. As mentioned earlier, classification error is the most effective index for feature selection, but it is an l0-type criterion and is troublesome to optimize with gradient information. Here, we relax the constraint to l2, which is equivalent to linear regression. In short, we measure the discriminability of the features by an l2 constraint and optimize it with gradient descent.
For this two-variable (W and W_lr) optimization problem, it is hard to optimize both parameters simultaneously, so we borrow the idea of the alternating direction method of multipliers (ADMM) [33]. W transforms the original data to the feature space, where we construct a linear regressor by means of W_lr. At each iteration, we first solve the linear regressor via the normal equation, then update W by the chain rule and gradient descent. The specific process is shown in Algorithm 1.

Algorithm 1: Alternating Discriminant Optimization.
Input: X_s ∈ R^{n×m}, Y_s ∈ R^{n×1}
Output: W = argmin_W L_source
Initialize W ∈ R^{m×k}
while stopping conditions are not satisfied do
    Solve W_lr by the normal equation with W fixed
    Update W by the chain rule and gradient descent with W_lr fixed
end while

The main idea of ADO is to find the optimal classifier parameters W_lr for each generation of input features, and then optimize the mapping function W based on them. Over the iterations, the mapped features attain a smaller regression error with the optimal W_lr. In Figure 2 we show how ADO solves the XOR problem. Specifically, we set four examples, class zero (denoted by blue diamonds): [0,1], [1,0], and class one (denoted by red circles): [0,0], [1,1], which cannot be divided by a single line. ADO computes the optimal decision boundary, then learns a nonlinear mapping to minimize the classification error. As the figure shows, the samples are mapped into another two-dimensional feature space where they are linearly separable. Correspondingly, we can formulate L_source as

L_source = ||f(X_s) W_lr − Y_s||_2^2.
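The alternating loop of Algorithm 1 can be sketched as follows. This is a simplified NumPy illustration on random data with a purely linear feature map (the paper's version uses the nonlinear activation and normalization steps, and adds the sparsity and MMD terms); all sizes and names here are ours.

```python
import numpy as np

# Toy labeled "source" data (hypothetical sizes)
rng = np.random.default_rng(0)
n, m, k = 100, 10, 4
Xs = rng.standard_normal((n, m))
Ys = rng.standard_normal((n, 1))
W = rng.standard_normal((m, k))          # transformation matrix to solve

losses = []
for _ in range(100):
    F = Xs @ W                           # source features (linear map only, for brevity)
    # Step A: optimal regressor W_lr via least squares (normal equation)
    W_lr = np.linalg.lstsq(F, Ys, rcond=None)[0]
    losses.append(float(np.sum((F @ W_lr - Ys) ** 2)))
    # Step B: gradient step on W with W_lr held fixed (chain rule)
    grad_W = 2.0 * Xs.T @ (F @ W_lr - Ys) @ W_lr.T
    W -= 5e-4 * grad_W
```

Each re-solve of W_lr can only decrease the loss for the current W, and each small gradient step decreases it for the current W_lr, so the regression error shrinks over the iterations.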

Domain Discrepancy: MMD
We have described how MMD works with a linear transformation, but there is a small change in our method. Previously, we mapped the data to the feature space by matrix multiplication (x → A^T x). The case here is more complicated (x → f(x)), but the idea is similar:
L_mmd = tr(f(X)^T M_0 f(X)), where X = [X_s; X_t] denotes the vertically merged data set.

Optimization
In this section, we give the detailed process for solving the three objectives.

Optimization of L target
This is the same as applying sparse filtering to the target-domain data, i.e., minimizing the elementwise l1-norm of the normalized target features f(X_t).
At each iteration, we update M_r and M_c and then use the updated parameters to calculate the gradient. Notice that we do not give the specific derivative of the selected activation function (the soft absolute function); a more general form of the problem is given here, and other activation functions can also be used, such as sigmoid and tanh.

Optimization of L source
Based on the derivation in the previous section and the chain rule, we have

∂L_source/∂F_s = 2 (F_s W_lr − Y_s) W_lr^T,

where W_lr = (F_s^T F_s)^{-1} F_s^T Y_s represents the analytic (normal-equation) solution of the linear regression applied to the source features.

Optimization of L mmd
We give the derivation of ∂L_mmd/∂f(X); the rest follows the chain rule as before. Since X consists of X_s and X_t, Equation (17) provides the gradients with respect to f(X_s) and f(X_t). It is worth noting that M_r and M_c differ between the two domains, so the gradients should be computed separately.
Given these, we can update W with W ← W − α ∂L_target/∂W − β ∂L_source/∂W − ∂L_mmd/∂W; the flowchart can be found in Figure 3.
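When implementing chain-rule derivations like the ones above, a finite-difference checker is a convenient safeguard. The sketch below (names ours) verifies an analytic gradient against central differences on a simple quadratic whose gradient is known in closed form; the same helper can be pointed at L_target, L_source, or L_mmd.

```python
import numpy as np

def numeric_grad(loss_fn, W, h=1e-6):
    """Central-difference gradient of a scalar loss with respect to W;
    useful for checking analytic gradient derivations entry by entry."""
    g = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += h
        Wm[idx] -= h
        g[idx] = (loss_fn(Wp) - loss_fn(Wm)) / (2 * h)
    return g

# Sanity check: loss(M) = 0.5 * ||A M||^2 has gradient A^T A M
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
W = rng.standard_normal((4, 3))
loss = lambda M: 0.5 * np.sum((A @ M) ** 2)
g_num = numeric_grad(loss, W)
g_true = A.T @ (A @ W)
```

For a quadratic loss the central difference is exact up to floating-point error, so the two gradients should agree to high precision.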

Experiments
In this section, we introduce two data sets for domain adaptation and the experimental settings, then give the results. In addition, we provide an empirical analysis to show the robustness of the proposed method.

Office-Caltech10
The Office-Caltech10 data set was proposed in [12] and consists of four domains: AMAZON (A), CALTECH (C), DSLR (D), and WEBCAM (W). The images come from an e-commerce website (AMAZON), the Caltech-256 data set (CALTECH), a high-resolution digital camera (DSLR), and a low-resolution webcam (WEBCAM). Each domain has 10 classes of objects, including laptop, monitor, and so on.

ImageCLEF
ImageCLEF is an online competition for domain adaptation with three domains (Caltech (C), ImageNet (I), and PASCAL (P)) and twelve classes of objects.

Experimental Setting
Existing methods can be roughly divided into shallow and deep methods. Although our method does not use a deep architecture, we include several deep methods to illustrate its effectiveness. Following [11], we reduce the data to 100 dimensions with our method and then use a 1-nearest-neighbor classifier. For our method, we set α = 0.1 and β = 1e-5.
The selected state-of-the-art methods are JDA, CAPLS, MASF, SPL, and GSMAX. Following the experimental settings of JDA and MASF, we set the subspace dimension k = 100. For JDA, we set the regularization coefficient λ = 1 and the number of iterations T = 10. For CAPLS, we set the number of iterations T = 10. For MASF, we set the regularization coefficient α = 1e-3. For SPL, we set the number of iterations T = 11. For GSMAX, we set the regularization factor to 1e-5. It is worth emphasizing that the input features are extracted by deep networks without fine-tuning, and no pre-processing strategy is applied in the experiments.
Following the setting of [35,36], we report the classification accuracy on the target data as the evaluation metric,

Accuracy = 100 × |{x : x ∈ X_t, ŷ(x) = y(x)}| / |X_t|,

where ŷ denotes the predicted label and y the true label, so 0 ≤ Accuracy ≤ 100.
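The metric is straightforward to compute; a minimal NumPy helper (name ours):

```python
import numpy as np

def accuracy(y_pred, y_true):
    """Classification accuracy on the target domain, in percent."""
    y_pred = np.asarray(y_pred)
    y_true = np.asarray(y_true)
    return 100.0 * np.mean(y_pred == y_true)
```

For example, accuracy([1, 2, 3, 0], [1, 2, 3, 3]) gives 75.0.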

Implementation Details
(1) Initialization. We find that setting the initial values near 0 significantly improves convergence. In this paper, we draw them from N(0, 1) × 0.001, where N(0, 1) denotes the standard Gaussian distribution. We fix the random seed to 0 (in MATLAB) for reproducibility.
(2) Gradient descent. We set the maximum number of iterations to 200 and the step size to 0.1.

Results
In this section, we report the accuracy of the proposed method (abbreviated DSF, for discriminative sparse filtering) and other state-of-the-art works; the results are shown in Table 2, and a detailed comparison can be found in Table 3. The difference in sample numbers, also referred to as class weight bias, is a fundamental problem when measuring distribution differences. Existing measurements, e.g., MMD and CORAL, employ first-, second-, or higher-order moments to quantify distribution differences, which assumes that the source and target data share the same class weights; however, this assumption does not always hold (as in Office-Caltech10). Nevertheless, our method still yields good classification results, for two reasons: (1) the class weight biases are not so severe as to cause a catastrophic accumulation of errors.
(2) There are other regularizations, i.e., the proposed ADO and sparse filtering. The ideal features should be both domain-shared and discriminative, so the negative effects can be further suppressed. Another interesting phenomenon is that the results change after swapping the order of the two domains; this can be explained by information asymmetry. Imagine two sets A and B with A ⊂ B. If we choose B as the training set and A for testing, the model achieves satisfactory performance; if the order is reversed, the model fails, since A cannot provide enough discriminative power.

Ablation Study
For a better understanding of the proposed method, we conduct an ablation study to analyze how the different components contribute to the final performance. Since some domains of Office-Caltech10 have too few samples, e.g., 157 images in total for DSLR, we use ImageCLEF only for the ablation study. Compared to the original sparse filtering, we propose two strategies, i.e., MMD for distribution matching and ADO for source discrimination. By combining the two elements we can construct 2^2 = 4 experiments. We use ✓ and ✗ to denote whether a component is enabled; e.g., MMD (✓) + ADO (✗) indicates MMD-regularized sparse filtering. As Table 4 shows, when both components are activated, the method achieves the highest average performance; adding either single component also improves the final prediction.

Parameter Sensitivity Analysis
In this paper, we introduce two parameters, α and β, to balance the three parts of our objective. α is the coefficient of target sparsity; we hope to preserve the latent structure of the target samples by constraining their sparsity. Similarly, β indicates how much we care about source discriminability. We consider two regimes.
(1) α ↓, β ↓. In this situation, we do not care about the discriminability of either domain. All we need is to reduce the domain discrepancy by reducing the MMD loss, which is similar to TCA [9].
(2) α ↑, β ↑. Extending beyond TCA, we hope to obtain discriminative representations while reducing the domain discrepancy. However, if the two factors are too large, we cannot learn transferable knowledge across domains. As Figure 5 shows, α increases from front to back and β increases from left to right. The highest peak occurs in the middle of the surface, which shows that the two proposed strategies are both necessary. When α/β becomes too large (the right and rear of the surface), the accuracy decreases sharply, since we pay too much attention to feature discriminability while ignoring the fundamental problem, i.e., distribution matching.

Running Time
Using the given notations, the computational cost is as follows: max(O(n_s·k^3), O(n_s·m·k)) for solving L_source, O(n_t·m·k) for L_target, and O((n_s+n_t)^2·k) for L_mmd. In summary, with T the number of iterations, the overall computational complexity of the algorithm is T · max(O(n_s·k^3), O(n_s·m·k), O(n_t·m·k), O((n_s+n_t)^2·k)).
In this section, we record the running time (feature extraction + classification with NN) of the previous experiments. All algorithms are implemented in MATLAB 2017a and executed on a Windows PC with an Intel Core i7 CPU at 3.6 GHz and 8 GB RAM. Table 5 shows the results. We can see that the proposed method is faster than most of the other works in average running time, especially CAPLS and SPL.

Discussion
In this section, we discuss the influence of different gradient-based optimization methods on the proposed framework.

Mini-Batch versus Full-Batch
In the previous sections, we showed how to apply gradient descent to optimize the proposed method, which requires all data (n_s + n_t) for each update. However, real-world applications may involve such large amounts of data that the computer cannot handle the full computation. Consequently, stochastic gradient descent (SGD), which adopts a subset (k ≪ n_s + n_t) of the data during each iteration, is necessary. In this section, we analyze how mini-batch-based optimization may affect our method, both theoretically and practically.

Implementation of Mini-Batch-Based Optimization
Mini-batch SGD randomly selects a subset of samples to calculate gradients rather than using the whole data set. Similarly, we can solve the proposed framework with mini-batch SGD. First, we select samples from the source and target domains (MiniX_s, MiniX_t) separately, since MMD needs data from both domains; the batch size is determined by the available computational resources. Then, by treating the two mini-batches (MiniX_s, MiniX_t) as X_s and X_t, we can update the parameters using gradient descent (as shown in Section 3.5, Optimization).
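The per-domain sampling step can be sketched as follows (a NumPy illustration; function name and sizes are ours). Sampling each domain separately guarantees that every mini-batch contains both source and target samples, so the MMD term remains well-defined within each update.

```python
import numpy as np

def sample_minibatches(Xs, Xt, batch_s, batch_t, rng):
    """Draw independent mini-batches from each domain so that the
    MMD term can still compare source and target statistics
    within every update."""
    idx_s = rng.choice(len(Xs), size=batch_s, replace=False)
    idx_t = rng.choice(len(Xt), size=batch_t, replace=False)
    return Xs[idx_s], Xt[idx_t]

rng = np.random.default_rng(0)
Xs = rng.standard_normal((500, 100))
Xt = rng.standard_normal((800, 100))
mini_s, mini_t = sample_minibatches(Xs, Xt, 64, 64, rng)
```

The two mini-batches then play the roles of X_s and X_t in the full-batch update rule.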

Influence of Mini-Batch SGD
Here, we analyze how mini-batch SGD affects the proposed method. Since we use random mini-batches instead of the full batch, the sampling error cannot be ignored. Sparse filtering (corresponding to L_target) is an unsupervised feature extraction method that requires data diversity. In the extreme case, suppose a mini-batch contains data from only one class: sparse filtering tries to extract distinguishable features, i.e., to make samples from the same class different, which is counterintuitive. Alternating discriminant optimization (corresponding to L_source) learns discriminative features from labeled source samples; if all samples in the mini-batch belong to the same class, it outputs meaningless gradients. MMD (corresponding to L_mmd) measures the domain discrepancy with first-order statistics, and the sampling error appears as the gap between the mean of the mini-batch and that of the full batch (mean(MiniX_s) ≠ mean(X_s), mean(MiniX_t) ≠ mean(X_t)). In summary, using mini-batch SGD leads to performance degradation, and the degradation grows as the batch size shrinks. As Figure 6 shows, the proposed method achieves higher accuracy as the batch size becomes larger, as does the average accuracy. It is worth emphasizing that mini-batch-based optimization is not time-efficient; in fact, it often costs more time, because a smaller step size and more iterations are needed, since a mini-batch provides a biased estimate of the whole data set. It is useful when the computer cannot process the full data set at once; in other words, it trades time for space.

Conclusions
In this paper, we propose a novel feature extraction method for unsupervised domain adaptation, which consists of three parts: (a) since the target domain has no labels, sparse filtering is introduced to capture its intrinsic discriminative structure; (b) for the labeled source domain, we propose alternating discriminant optimization to directly model the relation between the learned representation and the labels, and a toy experiment on the XOR problem shows its validity; (c) we integrate MMD into the framework to reduce domain discrepancy, and a unified optimization based on gradient descent is presented. Extensive experiments show that the proposed method is comparable or superior to existing methods. Furthermore, we give a mini-batch-based optimization framework so that the proposed method can be applied to large-scale problems. In the future, we plan to study how different metrics measure domain discrepancy.