Exploring Inter-Instance Relationships within the Query Set for Robust Image Set Matching

Image set matching (ISM) has attracted increasing attention in the field of computer vision and pattern recognition. Some studies attempt to model query and gallery sets under a joint or collaborative representation framework, achieving impressive performance. However, existing models consider only the competition and collaboration among gallery sets, neglecting the inter-instance relationships within the query set which are also regarded as one important clue for ISM. In this paper, inter-instance relationships within the query set are explored for robust image set matching. Specifically, we propose to represent the query set instances jointly via a combined dictionary learned from the gallery sets. To explore the commonality and variations within the query set simultaneously to benefit the matching, both low rank and class-level sparsity constraints are imposed on the representation coefficients. Then, to deal with nonlinear data in real scenarios, the‘kernelized version is also proposed. Moreover, to tackle the gross corruptions mixed in the query set, the proposed model is extended for robust ISM. The optimization problems are solved efficiently by employing singular value thresholding and block soft thresholding operators in an alternating direction manner. Experiments on five public datasets demonstrate the effectiveness of the proposed method, comparing favorably with state-of-the-art methods.


Introduction
Image set matching (ISM) or image set classification, which regards one set of images as a sample, has recently attracted considerable attention due to its widespread applications such as video-based face recognition, multi-view object recognition, dynamic scene classification [1][2][3][4][5][6][7][8][9][10]. Compared with traditional single image based classification, ISM has the merit of incorporating information from multiple images of one set to provide all-sided description for a subject and thus achieve better matching accuracy.
Most of the existing ISM methods roughly fall into two categories, parametric and non-parametric. Parametric methods [11,12] try to model image sets with certain statistical distributions and then evaluate the similarities among those distributions. Non-parametric methods tend to assume that the image sets have underlying linear or nonlinear geometric structures and construct global set models, such as subspaces [2,13,14], manifolds [6,15,16] and affine/convex hulls [8,9,[17][18][19][20][21], etc. Non-parametric methods have shown superior performance in the cases when the data distributions do not meet the models estimated by the parametric methods. Recently, some methods [1,10,22,23] have presented various further improvements following conventional non-parametric methods. Ref [22] proposes Nonlinear Subspace Feature Enhancement (NSFE) for nonlinearly embedding image sets into a space where they adhere to a more discriminative subspace structure. Ref. [23] combines multiple manifolds as the features of the original image sets, utilizing well-studied Riemannian kernels to map the original Riemannian spaces into high dimensional Hilbert spaces. In [1], aiming at alleviating the handicap that over-large affine hull usually fails in mathcing when two hulls is overlapped, Learning of Reduced Prototypes and Local Metric (LRPLM) is proposed to acquire powerful discriminative ability. Ref. [10] uses the mean vector, subspace and covariance matrix which lie on different spaces (Euclidean or manifold) to jointly represent an image set, and develops a multimodel fusion metric learning (MMFML) framework to reduce the dissimilarity between the heterogeneous spaces.
Many of the methods mentioned above, however, describe the query and gallery sets separately, and then measure the query-to-gallery distances/similarities using nearest neighbor or subspace classifiers. Firstly, they do not explore the relationships among the gallery sets which are helpful for improving ISM accuracy. Secondly, such a way of modeling separately followed by matching the models is less straightforward, or even unfair in terms of distance measurement. Finally, existing methods often depend heavily on the extracted features, i.e., different feature extraction techniques make big difference to the final matching performance. In recent years, deep learning-based image set classification methods [3,7,[24][25][26] are proposed where feature extraction plays a significant role in the deep neural networks. Ref. [27] models a set of Convolutional Neural Network (CNN) features (as inputs) by a convex cone and measure the geometric similarity of convex cones for image set classification. Even some deep learning-based methods need to use extra hand-crafted features, e.g., in the method of Deep Reconstruction Models (DRM) [3] , the first step of learning DRM is to compute Local Binary Pattern (LBP) features [28].
Sparse Representation has been very popular in many fields such as image classification [29,30], dictionary learning [31], color image restroation [32], recovery of remote sensing contaminated products [33], missing information reconstruction of remote-sensing images [34,35]. Sparse representation classification (SRC) [29] and Collaborative Representation Classification (CRC) [30] are two well-known classifiers which can bridge the testing sample and the training samples together under a unified framework, achieving impressive performance for single image based classification. In the unified framework, SRC or CRC implicitly describes the relationships (competition and collaboration) among the training samples while measuring directly the distances/similarities between testing and training samples. Furthermore, such methods are not sensitive to the feature extraction techniques, i.e., whatever kind of extracted feature is fed to the models based on SRC or CRC leads to good performances. Studies [36,37] either utilize the mean of the images within the query set or choose a candidate image by clustering the query set to conduct basic (single image based) SRC for image set based applications, but the basic SRC used can not exploit the rich information involved in image sets, especially in the query sets.
To accurately measure the distances between the query and gallery sets, some methods attempt to project all the sets into a collaborative or joint sparse representation framework for image set classification [19,38,39] and multimodal/multi-task recognition [40,41]. Image set based collaborative representation and classification (ISCRC) [19] describes the query set as a convex or regularized hull and represents the hull collaboratively over all the gallery sets. Group Collaborative Representation (GCR) [39] takes advantage of the relationships among gallery image sets to capture the inter-set and intra-set variations, and determines the characteristic subspaces of all the gallery sets. Methods [38,40,41] jointly represent the query set (or multiple testing features) over the gallery sets (or multiple training features) with different constraints imposed on the representation coefficients. These methods mainly explore the relationships (competition and collaboration) among different gallery sets, and even correlations and variations within gallery sets, without carefully looking into the inter-instance relationships within the query set. However, the commonality and variations within the query set are also important clues for accurate ISM tasks. In addition, existing methods are not robust to corruptions of large magnitude mixed in the query sets. Once the query sets are grossly corrupted by outliers or partial perturbations of large magnitude (e.g., occlusions), the performances of many methods will degrade drastically.
Motivated by the above insights, in this paper, we propose a new joint representation model to highlight the inter-instance relationships within the query set for improving ISM. To emphasize the inter-class discrimination among gallery sets, a class-level sparsity constraint is imposed. While for the gallery set, it is exactly the class-level sparsity constraint that implicitly connives even enlarges the variations across the instances within the query set. Therefore a constraint is imposed on the representation model by low rank regularization, to eclectically embody the commonality among instances within the query set. The combination of the two constraints explores fully both the inter-set and intra-set relationships among the query and gallery sets. The main contributions of this paper are summarized as follows: • A joint sparse representation model with class-level sparsity constraint is chosen for ISM problem and then a low rank regularization is added to reveal thoroughly the intra-set and inter-set relationships to improve the ISM performance. To deal with nonlinear data in real scenarios, the kernelized variant of our method is presented. • For the problem of grossly corrupted data encountered in real scenarios, which is rarely mentioned in existing ISM methods, the proposed model is extended to its robust versions to tackle different types of gross corruptions. • The optimization challenge is solved efficiently by employing singular value thresholding and block soft thresholding operators in an alternating direction manner. The optimization algorithms for the kernelized and extended versions are modified accordingly. • Experiments on five public datasets demonstrate that the proposed method compares favorably with competing state-of-the-art methods.
The remaining of this paper is organized as follows. Section 2.1 reviews some related work, comparing and analyzing several different joint representation models. Section 3 proposes the improved joint representation model and its kernelized version, and describes the optimization procedures for solving the objective functions. Robust extension of the proposed model and the corresponding optimization are presented in Section 4. Experimental results and analyses are showed in Section 5. Section 6 concludes this paper.

Joint Representation
Joint representation with sparsity induced by l 1 -norm, which is extended from SRC [29], inherits the latent discriminability resulting from the sparsity of representation coefficients. There are many images in each query set for ISM problem, hence we can consider to use this joint sparse representation to bridge query and gallery sets together. Specifically, the joint sparse representation model can represent the images within the query set via training images with joint sparsity constraint imposed on the representation coefficients. As shown in Figure 1a, given feature matrix X = [X 1 , ...X c , ..., X C ] from gallery sets of C classes, where X c ∈ R d×n c stands for n c images of class c ∈ {1, . . . , C}. Each image is represented with d dimension feature. The feature matrix Y ∈ R d×n q of n q images in a query set can be represented as linear combination of the gallery set features X , namely Y = XW. Then the joint representation coefficient W can be obtained by solving the following problem: where λ is a positive parameter, • F denotes Frobenius norm of a matrix and W 1 means the l 1 -norm sparsity constraint that is imposed on the joint representation coefficients. This l 1 -norm regularized model, whose representation coefficients are illustrated in Figure 2a, is just a trivial extension of basic sparse representation model. The naive sparsity constraint embodies latent discrimination to some degree, without reflecting clearly any intra-set or inter-set relationships. For ISM problem, inter-instance relationships within the query set is one important clue for ISM, so that more meaningful constraints should be considered to reveal real between-data relationships.

Joint Representation with Row-Level Sparsity
Recently, a joint representation model with row-level sparsity constraint has been proposed in the field of multimodal biometric recognition [40], which is formulated as follows: where W 1,2 is a mixed norm defined as W 1,2 = ∑ n k=1 W k 2 , and W k is the K-th row vector of W. The l 1,2 -norm is a convex relaxation of l 0,2 -norm, which applies l 2 -norm on each row of and then applies l 0 -norm on the resulting vector. The representation coefficients acquired by solving Equation (2) will present the row-level sparsity, as shown in Figure 2b. The nonzero rows show that the nonzero values in different coefficient column vectors share the same group of indices, meaning that every column vector of Y gets expressed as the linear combination of the same group of atoms (column vectors) of X. When applying this model for ISM, the coefficient column vectors representing different images within the query set Y will share the same sparsity pattern, highlighting the inter-instance similarity within the query set. However, it ignores the variations across the images within the query set, which will distort the intrinsic relationships within the query set and degrade the matching accuracy.

Joint Representation with Class-Level Sparsity
Similarly, another joint representation model with class-level sparsity constraint is proposed in the field of multi-task learning [41], which is formulated as follows: where W 1,CF is a mixed norm defined as W 1,CF = ∑ C c=1 W c F , and W c ∈ R n c ×n q denotes the components of W associated with the gallery features X c of class c . Similarly, the l 1,CF -norm defined here is a convex relaxation of l 0,CF -norm, [ W 1 F , ..., W c F , ..., W C F ] 0 . The representation coefficients acquired by solving Equation (3) will present class-level sparsity, i.e., the nonzero values will group to a few classes. Ideally, as shown in Figure 2c, all the nonzero coefficients should fall into the block associated with the class Y belongs to. In this case, the indices of nonzero values in different coefficient column vectors locate in the same class block but could distribute in rather different positions. Hence, every column vector of Y gets expressed as linear combination of the same classes of but maybe different atoms of X. When applying this model for ISM, the class-level sparsity induced by the l 1,CF -norm constraint employs the class labels as prior supervised information. However, it maybe also connive even enlarge the variations across the images within the query set Y, which will impede the further performance improvement.  (1)): simple joint sparse representation without highlighting any intra-set relationships; (b) Row-level sparsity (as in Equation (2)): the sparse representation vectors share a exactly the same sparsity pattern, highlighting inter-instance similarity within the query set; (c) Class-level sparsity (as in Equation (3)): ideally, the non-zero coefficients fall into a same class block but maybe distribute in disorder, conniving large variations among the images within the query set ; (d) The proposed class-level sparsity with low rank regularization (as in Equation (4)): the non-zero coefficients fall into a same class block and distribute in order with slight anomaly, revealing both the commonality and the variations across images within the query set.

The Proposed LRCS Model
In terms of gallery sets, since the class-level sparsity induced by l 1,CF -norm constraint is able to employ the class label information to enhance inter-class discrimination, we prefer the class-level joint sparse characteristic for ISM task.
More importantly, the inter-instance relationships within the query set is also one important clue for ISM. In terms of query set, the l 1,2 -norm constraint highlights the intra-set commonality (similarity) excessively, without considering the variations across the images within the query set. In contrast, l 1,CF -norm constraint connives large variations across the images within the query set, leading to the disorder of the distribution of non-zero coefficients located in the target class ofŴ and thus weakening the correlations (commonality) among the images within the query set. To trade-off between the commonality and variations among the images within the query set, we consider imposing both low rank and class-level sparsity constraints on the representation coefficients for improving ISM. Based on the class-level sparse joint representation model, when the coefficients W is restricted to be low rank, Y is approximately close to XW which is also low rank, embodying the intra-set commonality and avoiding overlarge variations. Thus, Low Rank regularization on Class-level Sparse joint representation model (LRCS) is proposed, which can be formulated to solve the following minimization problem: where γ is a positive parameter controlling the low-rankness; and low rank constraint is convex relaxed into nuclear norm regularization, denoted by W * (equals the sum of singular values of matrix W). In real applications, the gallery set feature matrix X is often redundant and large, influencing the computational efficiency. Therefore, we compress feature matrix of each gallery set via dictionary learning method [31]. Finally, each class of feature matrix X c can be replaced with a learned class-specific dictionary D c . By replacing X with D = [D 1 , ...D c , ..., D C ], Equation (4) is changed to: This model inherits the merits of unified model. Thus, not only the relationships (competition and collaboration) among gallery sets are described implicitly but also the distances between query and gallery sets can be measured directly. Furthermore, the combination of low-rankness and class-level joint sparsity constraints is exactly to trade off the commonality and variations among the query set. The representation matrix recovered via Equation (5) are class-level sparse, and the coefficients within the target class block are distributed orderly with slight anomaly, as illustrated in Figure 2d.
Once the coefficient is obtained, the class label for the whole query set is predicted according to the smallest reconstruction error: whereŴ c denotes the components of the recoveredŴ associated with the dictionary D c .

The Optimization to Solve LRCS
In this section, we employ singular value thresholding and block soft thresholding operators alternately to solve the optimization problem of the proposed LRCS method. Firstly, an auxiliary variable P is introduced to reformulate the objective function: Then, the equality constraint is augmented into the objective function: where T is a Lagrangian multiplier matrix, µ is a positive parameter, and A, B = tr(A T B) . Problem (8) will be solved iteratively. In each iteration, W, P and T will be updated alternately while keeping the other variables fixed.
(1) Update W: the subproblem for updating W has the following form which does not have a closed-form solution. Inspired by strategy of LADMAP [42], we denote the smooth component of the above objective function (9) by q(W, P k , , then minimizing the objective function (9) can be replaced by solving the following problem: and ∇ W q W k is the gradient of w.r.t. W. As long as η k ≥ µ k + D 2 2 , where • 2 is the spectral norm of a matrix, namely, the largest singular value, the above replacement is valid. Then problem (10) has a closed-form solution given by where Θ ε (A) = US ε (∑)V T is the singular value thresholding operator (SVT) [43], in which U ∑ V T is the singular value decomposition (SVD) of A.
(2) Update P: the subproblem for updating P has the following form Involving the l 1,CF -norm constraint, problem (12) can be solved by utilizing the proximal-gradient algorithm [44]: where P k+1 c is the c-th class of components of P k+1 , k + 1/2 denotes an intermediate status, and Equation (14) involves a block soft thresholding operator [44].
(3) Update T : Finally, the Lagrange multiplier matrix is updated as follows: The proposed algorithm is summarized in Algorithm 1. For faster convergence, µ can be adjusted using the adaptive updating strategy as shown in step 4 in Algorithm 1. The iteration process will terminate when the changes of the objective variables in two consecutive iterations are all below some threshold, and the difference between P k+1 and W k+1 reaches a preset threshold.

Algorithm 1 Low Rank regularization on Class-level Sparse joint representation model (LRCS) and its kernelized version (K-LRCS) for Image Set Matching (ISM)
Input: Learned dictionary D, query set feature matrix Y and balance parameters λ and γ. For the kernelized version, choose proper kernel and its corresponding parameters. Output: Representation coefficient matrix W.

Computational Complexity
To analyze the computational complexity of the proposed model LRCS, we look through the procedures in the optimization algorithm. For convenience, we denote the number of the atoms of the learned combined dictionary by n D , the number of instances within the query set by n q , the feature dimension by d and the average number of iterations by K. We can find that at each iteration of Algorithm 1, the dominant computational costs come from the computing of gradient and singular value thresholding in Equation (11). The first two terms D T DW k and −D T Y in the gradient in the step 1 in Algorithm 1 involve large computation, and −D T Y can be pre-computed, then the main cost for gradient estimation is O(Kn q n D d + n q n D d). The cost for the singular value thresholding is O(Kn D 2 n q ). Therefore, the computing of Equation (11) through K iterations requires the computation in the order of O(Kn q n D d + n q n D d + Kn D 2 n q ). The computational overload in the steps are negligible compared to that of step 1.

Kernelized LRCS for Nonlinear Data
The proposed linear model will not work well when confronted with nonlinear data in real scenarios. Hence, we extend it to kernel space. The kernel function K : R n × R n is defined as the inner product K(x i , x j ) = ϕ(x i ), ϕ(x j ) where ϕ is an implicit mapping, projecting the vector x into a higher dimensional space. The projected query set features can be written as ϕ(Y) = [ϕ(y 1 ), ϕ(y 2 ), ...ϕ(y n t )] and the dictionary of gallery sets can be represented as ϕ(D) = [ϕ(d 1 ), ϕ(d 2 ), ...ϕ(d n D )], n D is the total number of atoms. Applying the linear joint representation in the higher dimensional space, we have ϕ(Y) = ϕ(D)W . Similar to the linear case, we seek to solve the following optimization problem: which can be reformulated w.r.t kernel matrices aŝ where the kernel matrix K A,B is defined as K A,B (i, j) = ϕ(a i ), ϕ(b j ) , a i and b i are the i-th column of A and the j-th column of B respectively. The approach to minimizing the kernelized objective function (17) has almost the same optimization procedures as in the linear case, except for the computing of the gradient of the smooth components q(W, P k , T k ) in the procedure (1) to update W. We need to replace 1 2 Y − DW 2 F with 1 2 Trace W T K D,D W − 2K Y,D W and compute the corresponding gradient.
OnceŴ is obtained, ISM can be done by assigning the class label bŷ The kernelized version can deal with nonlinear data in ISM problem. We call Equation (17)  Initialization

Robust LRCS for Image Set Corruptions
When the query sets are corrupted heavily, the initially proposed model will degenerate just like most of the existing ISM methods. In this section, the proposed LRCS is extended to robust LRCS, performing against two different corruptions of large magnitude mixed in the query sets.
As shown in Figure 1b, in the case that the images within the query set are perturbed by gross corruptions, namely Y = DW + E, where E could be random sparse corruptions or image-specific corruptions, we should consider modifying the originally proposed model to different corruption types. As illustrated in Figure 3a, random sparse corruptions indicate that a fraction of random entries of Y are grossly corrupted, such as partial blurs or occlusions in some images within the query set. In other words, E is a spare error so that we consider adding a sparse constraint term E 0 to the original model. It can be convex relaxed into E 1 . Image-specific corruptions, e.g., outliers as illustrated in Figure 3b, indicate the phenomena that a fraction of images within the query set (i.e., some columns of Y ) are entirely corrupted. We consider adding a column-sparse constraint term to the original model, namely E 2,0 = # i : [E] :,i 2 = 0 , and convexly relax it into E 2,1 = ∑ i [E] :,i 2 . Based on the previous analyses, the extended robust model is formulated as follows: where β is a positive parameter controlling the degree of corruption, and • indicates a certain regularization. The l 1 -norm is used for characterizing the random sparse corruptions, and the l 2,1 -norm is for dealing with image-specific corruptions.
(a) (b) As for the optimization to solve the problem (19), one can still employ similar procedures introduced in Section 3.2. Specifically, by introducing an auxiliary variable then we can have the augmented Lagrangian form of the objective function as follows: where the problem (20) will be solved iteratively in which each iteration will be updated alternately while keeping the other variables fixed.
(1) Update W: this procedure is almost the same to the procedure (1) in Section 3.2, and we just need to replace 1 2 Y − DW 2 F with 1 2 Y − DW − E 2 F and go on subsequent computing. (2) Update P: this procedure is just the same to the procedure (2) in Section 3.2.
(3) Update E: the subproblem for updating E has the following form where if • is l 1 -norm, the solution to problem (21) can be computed by the soft thresholding operator S ε (x): if • is l 2,1 -norm, the solution to problem (21) can be computed via: (4) Update T : this procedure is just the same to the procedure (3) in Section 3.2.
Once W and E are obtained, ISM can be done by assigning the class label bŷ We call Equation (19) together with Equation (24) Robust Low Rank regularization on Class-level Sparse joint representation model (R-LRCS). The proposed robust algorithm is summarized in Algorithm 2. The complexity analysis of Algorithm 2 is analog to that of Algorithm 1 as stated in Section 3.3.

Experiments
In this section, extensive experiments on five datasets are run to demonstrate the efficacy of the proposed model and its extensions. The joint representation models with different constraints are first compared showing the significance of the proposed model. Then, the comparisons with other state-of-the-art methods are presented for different ISM tasks , and finally, the impressive robustness of R-LRCS to different corruptions is verified.

Datasets and Preprocessing
Honda dataset: The Honda dataset [12] consists of 59 face video sequences which are collected from 20 different subjects. In each video sequence, there are a number of frames ranging from 12 to 645. The face images in the frames are automatically detected,cropped, and resized to 20 × 20. Finally, each video is processed to get an image set. For each class (subject), one image set is randomly selected as the gallery set for training while the rest ones are the query sets for testing.
YTC dataset: The YouTube Celebrity (YTC) dataset [45] is collected from YouTube, including 1910 videos from 47 celebrities. The face images in this dataset show large variations in expression, illumination and pose, as shown in Figure 4. Moreover, since the image compression rate is high, the resolution and quality of the images are extremely low. We extract a maximum of clearly detected face images from each video and resize them to 30 × 30 to form an image set. The dataset is equally divided into five folds, and in each fold there are nine image sets for each subject, of which three are used for training and six for testing.

ETH-80 dataset:
The ETH-80 dataset [46] contains eight categories of objects: cups, dogs, horses, apples, cars, cows, tomatoes and pears, as shown in Figure 5. Each category is divided into ten subcategories, each of which involves plenty of images under 41 orientations with various brands or breeds are exhibited. The images are cropped and resized to 32 × 32. Each subcategory is processed to obtain an image set and therefore there are 80 sets in total. For each object, half of the sets are randomly selected for training and the remaining are used for testing.

UCSD Traffic dataset:
The UCSD Traffic dataset [47] contains 254 video sequences of highway traffic with varying patterns (i.e., light, medium and heavy) in various weather conditions (e.g., cloudy, raining, sunny). HoG features [48] are used to describe each frame. The experiments are performed using the splits provided with the dataset [47].
Maryland dataset: The Maryland dataset [49] contains 13 different classes of dynamic scenes with ten videos per class, such as avalanches and tornados. We use the last convolutional layer of the Convolutional Neural Network (CNN) model [50] as the frame descriptors. Then the dimensionality of the CNN output is reduced into 400 using Principal Component Analysis (PCA). We randomly choose seven image sets (videos) of each class for training and the remaining three for testing.
These datasets cover different sub-tasks of image set matching. The first two are for face identification, ETH-80 is for object categorization, and the last two are for scene classification. For all the datasets, the color images are converted to gray scale levels. In order to examine the anti-interference of various methods to illumination variations, histogram equalization is not used in the preprocessing stage for all the datasets. For Honda, YTC and ETH-80 datasets, we use the raw pixel values directly as features while employing different extracted features for the rest. All the features are normalized into vectors of unit one except for the query set features in the corruption experiments.

Experiment Setup
As for the experimental settings of the proposed method, we first compress each gallery set using a dictionary learning method such as KSVD [31] to a sub-dictionary with atoms of fixed number and stack them up to form a combined dictionary. The number is set as 50,10,20,5 on Honda, YTC, ML, Traffic datasets respectively. The gallery sets of ETH-80 are stacked up directly as a dictionary without compressing, as their sizes are already small. The balance parameters λ, γ, β are often chosen among three different orders of magnitude {0.1, 0.01, 0.001}. Detailed analyses on the sensitivity of parameters are described in Section 5.3. For the kernelized version of experiments, we choose the polynomial kernel.
We compare the proposed LRCS and its kernelized extension K_LRCS with some representative ISM methods, including several methods aiming at data nonlinearity, such as kernelized versions of several existing linear models and deep learning-based DRM [3]. Their abbreviations and names are shown in Table 1. For all the methods, their parameters are tuned for the best performances. Low rank regularization on row-level sparse joint representation model Specifically, for the MSM and linear AHISD, the thresholds for determining the subspace size are both set as 0.98. For the KAHISD, we set the bounds as −L = U = τ, where the value of τ is chosen to be 5. The upper bounds of the CHISD and KCHISD are set to U = 1. For the SANP, the same weighing parameters as in [9] are taken to implement convex optimization. For the RHISCRC, we choose the l 1 -norm regularized hull, and adopt the same parameter setting as in [19] for the RHISCRC and KCHISCRC. For the NN_H and NN_J, the bandwidth parameter of the KDE kernels are chosen among {0.01,0.1,1.0,10} for different datasets. For the PDL, GCR, DRM and KRCHISD, we have the same parameter setting as in their originally proposed papers respectively. For all the kernelized methods, we use the Gaussian or polynomial kernels. Ten-fold cross validation experiments are run for each kind of test, except on YTC dataset (five-fold) and Traffic dataset (four-fold).

Sensitivity of the Parameters
There are two positive parameters controlling the balance among different terms in the objective functions of the proposed model and its kernelized extension. λ controls the class-level sparsity while γ decides the low-rankness. Classification results of LRCS and K-LRCS with different parameter settings of λ and γ on the Traffic dataset are provided in Tables 2 and 3 respectively.  The kernelized version shows more stable average recognition rates than the LRCS model, but in general, both their recognition performances do not change much. When fixing a parameter and fine tuning the other, the best performance often occurs when the two parameters are in the same order of magnitude. This implies that both of the class-level sparsity and low-rankness are at work, and they are complementary.

Comparisons among the Different Joint Representation Models
To compare the performances of various aforementioned joint representation models, we choose Maryland dataset as a representative to run experiments, due to the abundance of its data quantity and diversity covering many potential data structures. For convenience, we denote these models by some abbreviations as shown in Table 1. We also add a low rank regularization to a row-level sparse joint representation model, denoted by LR+L12, to participate in the comparisons. Table 4 shows the recognition rates of various joint representation models with 10 times of testing on Maryland dataset. Form the table, it is observed that L1 has an ordinary performance since it does not highlight any relationship within the query sets. CS enhances the inter-class discrimination utilizing class label prior, so it is supposed to perform always better than L12. At test no. 1,4,5 and 10, however, the recognition rates of CS are not higher (even lower) than that of L12. The reason is that the non-zero coefficients recovered by CS distribute rather chaotically in the potential targeted class (as shown in Figure 2c, which connives even enlarges the variations across the images within the query set so that the inherent commonality (similarity) among the images is weakened. In fact, in terms of query set, L12 highlights similarity, implying extremely low rank, while probably CS has higher rank than the truth, due to the disordered distribution of the non-zero coefficients. In contrast to that, the proposed LRCS explores low-rankness with the class-level sparsity constraints. Realizing not only enhanced inter-class discrimination by class-level sparsity constraint, but also intra-set trade-off between commonality and variations by adding a low rank regularization, the proposed LRCS model performs better for ISM. Obviously, the performance of the LR+L12 model is inferior to that of LRCS, and it is also not better (even worse) than that of L12. The reason is that the row-level sparsity has already means extreme low-rankness. Adding a low rank regularization leads the coefficients to be far from the truth, bringing worse performance. In addition, there is no significant increase in the recognition rates of K_LRCS over that of LRCS except for the smaller standard deviation meaning higher stability. Probably because the CNN features in this dataset are relatively linearly tractable and thus the linear model LRCS has already been able to present competitive matching performance.

Comparisons with the State-of-the-Art Methods
As shown in Table 5, we compare the proposed method with the state-of-the-art in terms of ISM performance (average recognition rates and standard deviations) on the five benchmark datasets. For average recognition rates, both linear and nonlinear versions of the proposed model show the best or near the best performance on all the datasets in comparison to their respective categories of methods. Furthermore, the standard deviations of our proposed models are relatively small on each dataset, indicating the large stability of image set matching.
The KCHISCRC [19] method performs almost the second best. Just like the proposed models, it also belongs to the category of unified modelling which models query and gallery sets together and measures their distances directly. To some extent, this demonstrates that the superiority of unified modelling over other methods modelling separately each set. Another unified modelling method RHISCRC [19] also performs well on most datasets but unsatisfactorily on the rest, especially on Traffic. The reason is that the l 1 -norm constraints used both a and β [19] assume implicitly that only a few extremely similar images from the query and gallery sets participate in constructing hulls and modelling, where inter-class ambiguities easily cause misclassifications. Notice that sometimes the average recognition rates of K_LRCS has little advantage over (even is inferior to) that of LRCS, just as KAHISD vs. AHISD, and KCHISD vs CHISD. It can be inferred that whether the kernelization operations can help significant improvement of performance also relies on how much nonlinearity is embedded in the features used.
We also observe that many other methods perform well on some datasets while poor on the others. That is to say, their performances are not stable, depending on the datasets. For example, RT-LSVM [4] almost fails on the Maryland dataset. A probable explanation is that, as analyzed in [4], this method has rigorous limits to the relationship among the three: the number of class labels, the number of images within the query set, and the number of training images of each class, which Maryland cannot satisfy. In addition, the deep learning-based DRM method does not necessarily outperform other methods especially on the YTC and ETH-80 datasets. For the DRM method, the first step of its model is to extract the LBP features, but for fair comparisons, by removing the feature extraction step, we use the pixel values directly just as other methods on Honda, YTC, and ETH-80 datasets. Since only extracted features rather than LBP are provided on the Maryland and Traffic datasets, DRM code is not run on them. Maybe the mandatory remove of the LBP feature extraction procedure impairs the performance of DRM (the dependence of different methods on features will be further demonstrated in Section 5.6). Moreover, the small size of datasets limits the advantages of deep learning. The good news is, our proposed method shows impressive ability of dealing with nonlinearity, which is comparable to the deep model.
In terms of running time, for a normal image set, the average running time of our method varies on different datasets due to their different complexity (data quantity in every set, feature dimension, etc). For example, for LRCS, it is 1.68 s on Honda dataset while 3.38 s on YTC dataset on a PC with Intel(R) Core(TM) i7-3700 CPU and RAM 8GB. Such a running time is common compared to other methods for ISM problem [21].

Dependence on the Features
To investigate the dependence of different methods on features (or feature extraction techniques), we evaluate their recognition performance on YTC dataset using two kinds of features: Pixel values and LBP features. The results are shown in Figure 6. For the convenience of observation, various methods are sorted according to the recognition rates using pixel feature from the largest to the smallest. Obviously, the proposed models perform well consistently using whatever types of features, and the performance gaps between different features are small. Conversely, most of the methods compared depend heavily on specific kind of features, indicating that it is necessary to choose suitable feature extraction techniques for better performances. For example the DRM method even takes the LBP feature extraction as the first step of its model, the recognition performance drops dramatically once removing or replacing the LBP feature.

Robustness Comparisons
In this section, experimental results on the Honda and ETH-80 datasets are presented to investigate the robustness of different methods to random sparse corruptions and image-specific corruptions, respectively.
To test the robustness of different methods to random partial occlusions, we choose the Honda dataset for performance evaluation. Three occluded frames from three testing videos with partial occlusion are shown in Figure 7. Since the number of the occluded videos is small, we supplement some by taking some videos from the clean subset and adding random sparse noise onto their feature (raw pixel) matrices to simulate the occluded videos. The random sparse noise is a matrix, 10% elements of which are randomly distributed values larger than 2 while the rest are all zeros. This is sufficiently gross corruptions for the normalized clean feature matrix whose entries are all between 0 and 1. To test the robustness of different methods to image-specific corruptions (outliers), we choose the ETH-80 dataset for performance evaluation. The image-specific corruptions are generated by randomly replacing 20% columns of the clean feature matrix of each query set with random noise vectors as outliers. The original feature matrix of query set is full of two or three-digit integers while the outliers are anomalous vectors composed of decimals whose amplitudes do not exceed 3. So the outliers are entirely away from the feature space of query set. Then the corrupted query-set matrices are used for testing. Table 6 gives the recognition results of various methods on two datasets of different kinds of corruptions. It is clear that the proposed robust model achieves the best matching performance under two kinds of gross corruptions, with just a little degradation from the non-corruption case. It can be observed that the proposed method has the 8% and 6% performance improvement on Honda and ETH-80 datasets respectively, compared to the second best method RHISCRC [19]. The success of RHISCRC may come from the l 1 -norm constraints used on both a and β [19] which enforce that only a few extremely similar images in query and gallery sets participate in constructing hulls and modelling. Thus the corrupted instances are circumvented. In contrast to the poor performance of RHISCRC on the Traffic dataset mentioned in Section 5.5, we can see that the l 1 -norm used here has both advantages and disadvantages, depending on different situations of data distribution. Table 6. Recognition rates and standard deviations of various methods on partially occluded Honda and image-specific corrupted ETH-80 datasets.

Methods
Honda ETH-80 AHISD [18] 8.97/3.87 35.5/3.69 CHISD [18] 68.46/6.51 14.25/3.91 SANP [9] 7.69/1.71 15/5.77 NN_H [6] 72.82/4.22 12.5/0 PDL [8] 35.38/9.03 16/5.55 RT_LSVM [4] 57.95/5.43 12.75/0.79 NN_J [6] 82.05/6.4 78.75/5.68 RHISCRC [19] 87.69/5. Furthermore, we also compare the robustness of different joint representation models under the two kinds of corruptions. In order to compare with the proposed R_LRCS fairly, we add the constrained terms standing for corruptions to the other competing joint representation models. For convenience, 'R_' abbreviation is added to their names. The recognition performances are illustrated in Figure 8. It can seen that the proposed R_LRCS presents the best recognition rates all the time for two kinds of gross corruptions, indicating the superior robustness of our proposed model. It is also observed that in the majority of the test numbers, the two modified models with low rank regularization added, R_LRCS and R_LR+L12 both perform better than their respective single-constraint models, R_CS and R_L12. The proposed method takes advantage of the competition and collaboration not only within the gallery sets but also considering the inter-instance correlation within the query set. As such, the commonality and variations across the query-set images can be achieved simultaneously through the low rankness and class-level sparsity constraints on the representation coefficients.

Conclusions
In this paper, we proposed a joint representation model with both low rank and class-level constraints imposed on the representation coefficients to explore fully the inter-instance relationships within the query set for improving image set matching. Compared with joint representation models with other sparsity constraints, the model with class-level sparsity constraint is more appropriate for image set matching, due to its enhanced inter-class discrimination by class label prior. In addition, a low rank regularization is added to the class-level sparse joint representation model, thus both the commonality and variations within the query set are exploited for improving ISM performance. To deal with nonlinear data in real scenarios, the proposed model is then extended to the kernelized version. Furthermore, grossly corrupted data are often encountered in real scenarios, which is ignored by most of the existing ISM methods. For this problem, the proposed model is extended to its robust version to tackle the rand sparse and image-specific corruptions. The optimization problems of the proposed model and its extensions are solved by employing singular value thresholding and block soft thresholding operators to solve different variables in an alternating direction manner. Experiments on five public datasets demonstrate that the proposed method has better performance than the existing joint representation models with various constraints for the ISM problem and also compares favorably with other state-of-the-art methods.