Semi-Supervised Ridge Regression with Adaptive Graph-Based Label Propagation

In order to overcome the drawbacks of the ridge regression and label propagation algorithms, we propose a new semi-supervised classification method named semi-supervised ridge regression with adaptive graph-based label propagation (SSRR-AGLP). First, we present a new adaptive graph-learning scheme and integrate it into the procedure of label propagation, in which the locality and sparsity of samples are considered simultaneously. Then, we introduce the ridge regression algorithm into label propagation to solve the "out-of-sample" problem. As a consequence, the proposed SSRR-AGLP integrates adaptive graph learning, label propagation and ridge regression into a unified framework. Finally, an effective iterative updating algorithm is designed to solve the resulting optimization problem.


Introduction
Least squares regression (LSR) is a mathematical optimization technique that seeks the best matching function of the data by minimizing the sum of squared errors [1][2][3][4]. Since the advent of least squares regression, a large number of LSR-based methods have been proposed, such as weighted LSR [1], partial LSR [2], local LSR [3], kernel LSR [4], support vector machines (SVM) [5], non-negative least squares (NNLS) [6,7] and so on. Moreover, a series of methods based on LSR have been successfully and efficiently applied to face recognition, speech recognition, image retrieval, and so on [8][9][10][11][12][13][14]. For example, in order to improve the performance of retargeted least squares regression (ReLSR), Wang et al. [8] proposed the groupwise retargeted least squares regression (GReLSR) algorithm, which utilizes an additional regularization to restrict the translation values of ReLSR so that samples from the same class have similar values. To solve the device diversity problem in crowdsourcing systems, Zhang et al. [9] introduced a linear regression (LR) approach to obtain uniform received signal strength (RSS) values. In [10], an elastic-net regularized linear regression (ENLR) framework was developed, in which two particular strategies were proposed to enlarge the margins between different classes. To obtain orthogonal basis functions for the extreme learning machine (ELM) and improve the locality-preserving power of the feature space, Peng et al. [11] first imposed an orthogonality constraint on the output weight matrix of ELM and then formulated an orthogonal extreme learning machine (OELM) model. To force samples in the same class to have similar soft target labels, Yuan et al. [12] designed a constrained least square regression (CLSR) model for multi-category classification.
In [13], the authors proposed a new semi-supervised learning model named the semi-supervised graph learning retargeted least squares regression model (SSGLReLSR), which integrates linear least squares regression and graph construction into a unified framework. Based on the asymptotic bias and variance, a new method for multivariate predictor variables was proposed in [14]. To further improve the robustness and effectiveness of LSR, many discriminative LSR methods have been developed recently. For example, Xiang et al. [15] utilized the ε-dragging technique to design a general framework of discriminative least square regression (DLSR), and Zhang et al. [16] introduced retargeted LSR by learning a transformed regression. Moreover, in [17], a unified least squares framework was constructed to formulate many component analysis methods and their corresponding regularized and kernel extensions. In addition, some traditional dimension reduction techniques can also be cast in the LSR framework [18], such as principal component analysis (PCA) [19], linear discriminant analysis (LDA) [20], locality preserving projection (LPP) [21] and so on.
LSR is an unbiased estimation method and is very sensitive to noise. In addition, LSR is unstable when the sample size is smaller than the feature dimension [22,23]. For these reasons, ridge regression (RR) was proposed by adding a regularization term to the LSR model to improve performance and reduce computational complexity [22,23]. RR is an improved least squares estimator, which can be regarded as a biased estimation regression for collinear data analysis. Experimental results showed that the algorithm is effective and robust on "pathological" data [24]. In particular, its performance is more reliable than that of LSR in practical applications. Over the decades, many improved methods have been studied [25][26][27][28][29][30][31][32][33][34][35]. The authors of [29] studied a dual version of the ridge regression procedure, which can perform non-linear regression by constructing a linear regression function in a high-dimensional feature space. Xue et al. [30] presented a local ridge regression (LRR) algorithm to effectively handle illumination variation and partial occlusion in facial images. To deal with the singularity problem in extreme learning machine algorithms, [31] proposed an extreme learning machine ridge regression (ELMRR) learning algorithm with ridge parameter optimization. To reduce computation time while retaining statistical optimality, Zhang et al. [32] suggested a decomposition-based scalable approach to kernel ridge regression (KRR) [33], which randomly divides a dataset of size N into m subsets of equal size, computes an independent estimator on each subset, and then averages the local solutions into a global predictor. In order to improve training efficiency on large-scale data, [34] proposed an accelerator for kernel ridge regression algorithms based on data partitioning (PP-KRR). To address the limitation of the LSR method, which ignores the correlation among samples, Wen et al. [18] presented the inter-class sparsity based discriminative least square regression (ICS_DLSR) algorithm for multi-class classification. In order to improve the performance of dictionary learning, a locality-constrained and label embedding dictionary learning (LCLE-DL) [35] algorithm was proposed by considering both the locality of samples and label information.
The aforementioned approaches are supervised algorithms, which can make full use of labeled sample information but cannot adequately utilize the information of unlabeled samples. However, labeled samples are scarce since they need to be annotated manually [36]. Moreover, the cost of manually labeling samples is high in terms of manpower, material resources and energy. Hence, it is difficult to satisfy the requirements of real-life applications [37]. Conversely, unlabeled samples can easily be collected from the Internet, web chats, surveillance cameras and so on [38]. As a result, it is necessary to utilize the information of unlabeled samples to improve the performance of the ridge regression algorithm [39,40]. Rwebangira et al. [39] extended local linear regression to local linear semi-supervised regression (LLSSR) by adding manifold regularization. To reduce the distributed error and enlarge the number of data subsets using unlabeled data, Chang et al. [40] provided an error analysis for distributed semi-supervised learning with kernel ridge regression (DSKRR) based on a divide-and-conquer strategy. Although these semi-supervised ridge regression algorithms utilize the labeled and unlabeled data simultaneously, the distribution relationships between the labeled and unlabeled data are not considered at all.
Considering its computational efficiency and effectiveness, label propagation (LP) has attracted much attention in the study of graph-based semi-supervised learning [41,42]. The core idea of LP is to propagate the information of labeled data to unlabeled data by constructing a weighted undirected graph. In the past few years, many LP algorithms, such as Gaussian fields and harmonic functions (GFHF) [43] and local and global consistency (LGC) [44], have been proposed. These methods can employ both labeled and unlabeled samples during training, but their performance depends heavily on the underlying geometric structure of the original data distribution. In addition, they fail to build an explicit classifier for testing samples or newly arriving samples, and therefore cannot directly obtain the label information of the testing samples. To deal with the first problem, many sophisticated graph construction algorithms have been studied in recent years [45][46][47]. To some extent, they alleviate the limitations of traditional k-nearest-neighbor or ε-ball graphs, but their graph construction process is independent of the subsequent LP task. In other words, the graph structure is fixed during LP.
In this paper, a novel semi-supervised classification method named semi-supervised ridge regression with adaptive graph-based label propagation (SSRR-AGLP) is developed to overcome the drawbacks of existing algorithms. First, inspired by the great success of sparse representation and sparse-coding-based classification [48,49], an adaptive graph-based label propagation algorithm is presented to construct a graph dynamically. During graph learning, both the locality and sparsity of the data are considered simultaneously to optimize the graph structure and improve the performance of label prediction. Second, in order to adequately utilize the predicted label information of unlabeled samples and deal with the "out-of-sample" problem, the ridge regression algorithm is introduced into SSRR-AGLP. Finally, a simple and effective iterative optimization algorithm is designed to solve the proposed model. To evaluate the performance of the proposed SSRR-AGLP, five benchmark facial image databases (Yale, ORL, Extended YaleB, CMU PIE and AR) are employed in this work. By comparing the experimental results of SSRR-AGLP with those of some well-known related methods, the effectiveness and superiority of the proposed method can be justified. The flowchart of the proposed approach is illustrated in Figure 1.

The remainder of the paper is organized as follows. Least square regression (LSR), ridge regression (RR) and label propagation (LP) are reviewed briefly in Section 2. The proposed semi-supervised ridge regression with adaptive graph-based label propagation (SSRR-AGLP) is described in Section 3. The experimental results and analysis are illustrated in Section 4.
Finally, the conclusions and further work are presented in Section 5.

Related Works
In this section, the least square regression (LSR), ridge regression (RR) and label propagation (LP) algorithms are reviewed briefly.

Least Square Regression
Least squares regression (LSR) plays a very important role in machine learning; it seeks the best matching function of the data by minimizing the sum of squared errors [1,2]. The error is the difference between the predicted value and the real value. Suppose that X = [x_1, x_2, ..., x_n] ∈ R^(d×n) is a dataset, where n is the number of samples and d is the number of features of each sample. Each sample has a corresponding label vector y ∈ R^(c×1), where c is the number of classes.
For the LSR model, Y = [y_1, y_2, ..., y_n] ∈ R^(c×n) is defined as the class indicator matrix, where y_i is the label vector of the i-th sample. If the i-th (i = 1, 2, ..., n) sample belongs to the j-th (j = 1, 2, ..., c) class, then y_i = [0, ..., 0, 1, 0, ..., 0]^T ∈ R^(c×1), in which only the j-th element equals one. The optimization problem of LSR is formulated as

min_{A,b} ||A^T X + b 1_n^T − Y||_F^2,    (1)

where A ∈ R^(d×c) and b ∈ R^(c×1) represent the weight matrix and the bias vector, respectively. The objective of LSR is to find an optimal transformation that minimizes the error function in Equation (1). As described in [50], by appending a constant 1 to each sample and absorbing the bias into the weight matrix, the error function can be reformulated as

ε(W) = ||W^T X − Y||_F^2,    (2)

where W = [A; b^T] ∈ R^((d+1)×c) and X = [x_1, x_2, ..., x_n; 1_n^T] ∈ R^((d+1)×n). The optimal W can be obtained by minimizing the following function:

min_W ||W^T X − Y||_F^2.    (3)

Setting the derivative with respect to W to zero, we obtain the optimal transformation matrix

W = (X X^T)^(−1) X Y^T,    (4)

where (·)^(−1) denotes the matrix inverse operation.
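As an illustration, the closed-form solution in Equation (4) can be sketched in a few lines of NumPy; the data below are synthetic, and the variable names (`X_aug`, `pred`) are ours rather than the paper's.

```python
import numpy as np

# Minimal sketch of the LSR closed-form solution W = (X X^T)^{-1} X Y^T (Eq. (4)),
# with each sample augmented by a constant 1 to absorb the bias term.
rng = np.random.default_rng(0)
d, n, c = 5, 50, 3

X = rng.standard_normal((d, n))              # d features, n samples
labels = rng.integers(0, c, size=n)
Y = np.eye(c)[labels].T                      # one-hot class indicator matrix, c x n

X_aug = np.vstack([X, np.ones((1, n))])      # augmented data matrix, (d+1) x n
W = np.linalg.solve(X_aug @ X_aug.T, X_aug @ Y.T)   # (d+1) x c

pred = np.argmax(W.T @ X_aug, axis=0)        # class predictions on the training set
print(pred.shape)
```

Using `np.linalg.solve` on the normal equations avoids forming the explicit inverse, which is the numerically preferred way to evaluate Equation (4).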

Ridge Regression
From Equation (4), we can see that when the sample number n is less than the feature number d (n < d), the matrix XX^T is singular, which leads to a non-unique solution.
To overcome this shortcoming of LSR, a regularized LSR method named ridge regression (RR) [22] has been proposed, which integrates l2-regularization into LSR. The objective function of RR is defined as follows:

min_W ||W^T X − Y||_F^2 + λ||W||_F^2,    (5)

where λ > 0 is a regularization parameter. Taking the derivative of Equation (5) with respect to W and then setting it to zero, we obtain the matrix W:

W = (X X^T + λI)^(−1) X Y^T.    (6)
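The contrast with Equation (4) is easy to demonstrate numerically: the sketch below, with deliberately fewer samples than features, shows that X X^T is rank-deficient while the ridge system of Equation (6) remains solvable (synthetic data, our variable names).

```python
import numpy as np

# Sketch of ridge regression, Eq. (6): W = (X X^T + lambda*I)^{-1} X Y^T.
# When n < d, X X^T is singular and plain LSR fails; the lambda*I term fixes this.
rng = np.random.default_rng(1)
d, n, c = 100, 30, 4                       # deliberately n < d: X X^T is rank-deficient
X = rng.standard_normal((d, n))
Y = np.eye(c)[rng.integers(0, c, size=n)].T

lam = 0.01                                 # the regularization parameter lambda
W = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ Y.T)   # well-posed for any lam > 0

print(np.linalg.matrix_rank(X @ X.T) < d)  # True: the unregularized system is singular
```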

Label Propagation
Suppose there is a sample set X = [x_1, x_2, ..., x_l, x_{l+1}, ..., x_n] ∈ R^(d×n) drawn from c classes, in which the first l (l < n) samples X_l = [x_1, x_2, ..., x_l] ∈ R^(d×l) are labeled and the remaining n − l samples X_u = [x_{l+1}, x_{l+2}, ..., x_n] ∈ R^(d×(n−l)) are unlabeled. Let Y = [y_1, y_2, ..., y_n] ∈ {0, 1}^(c×n) be a label matrix. Specifically, we set y_{ij} = 1 if x_i is a labeled sample belonging to the j-th class, and y_{ik} = 0 (k ≠ j) otherwise. The core purpose of label propagation is to estimate the labels of the unlabeled data. We define G as a weighted undirected graph, in which each node corresponds to a data sample in X. The weight of the edge between x_i and x_j is defined as

S_{ij} = exp(−||x_i − x_j||^2 / (2σ^2)) if x_j ∈ N_k(x_i), and S_{ij} = 0 otherwise,    (7)

where N_k(x_i) is the set of k-nearest neighbors of x_i and σ is a parameter that determines the decay rate of the heat kernel function.
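A minimal sketch of the graph construction in Equation (7) follows, assuming the common convention that a node is not its own neighbor; the values of `k` and `sigma` are illustrative choices.

```python
import numpy as np

# Sketch of the k-nearest-neighbor heat-kernel graph in Eq. (7):
# S_ij = exp(-||x_i - x_j||^2 / (2*sigma^2)) if x_j is among the k nearest
# neighbors of x_i, and 0 otherwise.
def heat_kernel_graph(X, k=5, sigma=1.0):
    """X: d x n data matrix -> n x n weight matrix of the graph G."""
    n = X.shape[1]
    diff = X.T[:, None, :] - X.T[None, :, :]          # pairwise differences
    dist2 = np.sum(diff ** 2, axis=2)                 # squared Euclidean distances
    S = np.exp(-dist2 / (2 * sigma ** 2))
    # keep only the k nearest neighbors of each node (excluding the node itself)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        order = np.argsort(dist2[i])                  # order[0] is the node itself
        mask[i, order[1:k + 1]] = True
    return np.where(mask, S, 0.0)

X = np.random.default_rng(2).standard_normal((4, 20))
S = heat_kernel_graph(X, k=5, sigma=1.0)
print(S.shape, int((S[0] > 0).sum()))
```

Note that the resulting S is not necessarily symmetric (the kNN relation is directed), which is why LP methods typically symmetrize it before use.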
Let F = [f_1, f_2, ..., f_n] ∈ R^(c×n) be a predicted label matrix, in which f_i ∈ R^c is the predicted label vector of x_i. In terms of the LP theory [43,44], the predicted labels of the nodes in graph G are estimated by

min_F (1/2) Σ_{i,j} S_{ij} ||f_i − f_j||^2 + μ Σ_{i=1}^{l} ||f_i − y_i||^2,    (8)

where μ is a balance parameter that controls the discrepancy between the predicted label vector and the true label vector of the labeled samples. From Equation (7), we can see that the weighted undirected graph in LP is constructed in advance and remains unchanged during the label propagation procedure. Furthermore, as seen from the objective function in Equation (8), LP propagates the label information of the labeled samples to the unlabeled ones in the training set. However, since LP fails to provide an explicit classifier, it suffers from the "out-of-sample" problem.
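For intuition, label propagation on a fixed graph can be sketched as below, in the spirit of LGC [44]: labels are spread along the normalized graph and repeatedly re-anchored at the labeled points. The trade-off `a` plays the role of the balance parameter μ in Equation (8); the two-cluster data are synthetic.

```python
import numpy as np

# LGC-style label propagation: iterate F <- a * F @ P + (1 - a) * Y with P the
# symmetrically normalized weight matrix D^{-1/2} S D^{-1/2}.
def propagate_labels(S, Y, a=0.9, n_iter=100):
    """S: n x n weights, Y: c x n initial labels (zero columns = unlabeled)."""
    d = np.maximum(S.sum(axis=1), 1e-12)
    P = S / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]
    F = Y.copy()
    for _ in range(n_iter):
        F = a * (F @ P) + (1 - a) * Y                 # spread, then re-anchor labels
    return F

# two well-separated clusters with one labeled point each
X = np.hstack([np.random.default_rng(3).normal(0, 0.1, (2, 10)),
               np.random.default_rng(4).normal(5, 0.1, (2, 10))])
dist2 = ((X.T[:, None] - X.T[None, :]) ** 2).sum(-1)
S = np.exp(-dist2)
np.fill_diagonal(S, 0.0)
Y = np.zeros((2, 20)); Y[0, 0] = 1.0; Y[1, 10] = 1.0  # one labeled sample per cluster
F = propagate_labels(S, Y)
print(np.argmax(F, axis=0))
```

With two tight, distant clusters, every column of F ends up assigned to the class of the single labeled sample in its cluster, illustrating how one label per class can suffice.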

The Proposed Method
In this section, we first propose the semi-supervised ridge regression with adaptive graph-based label propagation (SSRR-AGLP) algorithm, which integrates adaptive graph learning, label propagation and ridge regression into a unified framework. Next, an optimization method based on iterative updating rules is designed to solve the objective function of SSRR-AGLP. Then, the convergence analysis of the optimization algorithm is also provided. Finally, the classification criterion for the testing samples is given.

Objective Function of SSRR-AGLP
In the proposed method, we first divide the dataset into a training set and a testing set, and the training set is further divided into a labeled sample set and an unlabeled sample set. Suppose the training sample set is X = [X_l, X_u] ∈ R^(d×n), where X_l = [x_1, x_2, ..., x_l] ∈ R^(d×l) represents the labeled samples of the training set and X_u = [x_{l+1}, x_{l+2}, ..., x_n] ∈ R^(d×(n−l)) represents the unlabeled samples of the training set. In addition, n represents the number of training samples and d represents the feature dimension of the samples. Y = [y_1, y_2, ..., y_l, y_{l+1}, ..., y_n] ∈ {0, 1}^(c×n) is the label matrix of the training samples, where y_i ∈ R^(c×1) (1 ≤ i ≤ n) represents the label vector of the i-th sample, c is the total number of classes, and y_{ij} is the j-th element of y_i. If x_i is a labeled sample belonging to the j-th class, then y_{ij} = 1; otherwise y_{ij} = 0. If x_i is an unlabeled sample, all elements of y_i are 0, that is, for all i > l, y_i = 0 ∈ R^(c×1).
The first aim of the proposed SSRR-AGLP is to utilize the label information of the labeled samples to predict the labels of the unlabeled samples. Therefore, the LP algorithm is adopted in this work. However, different from the traditional LP algorithm, a relationship graph between samples is first constructed from the reconstruction coefficients of the samples. The objective function is formulated as follows:

min_S Σ_{i=1}^{n} ||x_i − X s_i||_2^2 = ||X − XS||_F^2,  s.t. S ≥ 0,    (9)

where s_i ∈ R^(n×1), the i-th column of S ∈ R^(n×n), is the reconstruction coefficient vector of sample x_i, and each of its elements denotes the weight of the edge between the sample x_i and another sample in the graph. In order to make the reconstruction coefficients physically more meaningful, a non-negativity constraint on the reconstruction coefficients is introduced in our model. The non-negativity constraint can enhance the discriminability of the reconstruction coefficients, so that a sample is more likely to be reconstructed by samples from the same cluster.

The connection relationship between sample pairs in the graph is represented by the reconstruction coefficients. For instance, if the coefficient of the sample pair x_j and x_i is non-zero, then there exists an edge between them in the graph and the edge weight S_{ij} is set to the coefficient corresponding to x_j. Thus, the local information of the data can be exploited by selecting the nearest neighbors of a sample. As mentioned above, the locality and sparsity constraint term for S can be defined as follows:

min_S ||E ⊙ S||_1,  s.t. S ≥ 0,    (10)

where ||·||_1 is the l1-norm of a matrix and ⊙ is the element-wise multiplication. E = [e_{ij}] ∈ R^(n×n) is the local adaptation matrix, in which the element e_{ij} is defined as follows:

e_{ij} = ||x_i − x_j||_2^2.    (11)

From Equation (11), we can clearly see that a smaller e_{ij} indicates that x_i is more similar to x_j, and vice versa. Hence, minimizing Equation (10) with respect to S assigns small or nearly zero reconstruction coefficients to samples that are far from x_i. This means that if two samples are distant from each other, they are unlikely to be connected in the graph.
Combining Equation (9) with (10), the adaptive graph model in the label propagation is formulated as follows:

min_S ||X − XS||_F^2 + α||E ⊙ S||_1,  s.t. S ≥ 0,    (12)

where α > 0 is a balance parameter that controls the importance of the locality and sparsity constraint term. Subsequently, let F = [f_1, f_2, ..., f_n] ∈ R^(c×n) denote the prediction label matrix, where f_i ∈ R^(c×1) is a column vector that denotes the probability of the sample x_i belonging to each class. For example, the largest value f_{ij} means the highest probability of the sample x_i belonging to the j-th class. In terms of the LP theory [37,38], nearby or similar samples, and samples from the same global cluster, should share similar labels. Therefore, the objective function of LP is defined as follows:

min_F Σ_{i,j} R_{ij} ||f_i − f_j||_2^2,    (13)

where R = (S + S^T)/2 denotes the weight matrix. Moreover, in order to make the predicted labels and the true labels of the labeled samples as close as possible, we introduce the penalty constraint term

min_F tr((F − Y) U (F − Y)^T),    (14)

where U ∈ R^(n×n) is a diagonal selection matrix whose element u_{ii} is defined as follows:

u_{ii} = 1 if x_i is a labeled sample, and u_{ii} = 0 otherwise.    (15)

Combining Equations (12)-(14), the adaptive graph label propagation model is defined as follows:

min_{S,F} ||X − XS||_F^2 + α||E ⊙ S||_1 + β[Σ_{i,j} R_{ij} ||f_i − f_j||_2^2 + tr((F − Y) U (F − Y)^T)],  s.t. S ≥ 0, F ≥ 0,    (16)

where β > 0 is a balance parameter that controls the importance of the label propagation term. The second aim of our proposed SSRR-AGLP is to make full use of the predicted labels of the samples to learn a classifier. Thus, the ridge regression model is introduced into our method, and the objective function of RR becomes

min_W ||W^T X − F||_F^2 + γ||W||_F^2,  s.t. W ≥ 0,    (17)

where γ > 0 is a balance parameter that prevents the RR model from over-fitting. Finally, combining Equation (16) with (17), the objective function of SSRR-AGLP is formulated as follows:

min_{S,F,W} ||X − XS||_F^2 + α||E ⊙ S||_1 + β[Σ_{i,j} R_{ij} ||f_i − f_j||_2^2 + tr((F − Y) U (F − Y)^T)] + ||W^T X − F||_F^2 + γ||W||_F^2,  s.t. S ≥ 0, F ≥ 0, W ≥ 0.    (18)

From Equation (18), it can clearly be seen that the proposed model combines the adaptive graph, the label propagation algorithm and the ridge regression algorithm to solve the following problems: (a) By introducing the LP algorithm into our method, it can solve the problem that the traditional RR algorithm cannot make use of unlabeled information.
In addition, it can learn a classifier to deal with the "out-of-sample" problem.
(b) By integrating the adaptive graph learning and the LP algorithm into a unified framework, it overcomes the limitation of the traditional LP algorithm, which needs to construct a graph in advance.

Optimization Solution
In Equation (18), the objective function involves three variables: the transformation matrix W, the prediction label matrix F and the weight matrix S. Unfortunately, the objective function of our algorithm is not jointly convex in these three variables, so the global optimal solution cannot be obtained directly. To solve this problem, an iterative optimization scheme is proposed in this subsection, which proceeds by fixing two of the variables and updating the remaining one.

Fix Transformation Matrix W and Prediction Label Matrix F to Solve Weight Matrix S
Removing the terms that are not related to the matrix S in Equation (18), the optimization problem for the variable S can be obtained as follows:

min_S ||X − XS||_F^2 + α||E ⊙ S||_1 + β Σ_{i,j} S_{ij} ||f_i − f_j||_2^2,  s.t. S ≥ 0.    (19)

Through a series of algebraic manipulations, Equation (19) is simplified to

min_S ε(S) = tr(X^T X − 2S^T X^T X + S^T X^T X S) + α tr(ES) + β tr(QS),  s.t. S ≥ 0,    (20)

where Q = [q_{ij}] ∈ R^(n×n) and the element q_{ij} is defined as q_{ij} = ||f_i − f_j||_2^2. To solve the above problem, we need to introduce the Lagrange multiplier matrix ψ. The Lagrange function of Equation (20) is:

L(S) = tr(X^T X − 2S^T X^T X + S^T X^T X S) + α tr(ES) + β tr(QS) + tr(ψ S^T).    (21)

Setting the derivative with respect to S to zero, we obtain

∂L/∂S = −2X^T X + 2X^T X S + αE + βQ + ψ = 0.    (22)

According to the KKT condition ψ_{ij} S_{ij} = 0 [51], we obtain

(−2X^T X + 2X^T X S + αE + βQ)_{ij} S_{ij} = 0.    (23)

According to Equation (23), the updating rule of S is

S_{ij} ← S_{ij} (2X^T X)_{ij} / (2X^T X S + αE + βQ)_{ij}.    (24)
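As a concrete illustration, the multiplicative update in Equation (24) can be sketched in a few lines of NumPy. The data and the matrices E and Q below are synthetic nonnegative stand-ins (the update implicitly assumes nonnegative data, as with the face images used later), and the small `eps` guards against division by zero.

```python
import numpy as np

# Multiplicative update for S, Eq. (24):
# S_ij <- S_ij * (2 X^T X)_ij / (2 X^T X S + alpha*E + beta*Q)_ij
def update_S(S, X, E, Q, alpha=1.0, beta=1.0, eps=1e-12):
    numer = 2 * X.T @ X
    denom = 2 * X.T @ X @ S + alpha * E + beta * Q + eps
    return S * numer / denom

rng = np.random.default_rng(5)
d, n = 8, 15
X = rng.random((d, n))                                   # nonnegative data
E = ((X.T[:, None] - X.T[None, :]) ** 2).sum(-1)         # local adaptation matrix
F = rng.random((3, n))
Q = ((F.T[:, None] - F.T[None, :]) ** 2).sum(-1)         # q_ij = ||f_i - f_j||^2
S = rng.random((n, n))

# S-subproblem objective with alpha = beta = 1
obj = lambda S: (np.linalg.norm(X - X @ S, 'fro') ** 2
                 + np.trace(E @ S) + np.trace(Q @ S))
before = obj(S)
for _ in range(50):
    S = update_S(S, X, E, Q)
print(obj(S) < before)   # the objective value decreases under the update
```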

Fix Weight Matrix S and Transformation Matrix W to Solve Prediction Label Matrix F
Removing the terms that are not related to the matrix F in Equation (18), the optimization problem for the variable F can be obtained as follows:

min_F β[Σ_{i,j} R_{ij} ||f_i − f_j||_2^2 + tr((F − Y) U (F − Y)^T)] + ||W^T X − F||_F^2,  s.t. F ≥ 0.    (25)

Through a series of algebraic manipulations, Equation (25) is simplified to

min_F ε(F) = 2β tr(F (D − R) F^T) + β tr((F − Y) U (F − Y)^T) + ||W^T X − F||_F^2,  s.t. F ≥ 0,    (26)

where D is a diagonal matrix with entries D_ii = Σ_j R_{ij}. To solve the problem in Equation (26), we also need to introduce the Lagrange multiplier matrix Λ. The Lagrange function of Equation (26) is:

L(F) = 2β tr(F (D − R) F^T) + β tr((F − Y) U (F − Y)^T) + ||W^T X − F||_F^2 + tr(Λ F^T).    (27)

Setting the derivative with respect to F to zero, we obtain

∂L/∂F = 4βF(D − R) + 2β(F − Y)U + 2(F − W^T X) + Λ = 0.    (28)

According to the KKT condition Λ_{ij} F_{ij} = 0 [51], we obtain

(2βFD − 2βFR + βFU − βYU + F − W^T X)_{ij} F_{ij} = 0.    (29)

According to Equation (29), the updating rule of F is

F_{ij} ← F_{ij} (2βFR + βYU + W^T X)_{ij} / (2βFD + βFU + F)_{ij}.    (30)
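A sketch of this update follows; note that the exact constant factors in the F-subproblem are our reading of the derivation, so the code should be taken as illustrative. All inputs are synthetic nonnegative stand-ins.

```python
import numpy as np

# Multiplicative update for F of the form
# F_ij <- F_ij * (2b*F R + b*Y U + W^T X)_ij / (2b*F D + b*F U + F)_ij,
# with b the balance parameter beta and D_ii = sum_j R_ij.
def update_F(F, R, U, Y, WtX, beta=1.0, eps=1e-12):
    D = np.diag(R.sum(axis=1))
    numer = 2 * beta * F @ R + beta * Y @ U + WtX
    denom = 2 * beta * F @ D + beta * F @ U + F + eps
    return F * numer / denom

rng = np.random.default_rng(6)
c, n = 3, 12
S = rng.random((n, n))
R = (S + S.T) / 2                               # symmetrized weight matrix
U = np.diag((np.arange(n) < 4).astype(float))   # first 4 samples are labeled
Y = np.zeros((c, n)); Y[rng.integers(0, c, 4), np.arange(4)] = 1.0
WtX = rng.random((c, n))                        # stand-in for W^T X (nonnegative)
F = rng.random((c, n))

L = np.diag(R.sum(axis=1)) - R                  # graph Laplacian of R
obj = lambda F: (2 * np.trace(F @ L @ F.T)
                 + np.trace((F - Y) @ U @ (F - Y).T)
                 + np.linalg.norm(WtX - F, 'fro') ** 2)
before = obj(F)
for _ in range(50):
    F = update_F(F, R, U, Y, WtX)
print(obj(F) < before)   # the F-subproblem objective decreases
```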

Fix Weight Matrix S and Prediction Label Matrix F to Solve Transformation Matrix W
Removing the terms that are not related to the matrix W in Equation (18), the optimization problem for the variable W can be obtained as follows:

min_W ||W^T X − F||_F^2 + γ||W||_F^2,  s.t. W ≥ 0.    (31)

Through a series of algebraic manipulations, Equation (31) is simplified to

min_W ε(W) = tr(X^T W W^T X − 2F^T W^T X + F^T F) + γ tr(W^T W),  s.t. W ≥ 0.    (32)

To solve the above problem, we need to introduce the Lagrange multiplier matrix θ. The Lagrange function of Equation (32) is:

L(W) = tr(X^T W W^T X − 2F^T W^T X + F^T F) + γ tr(W^T W) + tr(θ W^T).    (33)

The partial derivative of Equation (33) with respect to W is

∂L/∂W = 2X X^T W − 2X F^T + 2γW + θ.    (34)

Setting the derivative equal to zero, we obtain

θ = 2X F^T − 2X X^T W − 2γW.    (35)

According to the KKT condition θ_{ij} W_{ij} = 0 [51], we obtain

(X X^T W − X F^T + γW)_{ij} W_{ij} = 0.    (36)

According to Equation (36), the updating rule of W is:

W_{ij} ← W_{ij} (X F^T)_{ij} / (X X^T W + γW)_{ij}.    (37)
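This update is again of the standard multiplicative form and can be sketched directly; the data below are synthetic and nonnegative.

```python
import numpy as np

# Multiplicative update for W, Eq. (37):
# W_ij <- W_ij * (X F^T)_ij / (X X^T W + gamma*W)_ij
def update_W(W, X, F, gamma=0.01, eps=1e-12):
    return W * (X @ F.T) / (X @ X.T @ W + gamma * W + eps)

rng = np.random.default_rng(7)
d, n, c = 10, 25, 3
X = rng.random((d, n))
F = rng.random((c, n))
W = rng.random((d, c))

# W-subproblem objective, Eq. (31), with gamma = 0.01
obj = lambda W: (np.linalg.norm(W.T @ X - F, 'fro') ** 2
                 + 0.01 * np.linalg.norm(W, 'fro') ** 2)
before = obj(W)
for _ in range(100):
    W = update_W(W, X, F)
print(obj(W) < before)   # the regularized regression objective decreases
```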

The Optimization Algorithm
In summary, we provide the primary optimization procedure of the proposed algorithm in Algorithm 1.

Algorithm 1. The algorithm to solve the objective function of SSRR-AGLP
Input: the training set X = [x_1, x_2, ..., x_l, x_{l+1}, ..., x_n] = [X_l, X_u] ∈ R^(d×n), the label matrix of the training set Y = [y_1, y_2, ..., y_l, y_{l+1}, ..., y_n] ∈ {0, 1}^(c×n)
1: Initialization: set the parameters α, β = 1 and γ = 0.01; initialize W_t ∈ R^(d×c), F_t ∈ R^(c×n) and S_t ∈ R^(n×n) as arbitrary nonnegative matrices; t = 0
2: According to Equations (11) and (15), calculate the matrices E ∈ R^(n×n) and U ∈ R^(n×n), respectively
3: Repeat steps 4-8 until the convergence condition is satisfied:
4: According to S_t, calculate R_t = (S_t + S_t^T)/2, and then calculate the diagonal matrix D_t with (D_t)_ii = Σ_j (R_t)_ij and the matrix Q_t with (Q_t)_ij = ||f_i − f_j||_2^2
5: Update S_{t+1} according to Equation (24)
6: Update F_{t+1} according to Equation (30)
7: Update W_{t+1} according to Equation (37)
8: t = t + 1
Output: the weight matrix S, the prediction label matrix F and the transformation matrix W

Clearly, S, F and W are updated alternately in each iteration of Algorithm 1, which indicates that graph learning, label propagation and classifier learning are jointly implemented in the proposed SSRR-AGLP. In addition, the predicted label matrix F and the transformation matrix W of RR affect each other in each iteration, which makes both the classifier and the predicted labels more accurate.
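Putting the three updates together, Algorithm 1 can be sketched end-to-end as below. The data, parameter values and fixed iteration count are illustrative assumptions, not the authors' experimental setup, and the constant factors in the F update follow our reading of the derivation.

```python
import numpy as np

# Compact sketch of Algorithm 1: alternate the multiplicative updates for
# S (Eq. 24), F (Eq. 30) and W (Eq. 37) for a fixed number of iterations.
rng = np.random.default_rng(8)
d, n, c, l = 6, 20, 2, 6                      # l labeled out of n training samples
X = rng.random((d, n))                        # nonnegative data (e.g. pixel values)
Y = np.zeros((c, n)); Y[rng.integers(0, c, l), np.arange(l)] = 1.0

alpha, beta, gamma = 1.0, 1.0, 0.01
E = ((X.T[:, None] - X.T[None, :]) ** 2).sum(-1)     # local adaptation matrix
U = np.diag((np.arange(n) < l).astype(float))        # selects the labeled samples
S, F, W = rng.random((n, n)), rng.random((c, n)), rng.random((d, c))

for t in range(100):
    R = (S + S.T) / 2                                # symmetrized graph
    D = np.diag(R.sum(axis=1))
    Q = ((F.T[:, None] - F.T[None, :]) ** 2).sum(-1) # q_ij = ||f_i - f_j||^2
    S = S * (2 * X.T @ X) / (2 * X.T @ X @ S + alpha * E + beta * Q + 1e-12)
    F = F * (2 * beta * F @ R + beta * Y @ U + W.T @ X) / \
            (2 * beta * F @ D + beta * F @ U + F + 1e-12)
    W = W * (X @ F.T) / (X @ X.T @ W + gamma * W + 1e-12)

pred = np.argmax(F, axis=0)                   # predicted classes for all training samples
print(pred.shape)
```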

Classification Criterion
Given a testing sample x_test, its predicted label vector f = [f_1, f_2, ..., f_c]^T ∈ R^(c×1) is computed as follows:

f = W^T x_test.    (38)

Then, we adopt f to assign a single class label to the testing sample, and the rule is:

label(x_test) = arg max_j f_j.    (39)
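The rule amounts to a single matrix-vector product followed by an argmax; a toy sketch (the values of W below are made up):

```python
import numpy as np

# Classification rule: f = W^T x_test, then pick the class with the largest score.
def classify(W, x_test):
    f = W.T @ x_test          # predicted label vector, one score per class
    return int(np.argmax(f))

W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])    # d = 3 features, c = 2 classes (toy values)
x_test = np.array([0.9, 0.1, 0.2])
print(classify(W, x_test))    # class 0: W.T @ x_test = [1.0, 0.2]
```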

Convergence Analysis
The convergence of the updating rules in Equations (24), (30) and (37) is analyzed in this section. Similar to [50], the definition of the auxiliary function is first given: ϑ(h, h') is an auxiliary function of φ(h) if the conditions ϑ(h, h') ≥ φ(h) and ϑ(h, h) = φ(h) are satisfied. The auxiliary function plays a very important role in the following lemma.

Lemma 1.
If ϑ is an auxiliary function of φ, then φ is non-increasing under the following updating formula:

h^{t+1} = arg min_h ϑ(h, h^t).    (40)

Proof. φ(h^{t+1}) ≤ ϑ(h^{t+1}, h^t) ≤ ϑ(h^t, h^t) = φ(h^t).
First, we show that the updating rule for S in Equation (24) is exactly the update in Equation (40) with a proper auxiliary function. Considering any element S_ij of S, let φ_ij(S_ij) denote the part of the objective function of SSRR-AGLP that is only relevant to S_ij. Its first-order and second-order derivatives are

φ'_ij(S_ij) = (2X^T X S − 2X^T X + αE + βQ)_ij,    (41)
φ''_ij(S_ij) = 2(X^T X)_ii.    (42)

Lemma 2.
The function ϑ_ij(S_ij, S^t_ij) is an auxiliary function for φ_ij(S_ij), which is formulated as:

ϑ_ij(S_ij, S^t_ij) = φ_ij(S^t_ij) + φ'_ij(S^t_ij)(S_ij − S^t_ij) + [(2X^T X S^t + αE + βQ)_ij / (2S^t_ij)] (S_ij − S^t_ij)^2.    (43)

Proof. We first write the Taylor series expansion of φ_ij(S_ij):

φ_ij(S_ij) = φ_ij(S^t_ij) + φ'_ij(S^t_ij)(S_ij − S^t_ij) + (X^T X)_ii (S_ij − S^t_ij)^2.    (44)

According to Equation (44), ϑ_ij(S_ij, S^t_ij) ≥ φ_ij(S_ij) is equivalent to

(2X^T X S^t + αE + βQ)_ij / (2S^t_ij) ≥ (X^T X)_ii.    (45)

Then, we obtain

(X^T X S^t)_ij = Σ_k (X^T X)_ik S^t_kj ≥ (X^T X)_ii S^t_ij,  αE_ij + βQ_ij ≥ 0.    (46)

Thus, Equation (45) holds and ϑ_ij(S_ij, S^t_ij) ≥ φ_ij(S_ij). Furthermore, we can see that ϑ_ij(S_ij, S_ij) = φ_ij(S_ij). Minimizing ϑ_ij(S_ij, S^t_ij) with respect to S_ij yields exactly the updating rule in Equation (24).
Second, we show that the updating rule for F in Equation (30) is exactly the update in Equation (40) with a proper auxiliary function. Considering any element F_ij of F, let φ_ij(F_ij) denote the part of the objective function of SSRR-AGLP that is only relevant to F_ij. Its first-order and second-order derivatives are

φ'_ij(F_ij) = (4βF(D − R) + 2β(F − Y)U + 2(F − W^T X))_ij,    (47)
φ''_ij(F_ij) = 4β(D − R)_jj + 2βU_jj + 2.    (48)

Lemma 3.
The function ϑ_ij(F_ij, F^t_ij) is an auxiliary function for φ_ij(F_ij), which is defined as:

ϑ_ij(F_ij, F^t_ij) = φ_ij(F^t_ij) + φ'_ij(F^t_ij)(F_ij − F^t_ij) + [(2βF^t D + βF^t U + F^t)_ij / F^t_ij] (F_ij − F^t_ij)^2.    (49)

Proof. First, the Taylor series expansion of φ_ij(F_ij) is:

φ_ij(F_ij) = φ_ij(F^t_ij) + φ'_ij(F^t_ij)(F_ij − F^t_ij) + [2β(D − R)_jj + βU_jj + 1] (F_ij − F^t_ij)^2.    (50)

According to Equation (50), ϑ_ij(F_ij, F^t_ij) ≥ φ_ij(F_ij) is equivalent to

(2βF^t D + βF^t U + F^t)_ij / F^t_ij ≥ 2β(D − R)_jj + βU_jj + 1.    (51)

Then, since D and U are diagonal, we have

(F^t D)_ij = F^t_ij D_jj ≥ F^t_ij (D − R)_jj,  (F^t U)_ij = F^t_ij U_jj.    (52)

Thus, Equation (51) holds and ϑ_ij(F_ij, F^t_ij) ≥ φ_ij(F_ij). Furthermore, we can see that ϑ_ij(F_ij, F_ij) = φ_ij(F_ij), and minimizing ϑ_ij(F_ij, F^t_ij) with respect to F_ij yields exactly the updating rule in Equation (30).

Subsequently, we describe the updating rule for W in Equation (37), which is exactly the update in Equation (40) with a proper auxiliary function. For any element W_ij of W, φ_ij(W_ij) is adopted to represent the part of the objective function of SSRR-AGLP that is only relevant to W_ij. Its first-order and second-order derivatives are

φ'_ij(W_ij) = (2X X^T W − 2X F^T + 2γW)_ij,    (53)
φ''_ij(W_ij) = 2(X X^T)_ii + 2γ.    (54)

Lemma 4. The function

ϑ_ij(W_ij, W^t_ij) = φ_ij(W^t_ij) + φ'_ij(W^t_ij)(W_ij − W^t_ij) + [(X X^T W^t + γW^t)_ij / W^t_ij] (W_ij − W^t_ij)^2    (55)

is an auxiliary function for φ_ij(W_ij).
Proof. The Taylor series expansion of φ_ij(W_ij) is:

φ_ij(W_ij) = φ_ij(W^t_ij) + φ'_ij(W^t_ij)(W_ij − W^t_ij) + [(X X^T)_ii + γ] (W_ij − W^t_ij)^2.    (56)

According to Equation (56), ϑ_ij(W_ij, W^t_ij) ≥ φ_ij(W_ij) is equivalent to

(X X^T W^t + γW^t)_ij / W^t_ij ≥ (X X^T)_ii + γ.    (57)

Then, since (X X^T W^t)_ij = Σ_k (X X^T)_ik W^t_kj ≥ (X X^T)_ii W^t_ij and (γW^t)_ij / W^t_ij = γ, Equation (57) holds and ϑ_ij(W_ij, W^t_ij) ≥ φ_ij(W_ij). Furthermore, we can see that ϑ_ij(W_ij, W_ij) = φ_ij(W_ij), and minimizing ϑ_ij(W_ij, W^t_ij) with respect to W_ij yields exactly the updating rule in Equation (37).

Finally, in order to demonstrate the convergence of the updating rules in Equations (24), (30) and (37), we give the following theorem, which follows directly from Lemmas 1-4:

Theorem 1. For S ≥ 0, F ≥ 0 and W ≥ 0, the objective function in Equation (18) is non-increasing under the updating rules in Equations (24), (30) and (37).

Experiment and Analysis
To evaluate the classification performance of the proposed SSRR-AGLP, we test it on five facial image databases (Yale [52], ORL [53], Extended YaleB [54], AR [55] and CMU PIE [56]). The detailed information of the five databases is shown in Table 1, and some sample images from them are displayed in Figure 2. In all experiments, 50% of the samples of each subject are selected randomly for training and the remaining samples are used for testing in each database. Specifically, a part of the samples is randomly selected as labeled data in each training set, and the sizes of the selected training and testing sets are shown in Table 2. Moreover, to justify the superiority of the proposed SSRR-AGLP, we compare it with some well-known algorithms, such as the traditional k-nearest neighbor (KNN) [57], least squares regression (LSR) [1], ridge regression (RR) [22], inter-class sparsity based discriminative least square regression (ICS_DLSR) [18] and locality-constrained and label embedding dictionary learning (LCLE-DL) [35].

Parameter Setting
According to Section 3, there are three parameters, i.e., α, β and γ, to be determined in the objective function of the proposed SSRR-AGLP. The best values of these parameters are tuned by searching the grid {0.0001, 0.001, 0.01, 0.1, 1, 10, 100} in an alternating manner. We first fix the parameter γ at 0.01; the influences of the parameters α and β on the different datasets are illustrated in Figure 3. From these results, the accuracy rates first increase and then decrease as α and β grow on all the databases, which indicates that the locality constraint term and label propagation can improve the performance of SSRR-AGLP. Moreover, when the values of these parameters are set between 1 and 10, the performance of SSRR-AGLP is insensitive to them. Then, we fix the parameters α and β at their best values; the results of our algorithm under various values of the parameter γ on the different databases are listed in Table 3. From this table, we can see that the proposed SSRR-AGLP reaches its best performance when γ is set to 0.01 on all the databases. Moreover, when γ is set between 0.001 and 0.1, the performance of SSRR-AGLP is insensitive to its value. Finally, the best parameter values for our model are {α = 1, β = 10, γ = 0.01} on the Yale database, {α = 1, β = 1, γ = 0.01} on the ORL database, {α = 10, β = 1, γ = 0.01} on the Extended YaleB database, {α = 1, β = 1, γ = 0.01} on the AR database and {α = 10, β = 1, γ = 0.01} on the CMU PIE database. Therefore, according to the experimental results, we can set the parameters α and β to relatively large values (1 to 10) and the parameter γ to a small value (0.001 to 0.1) for application tasks.
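The alternating grid search described above can be sketched as follows; `evaluate` is a hypothetical stand-in for training SSRR-AGLP and measuring validation accuracy, here shaped so that its peak matches one of the reported optima.

```python
import itertools
import math

# Alternating grid search: fix gamma, scan (alpha, beta); then fix the best
# (alpha, beta) and scan gamma. The grid matches the one used in the paper.
GRID = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]

def evaluate(alpha, beta, gamma):
    # toy surrogate for validation accuracy, peaked at alpha=1, beta=10, gamma=0.01
    return -(math.log10(alpha)) ** 2 - (math.log10(beta) - 1) ** 2 \
           - (math.log10(gamma) + 2) ** 2

gamma = 0.01
best_ab = max(itertools.product(GRID, GRID),
              key=lambda ab: evaluate(ab[0], ab[1], gamma))
best_gamma = max(GRID, key=lambda g: evaluate(best_ab[0], best_ab[1], g))
print(best_ab, best_gamma)
```

Searching on a logarithmic grid like this is standard when a parameter's useful range spans several orders of magnitude.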

Experimental Results and Analysis
To fairly compare the performance of our approach with the other algorithms, the average accuracy rates and standard deviations over 10 random training-sample selections are listed in Table 4. From Table 4, we can make the following observations: (1) KNN is sensitive to noise; its accuracy rate is lower than those of the other methods. (2) The performance of RR is better than that of LSR, which indicates that RR can avoid the over-fitting problem effectively. (3) The accuracy rate of ICS_DLSR is higher than those of LSR and RR. This is because the inter-class sparsity regularization term can adequately improve the discriminative ability of the transformation matrix. (4) By exploiting the locality structure of the learned dictionary, LCLE-DL outperforms RR and ICS_DLSR in most cases, which implies that the locality of data can improve the performance effectively. (5) The performance of the proposed SSRR-AGLP is consistently superior to the other algorithms on all five face image databases. This is because SSRR-AGLP not only employs both the labeled and unlabeled data to train the classifiers, but also takes advantage of the locality and sparsity of data for adaptive graph construction.
In the second experiment, to verify the performance of the proposed SSRR-AGLP under different numbers of labeled samples, we select 50% of the samples of each facial database as the training set, in which different numbers of samples are selected as labeled samples. Table 5 lists the best average accuracy rates obtained by SSRR-AGLP with varied numbers of labeled samples. We can see that as the number of labeled samples increases, the performance of the proposed SSRR-AGLP improves gradually.
In the third experiment, the labels predicted by SSRR-AGLP are compared with those of the traditional label propagation (LP) algorithm to evaluate the performance of SSRR-AGLP.
Specifically, the number of selected labeled samples for each database is set the same as in the first experiment, and the accuracy of the labels predicted by SSRR-AGLP and LP is listed in Table 6. It can be seen that the performance of SSRR-AGLP is superior to that of LP due to the cooperation of adaptive graph learning and label propagation. In other words, compared with traditional LP, which predicts labels using a graph constructed in advance, the adaptive graph learning in SSRR-AGLP improves the prediction performance. Besides, using the labels predicted by the traditional LP algorithm, we apply the ridge regression (RR) algorithm (denoted as LP + RR) for classification, and the results for the five databases are listed in Table 7. From Table 7, we can see that the performance of LP + RR is better than that of the RR algorithm, but still worse than that of SSRR-AGLP. This indicates that combining the LP and RR algorithms in a unified framework benefits classification tasks. Finally, Figure 4 displays the convergence curves of SSRR-AGLP on the five facial image databases. In this figure, the x-axis and y-axis denote the number of iterations and the value of the objective function, respectively. We can see that the proposed iterative updating algorithm converges very fast (usually within 20 iterations).
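For reference, the LP + RR baseline can be sketched with the standard closed forms: label propagation F = (I − αS)⁻¹Y (the well-known formulation of Zhou et al.) followed by ridge regression fit to the propagated soft labels. The affinity construction and parameter names below are illustrative assumptions, not the exact model used in this paper.

```python
import numpy as np

def lp_plus_rr(X_train, Y_partial, S, X_test, alpha=0.99, lam=0.01):
    """LP + RR baseline sketch. S is a symmetrically normalized affinity
    matrix over the training samples; Y_partial holds one-hot rows for
    labeled samples and zero rows for unlabeled ones."""
    n = S.shape[0]
    # Label propagation closed form: F = (I - alpha * S)^{-1} Y
    F = np.linalg.solve(np.eye(n) - alpha * S, Y_partial)
    # Ridge regression on the propagated soft labels:
    # W = (X^T X + lam * I)^{-1} X^T F
    p = X_train.shape[1]
    W = np.linalg.solve(X_train.T @ X_train + lam * np.eye(p), X_train.T @ F)
    return np.argmax(X_test @ W, axis=1)  # predicted class indices
```

Because the regression matrix W maps raw features to soft labels, this pipeline can classify samples outside the graph, which is exactly the "out-of-sample" capability the paper attributes to combining LP with RR.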

Conclusions
In this paper, we present a semi-supervised ridge regression algorithm which combines ridge regression with graph learning and label propagation into a unified framework. Compared with other approaches, the proposed SSRR-AGLP not only can adaptively construct graph based on the locality and sparsity of data, but also overcome the "out-of-sample" problem. Moreover, we design an effective iterative updating algorithm to solve the proposed framework and the convergence analysis is also provided accordingly. Extensive experiments indicate the effectiveness and superiority of the proposed SSRR-AGLP.
From the experimental results, the performance of the proposed SSRR-AGLP is affected by the parameter values. Therefore, how to extend our framework to a parameter-free approach is one focus of our future research. Furthermore, in this study we only adopt one distance measurement, i.e., the exponential function in Equation (9), to characterize the local information of the input data. Hence, another direction for future work is to introduce more distance measurements (such as the Euclidean distance, the inner product and so on) into SSRR-AGLP so that the local geometrical structure of the data can be better exploited.