An Efficient Algorithm for Solving the Matrix Optimization Problem in Unsupervised Feature Selection

Abstract: In this paper, we consider the symmetric matrix optimization problem arising in the process of unsupervised feature selection. By relaxing the orthogonal constraint, this problem is transformed into a constrained symmetric nonnegative matrix optimization problem, and an efficient algorithm is designed to solve it. The convergence theorem of the new algorithm is derived. Finally, some numerical examples show that the new method is feasible. Notably, some simulation experiments in unsupervised feature selection illustrate that our algorithm is more effective than the existing algorithms.


Introduction
Throughout this paper, we use R^{n×m} to denote the set of n × m real matrices. We write B ≥ 0 if the matrix B is nonnegative. The symbols Tr(B) and B^T stand for the trace and the transpose of the matrix B, respectively. The symbol ‖α‖ stands for the l_2-norm of the vector α, i.e., ‖α‖ = (α^T α)^{1/2}. The symbol ‖B‖_F stands for the Frobenius norm of the matrix B. The symbol I_p stands for the p × p identity matrix. For matrices A and B, A ⊙ B denotes the Hadamard (elementwise) product of A and B. The symbol max{x, y} represents the greater of x and y.
In this paper, we consider the following symmetric matrix optimization problem in unsupervised feature selection.

Problem 1. Given a matrix A ∈ R^{n×m}, consider the symmetric matrix optimization problem

min_{X,Y} (1/2) ‖A − AXY‖_F^2, s.t. X ≥ 0, Y ≥ 0, X^T X = I_p. (1)

Here A = [f_1, f_2, ..., f_m] ∈ R^{n×m} is the data matrix, X ∈ R^{m×p} is the indicator matrix (feature weight matrix) and Y ∈ R^{p×m} is the coefficient matrix.
Problem 1 arises in unsupervised feature selection, which is an important part of machine learning. This can be stated as follows. Data from image processing, pattern recognition and machine learning are usually high-dimensional. Processing such data directly may increase the computational complexity and the memory requirements of an algorithm; in particular, it may lead to overfitting of the machine learning model. Feature selection is a common dimension reduction method, the goal of which is to find the most representative feature subset of the original features. That is to say, for a given high-dimensional data matrix A, we must find the relationship between the original feature space and the subspace generated by the selected features. Feature selection can be formalized as follows:

min_I d(Span(A), Span(A_I)) = ‖A_{I^c} − A_I Y‖_F, s.t. |I| = k, (2)

where I denotes the index set of the selected features, A_I consists of the selected columns of A, I^c denotes the complement of I, and Y is the coefficient matrix of the initial feature space in terms of the selected features. From the viewpoint of matrix factorization, feature selection can be expressed as

min_{X,Y} (1/2) ‖A − AXY‖_F^2, s.t. X^T X = I_p. (3)

Considering that the data in practical problems are often nonnegative, we add a constraint to guarantee that any feature is described as a positive linear combination of the selected features, so the problem in (3) can be rewritten as in (1).
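As a small numerical illustration of the subset-selection objective in (2): once an index set I is fixed, the best coefficient matrix Y is a least-squares solution, and the residual measures how well the selected columns reproduce the whole data matrix. The helper and the synthetic data below are illustrative, not part of the paper.

```python
import numpy as np

def subset_residual(A, I):
    """Residual ||A - A_I Y||_F for the best coefficient matrix Y,
    where A_I keeps only the columns of A indexed by I."""
    AI = A[:, I]
    # Least-squares coefficients: Y = argmin_Y ||A - A_I Y||_F
    Y, *_ = np.linalg.lstsq(AI, A, rcond=None)
    return np.linalg.norm(A - AI @ Y)

rng = np.random.default_rng(0)
B = rng.random((50, 2))                      # two "true" features
A = np.hstack([B, B @ rng.random((2, 4))])   # plus four combinations of them
r1 = subset_residual(A, [0])                 # one feature is not enough
r2 = subset_residual(A, [0, 1])              # two features reproduce A exactly
print(r1, r2)                                # r2 is numerically zero
```

Selecting more columns can only shrink the residual, which is why (2) fixes the subset size |I| = k.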
In the last few years, many numerical methods have been proposed for solving optimization problems with nonnegativity constraints, and these methods can be broadly classified into two categories: alternating gradient descent methods and alternating nonnegative least squares methods. The most commonly used alternating gradient descent method is the multiplicative update algorithm [1,2]. Although the multiplicative update algorithm is simple to implement, it lacks a convergence guarantee. The alternating nonnegative least squares approach reduces the problem to a sequence of nonnegative least squares subproblems. Many numerical methods, such as the active set method [3], the projected gradient method [4,5], the projected Barzilai-Borwein method [6,7], the projected Newton method [8] and the projected quasi-Newton method [9,10], have been designed to solve these subproblems.
For optimization problems with orthogonal constraints, which are also known as optimization problems on manifolds, there are many algorithms available. In general, these can be divided into two categories: feasible methods and infeasible methods. A feasible method means that the iterate obtained after each iteration must satisfy the orthogonal constraint. Many traditional optimization algorithms, such as the gradient method [11], the conjugate gradient method [12], the trust-region method [13], Newton's method [8] and the quasi-Newton method [14], can be used to deal with optimization problems on manifolds. Wen and Yin [15] proposed the CMBSS algorithm, which combines the Cayley transform and a curvilinear search approach with BB steps. However, the computational complexity increases when the number of variables or the amount of data increases, resulting in low efficiency of this algorithm. Infeasible methods can overcome this disadvantage when facing high-dimensional data. In 2013, Lai and Osher [16] proposed the SOC (splitting orthogonality constraints) method, a splitting method based on Bregman iteration and the ADMM method for orthogonality-constrained problems. The SOC method is a valid and efficient method for solving convex optimization problems, but the proof of its convergence remains open. Thus, Chen et al. [17] put forward a proximal alternating augmented Lagrangian method to solve such optimization problems with a non-smooth objective function and a non-convex constraint.
Some unsupervised feature selection algorithms based on matrix decomposition have been proposed and have achieved good performance, such as SOCFS, MFFS, RUFSM, OPMF and so on. Based on an orthogonal basis clustering algorithm, SOCFS [18] does not explicitly use pre-computed local structure information for the data points as additional terms of its objective function, but directly computes latent cluster information by means of a target matrix, conducting orthogonal basis clustering in a single unified term of the objective function. In 2017, Du et al. [19] proposed RUFSM, in which robust discriminative feature selection and robust clustering are performed simultaneously under the l_{2,1}-norm, while the local manifold structures of the data are preserved. MFFS was developed from the viewpoint of subspace learning. That is, it treats feature selection as a matrix factorization problem and introduces an orthogonal constraint into its objective function to select the most informative features from high-dimensional data.
OPMF [20] incorporates matrix factorization, ordinal locality structure preservation and inner-product regularization into a unified framework, which can not only preserve the ordinal locality structure of the original data, but also achieve sparsity and low redundancy among features.
However, research into Problem 1 is very scarce as far as we know. The greatest difficulty is how to deal with the nonnegative and orthogonal constraints, because the problem has highly structured constraints. In this paper, we first use the penalty technique to deal with the orthogonal constraint and reformulate Problem 1 as a minimization problem with nonnegative constraints. Then, we design a new method for solving this problem. Based on the auxiliary function, the convergence theorem of the new method is derived. Finally, some numerical examples show that the new method is feasible. In particular, some simulation experiments in unsupervised feature selection illustrate that our algorithm is more efficient than the existing algorithms.
The rest of this paper is organized as follows. A new algorithm is proposed to solve Problem 1 in Section 2 and the convergence analysis is given in Section 3. In Section 4, some numerical examples are reported. Numerical tests on the proposed algorithm applied to unsupervised feature selection are also reported in that section.

A New Algorithm for Solving Problem 1
In this section we first design a new algorithm for solving Problem 1; then, we present the properties of this algorithm.
Problem 1 is difficult to solve due to the orthogonal constraint X^T X = I_p. Fortunately, this difficulty can be overcome by adding a penalty term for the constraint. Therefore, Problem 1 can be transformed into the following form:

min_{X ≥ 0, Y ≥ 0} F(X, Y) = (1/2) ‖A − AXY‖_F^2 + (ρ/2) ‖X^T X − I_p‖_F^2, (4)

where ρ is a penalty coefficient. This extra term is used to penalize the divergence between X^T X and I_p. The parameter ρ is chosen by users to make a trade-off between making (1/2)‖A − AXY‖_F^2 small and ensuring that ‖X^T X − I_p‖_F^2 is not excessively large. Let the Lagrange function of (4) be

L(X, Y, α, β) = F(X, Y) − Tr(α^T X) − Tr(β^T Y),

where α ∈ R^{m×p} and β ∈ R^{p×m} are the Lagrangian multipliers of the nonnegativity constraints on X and Y. It is straightforward to obtain its gradient functions as follows:

∇_X L = A^T A X Y Y^T − A^T A Y^T + 2ρ X (X^T X − I_p) − α,
∇_Y L = X^T A^T A X Y − X^T A^T A − β.

Setting the partial derivatives with respect to X and Y to zero, we obtain

α = A^T A X Y Y^T − A^T A Y^T + 2ρ X (X^T X − I_p), β = X^T A^T A X Y − X^T A^T A.

Noting the complementary slackness conditions α ⊙ X = 0 and β ⊙ Y = 0, the point (X, Y) is a stationary point of (4) if it satisfies the KKT conditions

X ≥ 0, Y ≥ 0, (5)
[A^T A X Y Y^T − A^T A Y^T + 2ρ X (X^T X − I_p)] ⊙ X = 0, (6)
[X^T A^T A X Y − X^T A^T A] ⊙ Y = 0. (7)

According to (6) and (7), we can obtain the following iterations

X_ij ← X_ij − (X_ij / [A^T A X Y Y^T + 2ρ X X^T X]_ij) [A^T A X Y Y^T − A^T A Y^T + 2ρ X (X^T X − I_p)]_ij, (8)
Y_ij ← Y_ij − (Y_ij / [X^T A^T A X Y]_ij) [X^T A^T A X Y − X^T A^T A]_ij, (9)

which are equivalent to the following update formulations

X_ij ← X_ij [A^T A Y^T + 2ρ X]_ij / [A^T A X Y Y^T + 2ρ X X^T X]_ij, (10)
Y_ij ← Y_ij [X^T A^T A]_ij / [X^T A^T A X Y]_ij. (11)

However, the iterative formulae (10) and (11) have two drawbacks, as follows.
(1) The denominator of (10) or (11) may be zero, which violates the rules of fractional operation.
(2) The convergence cannot be guaranteed under the update rules (10) and (11).
In order to overcome these difficulties, we designed the following safeguarded iterative scheme:

X_ij^{(k+1)} = max{X_ij^{(k)}, σ} [A^T A (Y^{(k)})^T + 2ρ X^{(k)}]_ij / max{[A^T A X^{(k)} Y^{(k)} (Y^{(k)})^T + 2ρ X^{(k)} (X^{(k)})^T X^{(k)}]_ij, σ}, (13)

Y_ij^{(k+1)} = max{Y_ij^{(k)}, σ} [(X^{(k+1)})^T A^T A]_ij / max{[(X^{(k+1)})^T A^T A X^{(k+1)} Y^{(k)}]_ij, σ}. (14)

Here, σ is a small positive number which guarantees the positivity of every element of X and Y and keeps the denominators away from zero. Thus, we can establish a new algorithm for solving Problem 1 as follows (Algorithm 1).

Algorithm 1: An iterative algorithm for solving Problem 1.
Input: Data matrix A ∈ R^{n×m}, the number of selected features p, parameters σ, ρ and δ.
Output: An index set of selected features I ⊆ {1, 2, ..., m} with |I| = p.
1. Initialize X^{(0)} ≥ 0 and Y^{(0)} ≥ 0; set k := 0.
2. Repeat
3. Fix Y and update X by the safeguarded multiplicative rule;
4. Fix X and update Y by the safeguarded multiplicative rule;
5. Until the convergence condition has been satisfied; otherwise set k := k + 1 and turn to step 3.
6. Write X = (x_1, x_2, ..., x_m)^T. Compute ‖x_i‖ and sort the values in descending order to choose the top p features.
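The loop above can be sketched in NumPy. This is a minimal sketch under assumptions, not the authors' implementation: the exact safeguarded update formulas, the random initialization, and the parameter values are illustrative choices.

```python
import numpy as np

def algorithm1(A, p, rho=1e4, sigma=1e-4, iters=200, seed=0):
    """Sketch of Algorithm 1: alternating safeguarded multiplicative updates
    for min 0.5*||A - A X Y||_F^2 + 0.5*rho*||X.T X - I||_F^2, X, Y >= 0."""
    m = A.shape[1]
    rng = np.random.default_rng(seed)
    X = rng.random((m, p))
    Y = rng.random((p, m))
    G = A.T @ A                      # Gram matrix, reused in every update
    for _ in range(iters):
        # Fix Y, update X (max{., sigma} keeps iterates and denominators positive)
        numX = G @ Y.T + 2 * rho * X
        denX = np.maximum(G @ X @ (Y @ Y.T) + 2 * rho * X @ (X.T @ X), sigma)
        X = np.maximum(X, sigma) * numX / denX
        # Fix X, update Y
        numY = X.T @ G
        denY = np.maximum(X.T @ G @ X @ Y, sigma)
        Y = np.maximum(Y, sigma) * numY / denY
    # Rank features by the row norms of X; keep the indices of the top p
    order = np.argsort(np.linalg.norm(X, axis=1))[::-1]
    return order[:p], X, Y

# Toy usage: nonnegative data with m = 6 features, select p = 2
A = np.abs(np.random.default_rng(1).random((30, 6)))
I, X, Y = algorithm1(A, 2)
```

Precomputing the Gram matrix A^T A once keeps the per-iteration cost independent of the number of samples n, which matters for the image datasets used later.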
The sequences {X^{(k)}} and {Y^{(k)}} generated by Algorithm 1 have the following property.

Theorem 1. Suppose X^{(0)} > 0 and Y^{(0)} > 0. Then every element of X^{(k)} and Y^{(k)} generated by Algorithm 1 is positive for all k ≥ 0.

Proof. It is obvious that the conclusion is true when k = 0. Now we consider the case k > 0. Noting that max{X_ij^{(k)}, σ} ≥ σ > 0 and that the denominator of the update for X is bounded below by σ, every factor in the expression for X_ij^{(k+1)} is positive. Based on the fact that max{Y_ij^{(k)}, σ} ≥ σ > 0, the same argument applies to Y^{(k+1)}. The conclusion follows by induction.

Convergence Analysis
In this section, we will give the convergence theorem for Algorithm 1. For the objective function of Problem (4), we first prove that

F(X^{(k+1)}, Y^{(k+1)}) ≤ F(X^{(k)}, Y^{(k)}),

where X^{(k)} and Y^{(k)} are the k-th iterates of Algorithm 1; we then show that the limit point is a stationary point of Problem (4). In order to develop this section, we need a lemma.

Lemma 1 ([21]). If there exists a function G(u, u') of H(u) satisfying

G(u, u') ≥ H(u), G(u, u) = H(u), (16)

then H(u) is non-increasing under the update rule u^{(k+1)} = arg min_u G(u, u^{(k)}). Indeed, H(u^{(k+1)}) ≤ G(u^{(k+1)}, u^{(k)}) ≤ G(u^{(k)}, u^{(k)}) = H(u^{(k)}).
Here G(u, u ) is called an auxiliary function of H(u) if it satisfies (16).
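A tiny numeric illustration of Lemma 1 (the quadratic H and the curvature constant below are toy choices, not taken from the paper): any G that touches H at the current point and lies above it elsewhere forces monotone descent of H.

```python
# Toy illustration of Lemma 1. H is a simple quadratic; G(u, u') is the same
# first-order model with inflated curvature c >= H'' = 4, so G(u, u') >= H(u)
# for all u and G(u', u') = H(u').
def H(u):
    return 0.5 * 4.0 * u * u - 3.0 * u

def next_iterate(u_prev, c=10.0):
    grad = 4.0 * u_prev - 3.0          # H'(u_prev)
    return u_prev - grad / c           # argmin_u G(u, u_prev)

u, vals = 5.0, []
for _ in range(30):
    vals.append(H(u))
    u = next_iterate(u)
# H never increases, and u approaches the true minimizer 3/4
assert all(b <= a + 1e-12 for a, b in zip(vals, vals[1:]))
print(round(u, 3))   # 0.75
```

This is exactly the mechanism used below: the auxiliary function replaces the true Hessian with a dominating (here, larger) curvature, so minimizing it can never increase the objective.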

Theorem 2. Fixing X, the objective function F(X, Y) is non-increasing under the update rule of Algorithm 1, that is, F(X, Y^{(k+1)}) ≤ F(X, Y^{(k)}).

Proof. Set A = [f_1, f_2, ..., f_m] and Y = (y_1, y_2, ..., y_m); then

‖A − AXY‖_F^2 = Σ_{j=1}^{m} ‖f_j − AX y_j‖^2.

Since X is fixed, we can ignore the constant penalty term and treat each column separately, so it suffices to show that

H(y) = (1/2) ‖f − AX y‖^2

is a non-increasing function of the iteration. Noting that H(y) is a quadratic function, its second-order Taylor expansion at y^{(k)} is exact:

H(y) = H(y^{(k)}) + (y − y^{(k)})^T ∇H(y^{(k)}) + (1/2)(y − y^{(k)})^T ∇^2 H(y^{(k)}) (y − y^{(k)}),

where ∇H(y^{(k)}) = −X^T A^T f + X^T A^T A X y^{(k)} and ∇^2 H(y^{(k)}) = X^T A^T A X. Define the auxiliary function

G(y, y^{(k)}) = H(y^{(k)}) + (y − y^{(k)})^T ∇H(y^{(k)}) + (1/2)(y − y^{(k)})^T P (y − y^{(k)}),

where P is the diagonal matrix with P_ii = [X^T A^T A X y^{(k)}]_i / max{y_i^{(k)}, δ}. In fact, we can prove that the matrix P − X^T A^T A X is a positive semi-definite matrix.

Case I. When ∇H(y^{(k)}) ≥ 0, or ∇H(y^{(k)}) < 0 but δ ≤ y_i^{(k)} componentwise, we have P_ii = [X^T A^T A X y^{(k)}]_i / y_i^{(k)}. Writing Q = X^T A^T A X, for any nonzero vector z = (z_1, z_2, ..., z_p)^T ∈ R^p we have

z^T (P − Q) z = Σ_{i,j} Q_ij (y_j^{(k)} z_i^2 / y_i^{(k)} − z_i z_j) = (1/2) Σ_{i,j} Q_ij (y_j^{(k)} z_i^2 / y_i^{(k)} + y_i^{(k)} z_j^2 / y_j^{(k)} − 2 z_i z_j) ≥ 0.

The last inequality is true due to the nonnegativity of X; moreover, the data matrix A generally has practical significance, so its elements are usually nonnegative, and therefore the matrix Q = X^T A^T A X is also a nonnegative matrix. Thus we obtain G(y, y^{(k)}) ≥ H(y).

Case II. When ∇H(y^{(k)}) < 0 and δ > y_i^{(k)}, we can use the same technique to verify that the matrix P − X^T A^T A X is positive semi-definite. According to Lemma 1, we obtain that H(y) is a non-increasing function of the iteration, so F(X, Y) is non-increasing when X is fixed. Therefore, we can obtain

F(X, Y^{(k+1)}) ≤ F(X, Y^{(k)}). (20)

This completes the proof.
Similarly, we can use the same method to verify that, when Y is fixed, the function F(X, Y) is also non-increasing. Thus, we have

F(X^{(k+1)}, Y^{(k)}) ≤ F(X^{(k)}, Y^{(k)}). (21)

Consequently, by (20) and (21), we obtain

F(X^{(k+1)}, Y^{(k+1)}) ≤ F(X^{(k+1)}, Y^{(k)}) ≤ F(X^{(k)}, Y^{(k)}).

Theorem 3. The sequence {(X^{(k)}, Y^{(k)})} generated by Algorithm 1 converges to a stationary point of Problem 1.
Proof. Because of the continuity and monotonicity of the function F, we can obtain lim_{k→∞} (X^{(k)}, Y^{(k)}) = (X^*, Y^*).
Now we will prove that the point (X^*, Y^*) is a stationary point of Problem 1; that is, we will prove that (X^*, Y^*) satisfies the KKT conditions (5). We first prove (24) and (25). Based on the definition of Y_ij in (14), the sequence {Y_ij^{(k)}} may have two convergent points, Y^*_ij or σ. Furthermore, according to (25), we have (24). Now we begin to prove (25). If the result were not true, there would exist a pair (i, j) for which it fails, which contradicts (26); hence, (25) holds. Then (24) and (25) imply that Y^* satisfies the KKT conditions (5). In a similar way, we can prove that X^* satisfies them as well. Hence, (X^*, Y^*) is a stationary point of Problem 1.

Numerical Experiments
In this section, we first present a simple example to illustrate that Algorithm 1 is feasible for solving Problem 1, and we then apply Algorithm 1 to unsupervised feature selection. We also compare our algorithm with the MaxVar algorithm [22], the UDFS algorithm [23] and the MFFS algorithm [24]. All experiments were performed in MATLAB R2014a on a PC with an Intel Core i5 processor at 2.50 GHz with a machine precision of ε = 2.22 × 10^{-16}. We define the gradient value GV(X, Y) as a measure of the violation of the KKT conditions (5). Due to the KKT conditions (5), we know that if GV(X, Y) = 0 then (X, Y) is a stationary point of Problem 1. Therefore, we stop Algorithm 1 when either GV(X, Y) ≤ 1.0 × 10^{-4} or the iteration count k reaches the upper limit of 500. This example shows that Algorithm 1 is feasible for solving Problem 1.
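The exact formula for GV is not reproduced above; one common KKT-violation measure, sketched here as an assumption, is the norm of the elementwise minimum of the variable and its gradient, which vanishes precisely at points satisfying the KKT conditions (5).

```python
import numpy as np

def gradient_value(A, X, Y, rho):
    """An assumed KKT-violation measure for the penalized problem (4):
    min(V, grad) is zero elementwise iff V >= 0, grad >= 0, V .* grad = 0."""
    G = A.T @ A
    gX = G @ X @ (Y @ Y.T) - G @ Y.T + 2 * rho * X @ (X.T @ X - np.eye(X.shape[1]))
    gY = X.T @ G @ X @ Y - X.T @ G
    return np.linalg.norm(np.minimum(X, gX)) + np.linalg.norm(np.minimum(Y, gY))

# Stopping-rule sketch: accept (X, Y) once GV drops below the tolerance
A = np.abs(np.random.default_rng(0).random((10, 5)))
X = np.abs(np.random.default_rng(1).random((5, 2)))
Y = np.abs(np.random.default_rng(2).random((2, 5)))
gv = gradient_value(A, X, Y, rho=1.0)
converged = gv <= 1.0e-4
```

At a random point such as this one, gv is positive; the value decreases as the iterates approach a KKT point.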

Dataset
In the next stage of our study, we used standard databases to test the performance of the proposed algorithm. We first describe the four datasets used; sample images are shown in Figure 1, and their characteristics are summarized in Table 1.

1. COIL20 (accessed on 1 December 2021): This dataset contains 20 objects. The images of each object were taken 5 degrees apart as the object was rotated on a turntable, and for each object there are 72 images. The size of each image is 32 × 32 pixels, with 256 grey levels per pixel. Thus, each image is represented by a 1024-dimensional vector.
2. PIE (http://archive.ics.uci.edu/ml/datasets.php (accessed on 1 December 2021)): This is a face image dataset with 53 different people; for each subject, 22 pictures were taken under different lighting conditions with different postures and expressions.
3. ORL (http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html (accessed on 1 December 2021)): This dataset contains ten different images of each of 40 distinct subjects. For some subjects, the images were taken at different times, varying the lighting, facial expressions (open/closed eyes, smiling/not smiling) and facial details (glasses/no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement).
4. Yale (http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html (accessed on 1 December 2021)): This dataset contains 165 grayscale images of 15 individuals. There are 11 images per subject, with different facial expressions (happy, sad, surprised, sleepy, normal and wink) or lighting conditions (center-light, left-light, right-light).

Using the above four datasets, we took the grayscale images as the data matrix A, initialized the matrices X and Y, and obtained a sequence of objective function values through the iterative updates of the algorithm. The resulting convergence curves for the four databases are shown in Figure 2.

We compared our algorithm with the following three methods.

1. MaxVar [22]: selects features according to their variance; a feature with a higher variance than the others is considered more important.
2. UDFS [23]: an l_{2,1}-norm regularized discriminative feature selection method, which selects the most distinctive features using the local discriminative information of the data and the correlation between features.
3. MFFS [24]: unsupervised feature selection via matrix factorization, in which the objective function originates from the distance between two subspaces.

Parameter Settings
In our proposed method, the parameter ρ was selected from the set {10, 10^2, ..., 10^9}. In the following experiments, we set the value of ρ to 10^4, 10^9, 10^8 and 10^7 for the COIL20, PIE, ORL and Yale databases, respectively. The numbers of selected features were taken from {20, 40, 60, ..., 200} for all datasets. Then, we computed the average value of ACC and NMI when selecting different numbers of features. The maximum iteration number (maxiter) was set to 1000. For the sparsity parameters γ and λ in UDFS, we set γ = λ = 10^{-5}. The value of the parameter ρ in MFFS was set to 10^8. We set σ = δ = ε = 10^{-4} in our proposed algorithm. The results of the comparison of MaxVar, UDFS, MFFS and our proposed algorithm are presented in Figures 3 and 4. The following two tables give the average accuracy and normalized mutual information calculated using our proposed algorithm.

Evaluation Metrics
There are two metrics to measure the results of clustering using the selected features: the accuracy of clustering (ACC) and the normalized mutual information (NMI). The values of ACC and NMI scale between 0 and 1, and a high value indicates an efficient clustering result. Every dataset has two parts, fea and gnd: the fea data are used to perform the selection, and after clustering one obtains a cluster label for each sample, denoted by s_i; gnd contains the true labels, denoted by t_i. The ACC can be computed from the cluster labels and the true labels as

ACC = (1/n) Σ_{i=1}^{n} δ(t_i, map(s_i)),

where δ(x, y) = 1 if x = y and 0 otherwise, and map(·) indicates a mapping that permutes the labels of the clustering result to match the true labels as well as possible, using the Kuhn-Munkres algorithm [25]. For two variables P and Q, the NMI is defined in the following form:

NMI(P, Q) = I(P; Q) / sqrt(H(P) H(Q)),

where I(P; Q) denotes the mutual information between P and Q and H(·) denotes the entropy. Figure 2 shows the curves of the objective function value against the iteration step when Algorithm 1 was run on the four datasets. We can see that the objective function value decreases as the iteration step increases. After a finite number of iterations, the objective function value reaches its minimum and tends to be stable.
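Both metrics can be computed in a few lines; SciPy's `linear_sum_assignment` provides the Kuhn-Munkres matching used by map(·). The helper names and the toy labels below are ours, not from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_acc(true, pred):
    """ACC: best one-to-one match between cluster labels and true labels,
    found with the Kuhn-Munkres (Hungarian) algorithm."""
    true, pred = np.asarray(true), np.asarray(pred)
    k = max(true.max(), pred.max()) + 1
    w = np.zeros((k, k), dtype=int)          # contingency table
    for t, s in zip(true, pred):
        w[s, t] += 1
    row, col = linear_sum_assignment(-w)     # negate to maximize matches
    return w[row, col].sum() / true.size

def nmi(true, pred):
    """NMI(P, Q) = I(P; Q) / sqrt(H(P) H(Q))."""
    true, pred = np.asarray(true), np.asarray(pred)
    n = true.size
    pt, pp = np.bincount(true) / n, np.bincount(pred) / n
    mi = 0.0
    for i, pi in enumerate(pp):
        for j, pj in enumerate(pt):
            pij = np.mean((pred == i) & (true == j))
            if pij > 0:
                mi += pij * np.log(pij / (pi * pj))
    h = lambda p: -sum(q * np.log(q) for q in p if q > 0)
    return mi / np.sqrt(h(pt) * h(pp))

t = [0, 0, 1, 1, 2, 2]
s = [1, 1, 2, 2, 0, 0]   # a pure relabelling of t: both metrics equal 1
print(clustering_acc(t, s), nmi(t, s))
```

Note that both metrics are invariant to permuting the cluster labels, which is exactly why the map(·) step is needed for ACC.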

Experimental Results and Analysis
In Tables 2 and 3, we report the best clustering accuracy and the best normalized mutual information as the number of selected features changes. From Tables 2 and 3, we can see that Algorithm 1 performed more effectively than the MaxVar algorithm [22], the UDFS algorithm [23] and the MFFS algorithm [24] on all data sets, which shows the effectiveness and robustness of our proposed method. In Figures 3 and 4, we present the clustering accuracy and the normalized mutual information as the number of selected features changes. We can see that the performance of MFFS was slightly better than that of Algorithm 1 on COIL20. However, on the other three datasets, Algorithm 1 was relatively more efficient than the MaxVar algorithm [22], the UDFS algorithm [23] and the MFFS algorithm [24], especially when the number of selected features was large.

Conclusions
The symmetric matrix optimization problem arising in unsupervised feature selection was considered in this paper. By relaxing the orthogonal constraint, this problem was converted into a constrained symmetric nonnegative matrix optimization problem. An efficient algorithm was designed to solve this problem, and its convergence theorem was also derived. Finally, a simple example was given to verify the feasibility of the new method, and some simulation experiments in unsupervised feature selection showed that our algorithm is more effective than the existing methods.

Data Availability Statement:
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest:
The author declares no conflict of interest.