Robust Feature Selection Method Based on Joint L2,1 Norm Minimization for Sparse Regression
Abstract
1. Introduction
2. Related Works
2.1. Related Literature
2.2. Linear Regression Based on Least Squares Method
2.3. Ridge Regression, Lasso Regression, and L2,1 Norm
3. Robust Joint Sparse Regression Model
3.1. Establishment of Robust Joint Sparse Regression Model
3.2. The Solution of the Robust Joint Sparse Regression Model
Algorithm 1: Iterative process of solving matrix P
Input: Data M, N, and P
Output: Matrix P
1: Use the latest P to calculate D according to Equation (27);
2: Use the latest D to update P according to Equation (30);
3: Repeat steps 1 and 2 until convergence;
4: Output the current P.
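For readers who want to experiment with Algorithm 1, the NumPy sketch below implements the alternating D/P updates as an iteratively reweighted least-squares loop. Because Equations (27) and (30) are not reproduced in this listing, the sketch assumes the updates commonly used in joint L2,1-norm minimization (cf. [27]): D is diagonal with entries 1/(2·||p_i||_2), where p_i is the i-th row of P, and P solves a regularized linear system of the form (M + λD)P = N. The regularization weight `lam` and the smoothing constant `eps` are illustrative placeholders, not quantities defined in the paper.

```python
import numpy as np

def solve_P(M, N, lam=1.0, P0=None, eps=1e-8, max_iter=100, tol=1e-6):
    """Sketch of Algorithm 1: alternate between the reweighting matrix D and P.

    Assumed stand-ins for the paper's equations:
      D_ii = 1 / (2 * ||p_i||_2 + eps)   (stand-in for Equation (27))
      P    = (M + lam * D)^(-1) N        (stand-in for Equation (30))
    where p_i is the i-th row of P.
    """
    d = M.shape[0]
    if P0 is None:
        P = np.linalg.solve(M + lam * np.eye(d), N)   # initial P with D = I
    else:
        P = P0.copy()
    for _ in range(max_iter):
        row_norms = np.linalg.norm(P, axis=1)          # ||p_i||_2 for every row of P
        D = np.diag(1.0 / (2.0 * row_norms + eps))     # Step 1: update D from the latest P
        P_new = np.linalg.solve(M + lam * D, N)        # Step 2: update P from the latest D
        if np.linalg.norm(P_new - P) / (np.linalg.norm(P) + eps) < tol:
            P = P_new
            break                                      # Step 3: stop at convergence
        P = P_new
    return P                                           # Step 4: output the current P
```

The small constant `eps` keeps the reweighting well defined when a row of P shrinks toward zero, which is the standard safeguard in this family of algorithms.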
Algorithm 2: Robust Joint Sparse Regression Algorithm
Input: Training data X, labels Y, and the regularization parameter
Output: Projection matrix W and elastic variable b
1: Randomly initialize W and b;
2: Calculate the diagonal matrix according to Equation (11);
3: Use the latest diagonal matrix together with X, Y, and W to update Q according to Equation (19);
4: Use the latest diagonal matrix and X to update M according to Equation (22);
5: Use the latest diagonal matrix and Y to update N according to Equation (23);
6: Use the values of W and Q to calculate P according to Equation (21);
7: Use the values of P, M, and N to update P according to Algorithm 1;
8: Update W with the latest P;
9: Update b with the latest W;
10: Repeat steps 2 to 9 until convergence.
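The outer loop of Algorithm 2 can be organized as a simple alternating scheme around the inner solver sketched above. The skeleton below is only a structural sketch: the per-step updates (Equations (11), (19), (21)-(23) and the closed-form updates of W and b) are passed in as user-supplied callables because their exact forms are not reproduced here. The dictionary `updates`, its keys, the parameter `lam`, and the assumed data layout (X as a d-by-n feature matrix, Y as an n-by-c label matrix) are illustrative assumptions, not definitions from the paper.

```python
import numpy as np

def robust_joint_sparse_regression(X, Y, lam, updates, max_iter=50, tol=1e-5):
    """Structural sketch of Algorithm 2 (not the paper's exact equations).

    `updates` supplies callables standing in for the paper's formulas:
      updates['diag'](X, Y, W, b) -> diagonal weighting matrix   (Equation (11))
      updates['Q'](G, X, Y, W)    -> Q                           (Equation (19))
      updates['M'](G, X)          -> M                           (Equation (22))
      updates['N'](G, Y)          -> N                           (Equation (23))
      updates['P_init'](W, Q)     -> initial P                   (Equation (21))
      updates['W'](P)             -> W                           (step 8)
      updates['b'](W, X, Y)       -> b                           (step 9)
    """
    d, c = X.shape[0], Y.shape[1]
    rng = np.random.default_rng(0)
    W = rng.standard_normal((d, c))            # Step 1: random initialization of W and b
    b = rng.standard_normal(c)

    for _ in range(max_iter):
        G = updates['diag'](X, Y, W, b)        # Step 2: diagonal matrix
        Q = updates['Q'](G, X, Y, W)           # Step 3
        M = updates['M'](G, X)                 # Step 4
        N = updates['N'](G, Y)                 # Step 5
        P = updates['P_init'](W, Q)            # Step 6: initial P
        P = solve_P(M, N, lam, P0=P)           # Step 7: inner loop (Algorithm 1 sketch above)
        W_new = updates['W'](P)                # Step 8
        b = updates['b'](W_new, X, Y)          # Step 9
        if np.linalg.norm(W_new - W) < tol:    # Step 10: stop when W stabilizes
            W = W_new
            break
        W = W_new
    return W, b
```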
4. Results
4.1. Experiments on the JAFFE Dataset
4.2. Experiments on the CMU PIE Dataset
4.3. Experiments on the YaleB Dataset
4.4. Convergence Analysis
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Ding, C.; Peng, H. Minimum redundancy feature selection from microarray gene expression data. J. Bioinform. Comput. Biol. 2005, 2, 185–205.
- Remeseiro, B.; Bolón-Canedo, V. A review of feature selection methods in medical applications. Comput. Biol. Med. 2019, 122, 103375.
- Last, M.; Kandel, A.; Maimon, O. Information-theoretic algorithm for feature selection. Pattern Recognit. Lett. 2001, 22, 799–811.
- Koller, D.; Sahami, M. Toward Optimal Feature Selection. In Proceedings of the 13th International Conference on Machine Learning, Bari, Italy, 3–6 July 1996; Volume 9, pp. 284–292.
- Kira, K.; Rendell, L.A. A practical approach to feature selection. In Machine Learning Proceedings 1992; Sleeman, D., Edwards, P., Eds.; Morgan Kaufmann: San Mateo, CA, USA, 1992; pp. 249–256.
- Blum, A.L.; Langley, P. Selection of relevant features and examples in machine learning. Artif. Intell. 1997, 97, 245–271.
- Khaire, U.M.; Dhanalakshmi, R. Stability of feature selection algorithm: A review. J. King Saud Univ.—Comput. Inf. Sci. 2022, 34, 1060–1073.
- Lan, G.; Hou, C.; Nie, F.; Luo, T.; Yi, D. Robust feature selection via simultaneous capped norm and sparse regularizer minimization. Neurocomputing 2018, 283, 228–240.
- Dash, M.; Choi, K.; Scheuermann, P.; Liu, H. Feature Selection for Clustering—A Filter Solution. In Proceedings of the 2002 IEEE International Conference on Data Mining, Maebashi City, Japan, 9–12 December 2002; pp. 115–122.
- Huang, Q.; Tao, D.; Li, X.; Jin, L.; Wei, G. Exploiting Local Coherent Patterns for Unsupervised Feature Ranking. IEEE Trans. Syst. Man Cybern. 2011, 41, 1471–1482.
- Kohavi, R.; John, G.H. Wrappers for Feature Subset Selection. Artif. Intell. 1997, 97, 273–324.
- Guyon, I.; Elisseeff, A. An Introduction to Variable and Feature Selection. J. Mach. Learn. Res. 2003, 3, 1157–1182.
- Hou, C.; Nie, F.; Yi, D.; Wu, Y. Feature Selection via Joint Embedding Learning and Sparse Regression. In Proceedings of the International Joint Conference on Artificial Intelligence, Barcelona, Spain, 16–22 July 2011; pp. 1324–1329.
- Zhao, Z.; Wang, L.; Liu, H. Efficient Spectral Feature Selection with Minimum Redundancy. In Proceedings of the AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, 11–15 July 2010; pp. 673–678.
- Meng, L. Embedded feature selection accounting for unknown data heterogeneity. Expert Syst. Appl. 2019, 119, 350–361.
- Tibshirani, R. Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc. B 1996, 58, 267–288.
- Donoho, D.L. For most large underdetermined systems of linear equations the minimal L1-norm solution is also the sparsest solution. Commun. Pure Appl. Math. 2006, 59, 797–829.
- Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. B 2005, 67, 301–320.
- Hou, C.; Jiao, Y.; Nie, F.; Luo, T.; Zhou, Z.H. 2D Feature Selection by Sparse Matrix Regression. IEEE Trans. Image Process. 2017, 26, 4255–4268.
- Mo, D.; Lai, Z. Robust Jointly Sparse Regression with Generalized Orthogonal Learning for Image Feature Selection. Pattern Recognit. 2019, 93, 164–178.
- Lemhadri, I.; Tibshirani, R.; Hastie, T. LassoNet: A neural network with feature sparsity. J. Mach. Learn. Res. 2021, 22, 5633–5661.
- Li, K.; Wang, F.; Yang, L.; Liu, R. Deep Feature Screening: Feature Selection for Ultra High-Dimensional Data via Deep Neural Networks. Neurocomputing 2023, 10, 142–149.
- Li, K. Variable Selection for Nonlinear Cox Regression Model via Deep Learning. arXiv 2022, arXiv:2211.09287.
- Chen, C.; Weiss, S.T.; Liu, Y.Y. Graph Convolutional Network-based Feature Selection for High-dimensional and Low-sample Size Data. Bioinformatics 2023, 39, btad135.
- Liu, J.; Cosman, P.C.; Rao, B.D. Robust Linear Regression via L0 Regularization. IEEE Trans. Signal Process. 2018, 66, 698–713.
- Ding, C.; Zhou, D.; He, X.; Zha, H. R1-PCA: Rotational invariant L1-norm principal component analysis for robust subspace factorization. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 281–288.
- Nie, F.; Huang, H.; Cai, X.; Ding, C. Efficient and Robust Feature Selection via Joint L2,1-Norms Minimization. Adv. Neural Inf. Process. Syst. 2010, 23, 1813–1821.
- Lai, Z.; Mo, D.; Wen, J.; Shen, L.; Wong, W.K. Generalized Robust Regression for Jointly Sparse Subspace Learning. IEEE Trans. Circuits Syst. Video Technol. 2019, 29, 756–772.
- Lai, Z.; Xu, Y.; Yang, J.; Shen, L.; Zhang, D. Rotational Invariant Dimensionality Reduction Algorithms. IEEE Trans. Cybern. 2017, 47, 3733–3746.
- Lai, Z.; Liu, N.; Shen, L.; Kong, H. Robust Locally Discriminant Analysis via Capped Norm. IEEE Access 2019, 7, 4641–4652.
- Ye, Y.-F.; Shao, Y.-H.; Deng, N.-Y.; Li, C.-N.; Hua, X.-Y. Robust Lp-norm least squares support vector regression with feature selection. Appl. Math. Comput. 2017, 305, 32–52.
- Xu, J.; Shen, Y.; Liu, P.; Xiao, L. Hyperspectral Image Classification Combining Kernel Sparse Multinomial Logistic Regression and TV-L1 Error Rejection. Acta Electron. Sin. 2018, 46, 175–184.
- Chang, A.; Wang, M.; Allen, G.I. Sparse regression for extreme values. Electron. J. Stat. 2021, 15, 5995–6035.
- Hou, C.; Nie, F.; Li, X.; Yi, D.; Wu, Y. Joint embedding learning and sparse regression: A framework for unsupervised feature selection. IEEE Trans. Cybern. 2014, 44, 793–804.
- Chen, X.; Lu, Y. Robust graph regularised sparse matrix regression for two-dimensional supervised feature selection. IET Image Process. 2020, 14, 1740–1749.
- Lukman, A.F.; Kibria, B.M.G.; Nziku, C.K.; Amin, M.; Adewuyi, E.T.; Farghali, R. K-L Estimator: Dealing with Multicollinearity in the Logistic Regression Model. Mathematics 2023, 11, 340.
- Kibria, B.M.G.; Månsson, K.; Shukur, G. Performance of Some Logistic Ridge Regression Estimators. Comput. Econ. 2012, 40, 401–414.
- Hoerl, A.E.; Kennard, R.W. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 1970, 12, 55–67.
- Obozinski, G.; Taskar, B.; Jordan, M.I. Multi-Task Feature Selection; Technical Report; Department of Statistics, University of California: Berkeley, CA, USA, 2006.
- Huang, H.; Ding, C. Robust tensor factorization using R1 norm. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8.
- Yang, Y.; Shen, H.T.; Ma, Z.; Huang, Z.; Zhou, X. L2,1-Norm Regularized Discriminative Feature Selection for Unsupervised Learning. In Proceedings of the International Joint Conference on Artificial Intelligence, Barcelona, Spain, 16–22 July 2011; pp. 1589–1594.
- Zou, H.; Hastie, T.J.; Tibshirani, R. Sparse Principal Component Analysis. J. Comput. Graph. Stat. 2006, 15, 265–286.
- Schölkopf, B.; Smola, A.; Müller, K. Kernel Principal Component Analysis. In Proceedings of the International Conference on Artificial Neural Networks, Lausanne, Switzerland, 8–10 October 1997; pp. 583–588.
- He, X.; Niyogi, P. Locality Preserving Projections. In Proceedings of the Advances in Neural Information Processing Systems 16 (NIPS 2003), Vancouver, BC, Canada, 8–13 December 2003; pp. 153–160.
- He, X.; Cai, D.; Niyogi, P. Laplacian Score for Feature Selection. In Proceedings of the Advances in Neural Information Processing Systems 18 (NIPS 2005), Vancouver, BC, Canada, 5–8 December 2005; pp. 507–514.
- Cai, D.; Zhang, C.; He, X. Unsupervised feature selection for multi-cluster data. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 25–28 July 2010; pp. 333–342.
- Cover, T.M.; Hart, P.E. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27.
- Lyons, M.; Kamachi, M.; Gyoba, J. The Japanese Female Facial Expression (JAFFE) Dataset [Data set]. Zenodo, 1998. Available online: http://www.kasrl.org/jaffe.html (accessed on 15 July 2023).
- Sim, T.; Baker, S.; Bsat, M. The CMU Pose, Illumination, and Expression Database. IEEE Trans. Pattern Anal. Mach. Intell. 2003, 25, 1615–1618.
- Georghiades, A.S.; Belhumeur, P.N.; Kriegman, D.J. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 643–660.
Type of Problem | Reference | Purpose | Method | Comment |
---|---|---|---|---|
Feature extraction | [26] | To solve the problem that principal component analysis is sensitive to outliers when minimizing the sum of squared errors. | The R1-norm is proposed and used to reconstruct the PCA objective function. | R1-PCA effectively addresses the sensitivity of L2-norm-based loss functions to outliers, but the method mainly performs robust dimensionality reduction, i.e., feature extraction. |
Feature extraction | [29] | To solve the problem that traditional subspace learning methods based on the L2-norm metric are sensitive to noise. | The L2,1 norm is used to construct the objective function. | The rotation-invariant dimensionality reduction algorithm is robust and rotation invariant, but it has high computational complexity and is sensitive to the size of the dataset. |
Feature extraction | [30] | To solve the problem that traditional linear discriminant analysis is sensitive to noise, ignores the local geometric structure of the data, and limits the number of projections to the number of classes. | The L2,1 norm is used to construct the between-class scatter matrix and to impose joint sparsity on the projection matrix; the capped norm is used to further reduce the influence of outliers when constructing the within-class scatter matrix. | This method is mainly used for feature extraction. |
Feature selection | [27] | To solve the problem that traditional feature selection methods based on linear regression are sensitive to noise in the data. | The L2,1 norm is used to construct both the loss function and the regularization term so that feature selection is jointly sparse. | This method is mainly used for selecting meaningful features in bioinformatics tasks, and a potential risk of overfitting remains. |
Feature selection | [28] | To solve the problem that the number of projections in traditional ridge regression and its extensions is limited by the number of categories and that their robustness is poor. | The L2,1-norm constraint is applied to the loss function and the regularization term to achieve joint sparsity; the local geometric structure of the data is taken into account, and robustness is enhanced by introducing elastic factors into the loss function. | This method performs robust image feature selection, but its computational cost is high and imbalanced data are not considered. |
Feature selection | [34] | To solve the problem that traditional learning-based feature selection methods do not consider manifold learning and sparse regression jointly. | A graph weight matrix is introduced to reveal the manifold structure among data points, and joint sparsity of feature selection is achieved through L2,1-norm regularization. | This unsupervised feature selection framework combines the advantages of manifold learning and sparse regression, but parameter selection remains an open problem. |
Feature selection | [19] | To solve the curse of dimensionality caused by converting two-dimensional images into one-dimensional vectors in traditional image feature selection methods. | A matrix regression model accepts matrices as input and connects each matrix with its label; sparsity constraints derived from the intrinsic properties of the regression coefficients are designed for feature selection. | The regularization parameter selection of this method is complicated, but it provides a new perspective for the study of feature selection. |
Feature selection | [35] | To solve the problem that matrix-regression-based feature selection ignores the local geometric structure of the data. | An intra-class compactness graph based on manifold learning is used as the regularization term, and the L2,1 norm is used as the loss function to establish the matrix regression model. | This method learns both the left and right regression matrices while using label information to preserve the intrinsic geometry. However, it has notable limitations: noise can render the graph weight matrix invalid, and parameter tuning is time-consuming. |
Training Sample (%) | Ours | LASSO | LS | SPCA | JELSR | UDFS | LPP | ElasticNet | KPCA | RR | MCFS |
---|---|---|---|---|---|---|---|---|---|---|---|
40% | 71.86 ± 3.89 54 | 62.56 ± 6.34 82 | 63.49 ± 3.35 46 | 61.86 ± 6.55 50 | 62.56 ± 6.34 74 | 62.79 ± 7.07 50 | 61.16 ± 4.62 78 | 63.26 ± 1.56 74 | 61.86 ± 6.55 50 | 63.02 ± 1.91 74 | 61.40 ± 5.03 66 |
60% | 74.06 ± 3.92 50 | 66.25 ± 2.84 70 | 66.25 ± 3.60 62 | 67.19 ± 3.66 50 | 66.25 ± 5.25 70 | 65.63 ± 4.56 54 | 55.63 ± 7.46 58 | 66.88 ± 5.11 46 | 67.19 ± 3.66 50 | 66.56 ± 5.01 42 | 65.94 ± 6.85 78 |
80% | 77.67 ± 3.89 58 | 69.30 ± 5.55 70 | 67.91 ± 7.05 62 | 68.37 ± 7.09 50 | 69.30 ± 7.78 74 | 68.37 ± 5.84 66 | 63.72 ± 5.84 82 | 68.37 ± 4.23 50 | 68.37 ± 7.09 50 | 68.37 ± 4.23 50 | 68.37 ± 2.65 82 |
| Block Size | Ours | LASSO | LS | SPCA | JELSR | UDFS | LPP | ElasticNet | KPCA | RR | MCFS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 72.09 ± 3.68 70 | 49.30 ± 1.95 50 | 49.77 ± 4.82 74 | 47.91 ± 4.82 82 | 55.35 ± 8.29 78 | 55.81 ± 9.59 42 | 29.77 ± 16.39 74 | 46.05 ± 5.04 74 | 47.91 ± 4.82 82 | 46.51 ± 2.85 34 | 51.16 ± 1.64 78 |
| | 75.35 ± 5.35 50 | 53.02 ± 12.57 38 | 53.49 ± 10.00 78 | 53.95 ± 9.21 42 | 60.93 ± 9.21 38 | 62.79 ± 10.90 58 | 29.77 ± 18.12 78 | 54.88 ± 8.32 38 | 53.95 ± 9.21 42 | 55.35 ± 7.05 58 | 55.81 ± 9.73 70 |
| | 71.63 ± 1.95 78 | 43.84 ± 7.89 74 | 50.00 ± 6.45 82 | 47.90 ± 8.16 70 | 57.21 ± 6.90 66 | 58.61 ± 7.24 50 | 33.49 ± 14.20 78 | 46.98 ± 4.16 42 | 47.91 ± 8.16 70 | 46.98 ± 4.47 82 | 51.63 ± 6.66 82 |
Training Sample (%) | Ours | LASSO | LS | SPCA | JELSR | UDFS | LPP | ElasticNet | KPCA | RR | MCFS |
---|---|---|---|---|---|---|---|---|---|---|---|
40% | 65.50 ± 4.44 82 | 50.75 ± 9.54 82 | 41.17 ± 4.40 30 | 32.83 ± 3.60 82 | 32.75 ± 4.02 82 | 49.75 ± 2.22 38 | 51.08 ± 4.69 70 | 50.00 ± 4.44 82 | 32.83 ± 3.60 82 | 50.75 ± 3.83 78 | 50.50 ± 8.16 82 |
60% | 73.25 ± 4.41 78 | 56.63 ± 11.38 66 | 65.38 ± 2.28 26 | 36.75 ± 2.14 70 | 36.63 ± 2.92 62 | 57.25 ± 4.16 34 | 64.00 ± 2.85 78 | 60.13 ± 3.84 62 | 36.75 ± 2.14 70 | 61.00 ± 3.96 82 | 53.88 ± 9.43 82 |
80% | 85.25 ± 2.40 82 | 59.00 ± 11.16 34 | 54.75 ± 4.28 34 | 47.00 ± 4.81 82 | 46.50 ± 4.45 78 | 69.00 ± 10.58 42 | 74.75 ± 4.87 82 | 75.00 ± 3.19 70 | 47.00 ± 4.81 82 | 74.50 ± 4.20 78 | 61.75 ± 6.71 78 |
Training Sample (%) | Ours | LASSO | LS | SPCA | JELSR | UDFS | LPP | ElasticNet | KPCA | RR | MCFS |
---|---|---|---|---|---|---|---|---|---|---|---|
40% | 70.67 ± 4.81 82 | 60.42 ± 14.68 66 | 47.17 ± 11.30 14 | 36.75 ± 3.67 82 | 35.91 ± 4.27 82 | 61.42 ± 8.29 42 | 70.08 ± 5.45 82 | 64.83 ± 1.99 82 | 36.75 ± 3.67 82 | 66.25 ± 1.02 82 | 64.17 ± 6.99 82 |
60% | 79.00 ± 5.60 82 | 67.67 ± 17.18 62 | 42.63 ± 6.49 82 | 42.50 ± 6.43 78 | 42.54 ± 6.70 82 | 71.04 ± 5.14 38 | 78.17 ± 3.64 82 | 72.42 ± 3.35 82 | 42.50 ± 6.43 78 | 73.67 ± 2.62 82 | 67.25 ± 7.31 70 |
80% | 84.50 ± 3.38 78 | 70.50 ± 16.29 66 | 44.25 ± 6.59 74 | 44.50 ± 6.94 78 | 43.50 ± 7.09 82 | 71.00 ± 5.41 38 | 84.25 ± 3.26 82 | 79.75 ± 3.79 82 | 49.00 ± 5.48 82 | 81.50 ± 2.56 74 | 74.50 ± 1.68 78 |