1. Introduction
Quantitative stock selection involves identifying a suitable set of stock-selection indicators and developing models and algorithms that pick a portfolio of high-quality stocks at the right time, yielding stable and profitable returns. With its high stability and wide coverage, the quantitative stock-picking strategy has attracted broad interest in both academia and industry and has become one of the main investment approaches in finance. Early research on quantitative stock selection can be traced back to the 1950s, when Markowitz's [1] 'mean-variance' model became a significant milestone in modern portfolio theory. Sharpe et al. [2] proposed the Capital Asset Pricing Model (CAPM) based on Markowitz's theory. The CAPM advanced modern finance theory, but it is a one-factor model that focuses only on the quantitative relationship between risky asset returns and market risk.
The multifactor stock-picking strategy subsequently emerged as one of the most widely used quantitative investment strategies. Multifactor theory assumes that the excess return of an asset is driven by, and can therefore be explained by, many factors. The multifactor stock-selection model originates from the Arbitrage Pricing Theory (APT) proposed by Ross [3], which extends the one-dimensional linear model to a multivariate linear model and drops the assumptions of the CAPM. Fama et al. [4] then proposed the famous three-factor model, stating that differences in stock returns can be explained by the market, size, and value factors. Following the discovery of the momentum effect, Carhart [5] added a momentum factor to the three-factor model of Fama et al. [4] and constructed a four-factor model with explanatory power beyond that of the three-factor model. Fama et al. [6] added earnings and style factors to the previous model and proposed a five-factor model to further explain the excess returns of individual stocks. Stambaugh and Yuan [7] added a corporate-management factor and a stock-price-performance factor to explain asset returns from a behavioral finance perspective. Asness [8] found that companies in a good financial position show a favorable trend in their stock price. Although these models consider multiple factors from different perspectives, the proliferation of factors also poses technical challenges to traditional stock-selection methods, given the variability of financial markets: (1) In the factor-collection process, it is easy to introduce invalid variables that contribute nothing to the response, so the dimensionality p can become very large, even larger than the sample size n, increasing the difficulty of modeling. Fan and Lv [9] noted that the parameters in high-dimensional regression models tend to be sparse, i.e., most of the coefficients are zero. (2) Traditional factor-selection methods, such as Portfolio Sorts and Fama–MacBeth regressions, do not control the false discovery rate (FDR) of factor selection.
To solve the first problem, many researchers have adopted variable-selection algorithms based on sparse regularization. Such algorithms can distinguish invalid variables from valid ones, reduce the dimensionality of the problem, and improve the computational convenience and interpretability of the model. Most of them rely on linear or nonlinear additive regression models and add a regularization term to the optimization objective that makes the additive coefficients sparse; the nonzero coefficients then provide an estimate of the set of effective variables (Hastie & Tibshirani [10]; Lin & Zhang [11]; Chen et al. [12]). Classical variable-selection algorithms based on sparse regularization include Lasso (Tibshirani [13]), Group Lasso (Yuan & Lin [14]; Bach [15]), LassoNet (Lemhadri et al. [16]), SpAM (Ravikumar et al. [17]), and Elastic Net (Zou & Hastie [18]). Such regularization has good statistical properties and is robust, since the regression coefficients of the factors are estimated at the same time that the factors are selected. Wang [19] found that the Lasso model can effectively screen indicators and greatly improve a model's return and risk control. Li et al. [20] compressed the coefficients of 96 heterogeneous factors using Lasso and Ridge regression. Shu and Li [21] compared Logistic regression models with Elastic Net, SCAD, and MCP penalty terms and found that they improve factor-screening utility while preserving gains. The Lasso model with an L1 penalty constructed by Jagannathan and Ma [22] was effective at eliminating invalid factors and obtaining excess returns. Zou and Hastie [18] proposed the Elastic Net to overcome multicollinearity in high-dimensional data; it combines the advantages of the L1 and L2 penalties and screens features more effectively. It has also been shown that models with the Elastic Net penalty can filter factors more effectively while overcoming the Lasso model's tendency to overcompress the coefficient matrix.
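As a toy illustration of how an Elastic Net penalty performs factor selection inside a Logistic regression, the sketch below fits scikit-learn's elastic-net-penalized logistic model to synthetic data; the data-generating process, dimensions, and penalty settings are illustrative assumptions, not the paper's actual factor data or tuning.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 400, 30                       # toy dimensions, not the paper's factor pool
X = rng.standard_normal((n, p))
# Assume only the first 5 "factors" truly drive the binary response
# (e.g., whether a stock outperforms the index).
beta = np.zeros(p)
beta[:5] = [1.5, -1.2, 1.0, 0.8, -0.9]
y = (X @ beta + 0.5 * rng.standard_normal(n) > 0).astype(int)

# Elastic Net penalty: l1_ratio mixes L1 (sparsity) with L2 (stability under
# correlated factors); saga is the sklearn solver that supports this penalty.
model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.7, C=0.05, max_iter=5000, random_state=0)
model.fit(X, y)

# Nonzero coefficients are the "selected" factors.
selected = np.flatnonzero(np.abs(model.coef_.ravel()) > 1e-6)
print("selected factor indices:", selected)
```

With a strong enough penalty, the irrelevant columns are shrunk exactly to zero while the true drivers keep nonzero coefficients, which is the screening behavior the cited studies exploit.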
Although factor screening can be achieved through sparse regularization, it is difficult to determine whether the selected factors are the correct ones with real value and explanatory power. In the context of big data in finance, the data are highly variable and time-sensitive, so the results of factor selection may change constantly; this requires ensuring the accuracy of factor selection and controlling its FDR. The main conventional methods for controlling the FDR are the Benjamini–Hochberg method (BHq) (Benjamini & Hochberg [23]) and the Knockoff method (Barber & Candes [24]; Candes et al. [25]). The BHq method relies on p-values for FDR control, and most classical p-value algorithms depend on large-sample asymptotic theory, so when the sample size is limited and the dimensionality is high, p-values calculated with classical algorithms may no longer be reliable (Candes et al. [25]; Fan et al. [26]). In addition, the BHq method is guaranteed to control the FDR at a given level only when the explanatory variables X are orthogonal (Barber & Candes [24]), so it rests on strong assumptions and its use is limited.
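For concreteness, the BHq step-up procedure can be sketched in a few lines: sort the p-values, find the largest k with p_(k) ≤ (k/m)·α, and reject the k smallest. The p-values below are made up for illustration.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.1):
    """Indices of hypotheses rejected by BHq at target FDR level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Largest k with p_(k) <= (k/m) * alpha; reject the k smallest p-values.
    below = p[order] <= (np.arange(1, m + 1) / m) * alpha
    if not below.any():
        return np.array([], dtype=int)
    k = np.nonzero(below)[0].max()
    return np.sort(order[:k + 1])

# Toy p-values: the first few look like genuine discoveries.
pvals = [0.001, 0.008, 0.039, 0.041, 0.09, 0.2, 0.35, 0.5, 0.7, 0.9]
print(benjamini_hochberg(pvals, alpha=0.1))   # -> [0 1]
```

Note that the procedure consumes only p-values; when those p-values are unreliable (small n, large p), the FDR guarantee is undermined, which is exactly the limitation discussed above.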
To address this problem, Barber and Candes [24] proposed the Knockoff variable in 2015, combined it with a linear regression model to design an importance statistic (the Knockoff statistic), and provided a variable-selection algorithm based on this statistic. In recent years, the Knockoff method has surpassed the BHq method in popularity for controlling the FDR of variable selection, owing to its superior performance in both theory and practice. Numerous researchers have studied variable structure and dimensionality in relation to the construction of Knockoff variables. For the construction of Knockoff variables under a high-dimensional random design matrix X, Candes et al. [25] performed statistical inference with small samples. Gegout-Petit et al. [27] constructed Knockoff variables by randomly permuting the rows of X, which is also applicable to the case of n < p. Katsevich and Sabatti [28] grouped variables and proposed the Multilayer Knockoff Filter (MKF), which improves variable selection while reducing the number of false positive gene findings. Some researchers have combined feature screening with the Knockoff method. For example, Barber and Candes [29] split the data into two groups: the first performs feature screening, and the second constructs Knockoff variables from the screened features and performs efficient inference. Liu et al. [30] proposed PC-Knockoff, a model-free feature-screening method for high-dimensional projection-correlation settings. Other scholars have used the method to solve practical variable-selection problems. Dai and Barber [31] built a Group Lasso multitask regression containing the Knockoff method and applied it to identify drug-resistance mutations in HIV-1; the results showed that the model controls the false discovery rate well at the group level. Srinivasan et al. [32] constructed a linear logit model based on Knockoff, applied it to inflammatory bowel disease, and achieved good variable-selection results. Zhu and Zhao [33] combined the Knockoff method with deep neural networks and applied it to prostate cancer data, showing that the model achieves accurate group-level FDR control. In summary, Knockoff has been studied extensively, both in improving the construction of Knockoff variables and in biomedical applications, but in finance little research has applied it to factor selection. Taken together, the Knockoff method controls the FDR and improves the accuracy of feature selection. In this paper, we incorporate the Knockoff method into a multifactor stock-selection model to control the FDR of factor selection and to improve the predictive ability and robustness of the model.
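A minimal sketch may make the Knockoff mechanics concrete. The toy below builds knockoffs by row permutation (the construction attributed to Gegout-Petit et al. above, chosen here for brevity; Model-X Gaussian knockoffs are the standard choice), computes a coefficient-difference Knockoff statistic from an L1-penalized logistic fit on the augmented design, and applies the knockoff+ threshold. All data and tuning values are illustrative assumptions, not the paper's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, p = 500, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = [2.0, -1.8, 1.5, 1.2, -1.0]     # only the first 5 factors matter
y = (X @ beta + 0.5 * rng.standard_normal(n) > 0).astype(int)

# Knockoff copy by randomly permuting rows: each knockoff column keeps its
# marginal distribution but is independent of y.
X_ko = X[rng.permutation(n), :]

# L1-penalized logistic regression on the augmented design [X, X_ko].
fit = LogisticRegression(penalty="l1", solver="saga", C=0.05,
                         max_iter=5000, random_state=0)
fit.fit(np.hstack([X, X_ko]), y)
coef = np.abs(fit.coef_.ravel())
W = coef[:p] - coef[p:]     # Knockoff statistic: large positive => real signal

# Knockoff+ threshold: smallest t whose estimated FDP is at most q.
q = 0.25
T = np.inf
for t in np.sort(np.abs(W[W != 0])):
    fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
    if fdp_hat <= q:
        T = t
        break
selected = np.flatnonzero(W >= T)
print("selected factors:", selected)
```

Because a genuinely irrelevant variable and its knockoff are exchangeable, their importance scores are symmetrically distributed, and the sign-based threshold bounds the FDR at roughly q; this symmetry is what frees the method from the p-value asymptotics that BHq depends on.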
Model selection is another important element of multifactor stock-selection research; existing approaches can be divided into statistical models and machine learning models. Predicting exact returns is difficult, but predicting their classification (e.g., whether a stock outperforms the index) is relatively easy. The most representative statistical model is Logistic regression, which discriminates effectively in classification problems but performs less well on high-dimensional data. To improve its classification performance, we therefore add the Elastic Net penalty to the Logistic regression. Machine learning models can be trained on large amounts of sample data, but they lack interpretability; statistical models compensate for this drawback and are widely chosen for their good explanatory and predictive power.
In summary, this paper aims to improve the accuracy of factor selection and the investment return by combining Elastic Net with Logistic regression, controlling the FDR of factor selection with Knockoff, constructing an effective factor system, and then making predictions based on the selected factors with statistical models. The main contributions are as follows: (1) KF-LR-Elastic Net, a new factor-selection model, is built. Knockoff variables are added to the Logistic regression model to control the FDR of factor identification and ensure the accuracy of factor selection, which provides a workable way to improve the modeling effect of Logistic regression. (2) The Elastic Net regularization method and the Knockoff method are applied to the quantitative multifactor model at the same time, which better balances the safety and accuracy of factor selection. (3) A new effective factor system is constructed, taking the CSI 300 index of the Chinese stock market as the research object. On this basis, the effectiveness of the constructed KF-LREN-LR model in multifactor quantitative stock selection is demonstrated from multiple perspectives, using Logistic regression forecasting, constructing stock strategies, and comparing their investment performance.
The rest of the article is organized as follows: The first section describes the factor-selection problem, introduces the Elastic Net, and constructs the factor-selection model LR-Elastic Net. The second section introduces Knockoff and builds a new model (KF-LR-Elastic Net) to achieve factor selection and prediction. The third section is an empirical analysis using the CSI 300 constituent-stock data and the Chinese market as an example. The fourth section summarizes the paper and discusses future research.
4. Conclusions and Prospect
Based on monthly data on the CSI 300 index constituents from January 2016 to December 2022, a portfolio model is constructed using the KF-LR-Elastic Net variable-selection method and Logistic-regression classification forecasting. Effective factors with a relatively significant impact on stock returns are selected from the original factor pool while the FDR is controlled; we then forecast whether individual stock returns can outperform the CSI 300 index and construct a portfolio for a historical trading backtest. The following conclusions are drawn:
First, the KF-LR-Elastic Net model is used to screen quantitative factor indicators with high explanatory power for returns. The indicators it selects overlap substantially with those selected by Elastic Net regression without Knockoff variables and by Lasso-Logistic regression without Elastic Net regularization, but it tends to select fewer variables in order to control false positives.
Second, Logistic regression is chosen as the forecasting model and is built on the selected factors. Used to predict stock returns, it constructs a high-quality portfolio both before and after the introduction of Knockoff, meeting investors' need for proper risk avoidance. The results show that the Knockoff-based portfolios are more robust in terms of monthly returns, and the model and strategy are more advantageous when the stock market trends significantly upward or downward. By incorporating Elastic Net regularization, the Knockoff approach constructs a strategy with higher excess returns and lower risk.
Third, the relationship between quantitative factor indicators and stock returns is studied with a focus on the Chinese stock market. Applying Knockoff to multifactor stock selection is shown to have merit: it can lead to better investment performance, better serve investors' decision making, and provide an intelligent investment-decision method.
In summary, this paper combines Knockoff, Elastic Net regularization, and Logistic regression. On the one hand, the method is significantly better at selecting, from many candidates, the factors that have a real and significant impact on the excess returns of stocks. On the other hand, the constructed KF-LREN-LR model can better mine information about financial assets and select assets with high returns, which matters for stabilizing returns and controlling market risk. In addition, the methodology of this paper can be extended from multiple perspectives. In terms of models, Knockoff's "counterfeit" idea can be extended to other models or methods, such as machine learning methods (support vector machines, linear discriminant analysis, etc.) or methods of multisource data fusion, such as integrative analysis. In terms of variable-selection methods, the Elastic Net penalty used in this paper can be replaced with others, especially when the variables have a grouping structure; variable selection can then be achieved with sparse-group MCP and CMCP penalties, among others. In terms of applications, the approach can be extended to practical problems such as credit-default early warning and financial-risk early warning, which are worth exploring in subsequent research.