This chapter presents the feature selection method and process, the architecture of the model, and how the model processes its inputs.
2.1. Feature Selection
The data used in this paper come from Cedar Capital and the Tushare platform. Cedar Capital, founded in 2013, is an asset management company focusing on private securities investment. Tushare is a financial data interface platform that provides API access to all kinds of financial data, such as stocks, futures, and funds. The stock pool in this paper consists of all normally traded stocks in the CSI All Share Index, excluding stocks with insufficient listing history and those under ST special treatment.
In stock analysis, a multitude of candidate data are available, which can be broadly classified into three primary categories: basic characteristics, technical indicators, and fundamental indicators. Basic characteristics comprise data that reflect the basic trading conditions of a stock. Technical indicators are calculated from data such as historical prices and volumes and are used to analyze stock price movements. Fundamental indicators focus on the financial condition and operating results of a company. Representative features of the three categories are shown in Table 1.
There is obviously a great deal of redundant information in such a large volume of heterogeneous historical data, and feeding this superfluous information into a model is a suboptimal approach. Therefore, when applying deep learning models, feature selection is required to extract the key features from the noisy data, neither too many nor too few. This process reduces the number of variables, lowers the computational cost, reduces overfitting, and improves model performance [20]. Htet Htet Htun et al. summarized feature selection methods as filter, wrapper, and embedded methods: the filter method is fast and robust but ignores dependencies between features; the wrapper method can capture complex relationships between features but is computationally expensive and prone to overfitting; the embedded method combines the filter and wrapper approaches and is more efficient and less affected by overfitting [19]. Singh J and Khushi M pointed out that technical indicators alone are not sufficient for long-term stock forecasting and that combining basic characteristics with technical indicator data makes stock forecasting more accurate [21]. Therefore, this paper starts from the 15 kinds of stock data, drawn from the basic characteristics and technical indicators, that were summarized by Zexin Hu et al. [22] and shown by previous research to be relatively effective. Pearson correlation analysis is used to measure the correlation between each feature and the daily return, as well as the interrelationships among the features, and the Information Gain method is applied for comparative validation, so as to select the appropriate input features.
Data for the 15 stock metrics selected for analysis are shown in Table 2; their abbreviations may be used in what follows.
Subsequently, the Pearson correlation coefficient, extended with market value weights, and the correlations between the factors were calculated. The Pearson correlation coefficient is a statistical measure of the linear relationship between two variables. It is calculated using the following formula:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

The market value weights are introduced to calculate the weighted correlation coefficient, where the weighted mean is expressed as follows:

$$\bar{x}_w = \frac{\sum_{i} w_i x_i}{\sum_{i} w_i}, \qquad \bar{y}_w = \frac{\sum_{i} w_i y_i}{\sum_{i} w_i}$$

The formula for weighted covariance is

$$\mathrm{cov}_w(x, y) = \frac{\sum_{i} w_i (x_i - \bar{x}_w)(y_i - \bar{y}_w)}{\sum_{i} w_i}$$

The formula for weighted variance is

$$\mathrm{var}_w(x) = \frac{\sum_{i} w_i (x_i - \bar{x}_w)^2}{\sum_{i} w_i}$$

The weighted correlation coefficient is expressed as follows:

$$r_w = \frac{\mathrm{cov}_w(x, y)}{\sqrt{\mathrm{var}_w(x)\,\mathrm{var}_w(y)}}$$

where the symbols are shown in Table 3.
For each stock in the stock pool, the correlation coefficients between its daily return and each factor are calculated after removing missing values. These coefficients are then aggregated using weights based on the market value of the stock to obtain the average weighted correlation coefficient. Pearson correlation analysis is then performed between the factors in turn to identify possible multicollinearity and simplify the model. After removing highly correlated features (correlation greater than 80%) and features whose correlation with the return is less than 10%, 8 features are selected: daily return, RSI, amt, vol, turnover, qfq, mkt, close.
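As a concrete illustration, a minimal sketch of the market-value-weighted correlation described above is given below, using pandas and NumPy; the column names ret (daily return) and mkt (market value) and the helper names are assumptions for illustration, not taken from the paper's code.

```python
import numpy as np
import pandas as pd

def weighted_corr(x: np.ndarray, y: np.ndarray, w: np.ndarray) -> float:
    """Market-value-weighted Pearson correlation between two series."""
    x_mean = np.sum(w * x) / np.sum(w)
    y_mean = np.sum(w * y) / np.sum(w)
    cov_w = np.sum(w * (x - x_mean) * (y - y_mean)) / np.sum(w)
    var_x = np.sum(w * (x - x_mean) ** 2) / np.sum(w)
    var_y = np.sum(w * (y - y_mean) ** 2) / np.sum(w)
    return cov_w / np.sqrt(var_x * var_y)

def factor_return_corrs(df: pd.DataFrame, factors: list[str]) -> pd.Series:
    """Weighted correlation of each factor with the daily return.
    Assumes columns 'ret' (daily return) and 'mkt' (market value)."""
    df = df.dropna(subset=factors + ["ret", "mkt"])
    return pd.Series(
        {f: weighted_corr(df[f].to_numpy(), df["ret"].to_numpy(),
                          df["mkt"].to_numpy()) for f in factors}
    )
```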
Subsequently, Information Gain (IG) is used for comparative validation. IG is the change in class entropy from a previous state to a known state and can be used to compute the relevance of features [23]. It further selects from the features obtained in the first step. For a feature X, the entropy H is

$$H(X) = -\sum_{i} p(x_i) \log_2 p(x_i)$$

where $p(x_i)$ is the probability that X takes the value $x_i$. The higher the entropy, the higher the degree of disorder of the data; the lower the entropy, the more orderly the data. Subsequently, the conditional entropy, i.e., the entropy of the feature X given the variable Y, is calculated with the following formula:

$$H(X \mid Y) = \sum_{j} p(y_j)\, H(X \mid Y = y_j)$$

Finally, the Information Gain is obtained, which measures the gain resulting from splitting on a feature:

$$IG(T, X) = H(T) - H(T \mid X)$$

where T is the target variable. A larger Information Gain indicates that splitting on this feature significantly reduces the uncertainty of the system. A subset of features is then determined under the guidance of a set threshold value t.
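A minimal sketch of this computation on discretized data follows; binning the continuous feature into five quantile classes mirrors the five return categories used later in this paper, but the bin count and function names are illustrative assumptions.

```python
import numpy as np

def entropy(labels: np.ndarray) -> float:
    """Shannon entropy of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature: np.ndarray, target: np.ndarray, bins: int = 5) -> float:
    """IG(T, X) = H(T) - H(T | X), with the continuous feature discretized."""
    edges = np.quantile(feature, np.linspace(0, 1, bins + 1)[1:-1])
    x = np.digitize(feature, edges)
    h_t = entropy(target)
    h_t_given_x = sum(
        np.mean(x == v) * entropy(target[x == v]) for v in np.unique(x)
    )
    return h_t - h_t_given_x
```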
Based on the Pearson correlation results, an IG model is constructed to calculate the relevance of each feature; the top three features obtained are RSI, qfq, and vol.
The Pearson and IG results are combined as shown in the following equation:

$$\mathrm{Score}(F_s) = \frac{I(F_s)}{n} + \frac{R(F_s)}{m}$$

where F is the original set of indicators, $F_s$ is the set of highly correlated indicators, n and m are weights, and $R(F_s)$ and $I(F_s)$ are the Pearson correlation score and Information Gain score of the features in the $F_s$ set, respectively. For all the indicators in the $F_s$ set, the correlations are calculated with equal weights and sorted to obtain the final result. In summary, the feature selection method used in this paper is shown in Figure 1, and the final features obtained are daily return, turnover, RSI, vol, and qfq.
The relevant algorithm is shown in Algorithm 1:

Algorithm 1 The Pearson and IG weighted selection pseudo-code.
Input: feature set F, daily returns, and market values for each stock in the pool
Output: Final features
1: df.fillna(value=pd.NA, inplace=True)
2: for each stock do
3:     compute the weighted correlation between each feature and the daily return
4:     accumulate the coefficients, weighted by the stock's market value
5: end for
6: Fs = features with return correlation ≥ 10%, after removing pairwise correlations > 80%
7: compute the Information Gain score I(Fs) and Pearson score R(Fs) over Fs
8: Final Features = np.sort(Fs + I(Fs) / n + R(Fs) / m)
9: return Final Features
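A runnable rendering of Algorithm 1 is sketched below under stated assumptions: it reuses the factor_return_corrs and information_gain helpers sketched earlier, the thresholds (80%, 10%) come from the text, and the weights n and m remain free parameters as in the pseudo-code.

```python
import numpy as np
import pandas as pd

def select_features(df: pd.DataFrame, factors: list[str],
                    n: float = 2.0, m: float = 2.0) -> list[str]:
    """Pearson + IG weighted selection (sketch of Algorithm 1).
    Assumes columns 'ret' (daily return) and 'mkt' (market value)."""
    df = df.dropna(subset=factors + ["ret", "mkt"])
    # Step 1: weighted Pearson correlation of each factor with the return.
    r = factor_return_corrs(df, factors)
    # Step 2: drop factors whose correlation with the return is below 10%.
    fs = [f for f in factors if abs(r[f]) >= 0.10]
    # Step 3: drop one of any pair of factors correlated above 80%.
    corr = df[fs].corr().abs()
    keep: list[str] = []
    for f in fs:
        if all(corr.loc[f, k] <= 0.80 for k in keep):
            keep.append(f)
    # Step 4: combine IG and Pearson scores with weights n, m and rank.
    target = pd.qcut(df["ret"], 5, labels=False).to_numpy()
    score = {f: information_gain(df[f].to_numpy(), target) / n
                + abs(r[f]) / m for f in keep}
    return sorted(keep, key=score.get, reverse=True)
```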
Here, qfq (the forward-adjusted closing price) is similar to close, but qfq additionally accounts for the decline in share price caused by factors such as corporate dividends. Most past research uses the original closing price as input, but the adjusted closing price has also been used by scholars in different scenarios and proved to be effective. For example, B. Xie et al. [24] use the adjusted closing price and expand the application of natural language processing technology in financial market analysis by introducing a semantic framework and an innovative SemTree feature space. B. Voon Wan Niu et al. use daily adjusted closing price data over a period of about two decades to determine whether the Malaysian stock market exhibits chaotic characteristics [25]. Yumo Xu and Shay B. Cohen predicted stock movements by combining adjusted closing prices with text data from tweets [26]. This is not explored in depth in this paper; a possible explanation is that price gaps in the original close (e.g., around ex-dividend dates) may affect the training of the model, whereas using qfq keeps the temporal patterns captured by the LSTM consistent with real investment returns.
2.2. Construction of Models
The model constructed in this paper is trained on data from the past 480 days and uses 60 days of data to predict the weekly return for the following week. A stock selection strategy is constructed by classifying stocks into five categories based on their returns and calculating the probability that each stock belongs to each category during a position change cycle (i.e., one week). The five categories are Surge, Rise, Flat, Fall, and Plummet; the labels 4 to 0 correspond to “Surge,” “Rise,” “Flat,” “Fall,” and “Plummet,” respectively, indicating a sharp increase, an increase, a nearly unchanged level, a decrease, and a sharp decrease in the stock’s trend. The model calculates the strategy returns of the selected stocks for each position change cycle and updates the stock picking strategy each cycle by outputting the codes of the selected stocks.
Firstly, the general framework of the CNN-LSTM-GNN (CLGNN) neural network model proposed in this paper is explained. As demonstrated in Figure 2, the CLGNN model consists of a combination of a CNN module, an LSTM module, and a GNN module.
The foundations of modern CNNs were laid by Yann LeCun et al., who successfully applied them to handwritten digit recognition [27]. The convolution used in this paper, activated by mish, can be expressed as follows:

$$y = \mathrm{mish}\!\left(\sum_{k=1}^{K} W_k \cdot X_{t+k-1} + b\right), \qquad \mathrm{mish}(x) = x \tanh\big(\ln(1 + e^{x})\big)$$

where X is the input, W is the weight, K is the kernel size, and b is the bias. Subsequently, Dropout is applied to improve the generalization of the model.
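A minimal PyTorch sketch of one such convolutional block (1D convolution, mish activation, Dropout) is given below; the channel sizes, kernel size, and dropout rate are illustrative assumptions, not the paper's reported hyperparameters.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """1D convolution -> mish -> Dropout, as in the CNN module."""
    def __init__(self, in_ch: int, out_ch: int, kernel: int = 3, p: float = 0.2):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, padding=kernel // 2)
        self.act = nn.Mish()          # mish(x) = x * tanh(softplus(x))
        self.drop = nn.Dropout(p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        return self.drop(self.act(self.conv(x)))
```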
LSTM was proposed by Sepp Hochreiter and Jürgen Schmidhuber [28], and its algorithm can be expressed as follows:

$$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i[h_{t-1}, x_t] + b_i)$$
$$\tilde{c}_t = \tanh(W_c[h_{t-1}, x_t] + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(c_t)$$

where $i_t$ is the output of the input gate, $x_t$ is the input at time step t, $h_t$ is the hidden state at time step t, $c_t$ is the cell state at time step t, $f_t$ is the output of the forget gate, $o_t$ is the output of the output gate, $\tilde{c}_t$ is the candidate cell state, $\sigma$ and tanh are the activation functions, W is the weight matrix, and b is the bias.
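A sketch of the two-layer LSTM module described later in this section (two LSTM layers with Dropout between them) might look as follows in PyTorch; the hidden size and dropout rate are assumptions.

```python
import torch
import torch.nn as nn

class LSTMModule(nn.Module):
    """Two-layer LSTM with Dropout applied between the stacked layers."""
    def __init__(self, in_dim: int, hidden: int = 64, p: float = 0.2):
        super().__init__()
        # The dropout= argument applies only between stacked LSTM layers.
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=2,
                            batch_first=True, dropout=p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, features); return the hidden-state sequence.
        out, _ = self.lstm(x)
        return out
```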
Inspired by the multivariate time series graph neural network (MTGNN) [29], in this paper MTGNN is modified to serve as the basis of the GNN module, which analyzes the potential connections between the data. The structure of the GNN module is illustrated in Figure 3.
As illustrated in Figure 3, the GNN module comprises a graph learning layer, n graph convolution blocks, n temporal convolution modules, and an output module. The graph learning layer computes and adapts the graph adjacency matrix in order to capture hidden relationships between time series. In this paper, the assumption is made that the relationship between nodes is unidirectional, meaning that a change in one node affects some other node, rather than the two affecting each other in both directions. The nodes are embedded, and the adjacency matrix is kept unidirectional by the following equations:

$$M_1 = \tanh(\alpha E_1 \Theta_1)$$
$$M_2 = \tanh(\alpha E_2 \Theta_2)$$
$$A = \mathrm{ReLU}\big(\tanh\big(\alpha (M_1 M_2^{T} - M_2 M_1^{T})\big)\big)$$

where $E_1$, $E_2$ are the initial node embeddings, $\Theta_1$, $\Theta_2$ are the model parameters, and $\alpha$ is the saturation rate of the activation function. When the value of $A_{ij}$ is positive, it indicates that the influence of the previous node on the subsequent node is stronger than that of the subsequent node on the previous one; when the value is negative, the ReLU truncates it to 0.
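Under the assumption that the layer follows the MTGNN-style formulation above, a minimal PyTorch sketch is shown below; the number of nodes, embedding dimension, and saturation rate are illustrative.

```python
import torch
import torch.nn as nn

class GraphLearner(nn.Module):
    """Learns a unidirectional adjacency matrix from node embeddings
    (a sketch following the MTGNN-style equations above)."""
    def __init__(self, num_nodes: int, emb_dim: int = 32, alpha: float = 3.0):
        super().__init__()
        self.e1 = nn.Embedding(num_nodes, emb_dim)
        self.e2 = nn.Embedding(num_nodes, emb_dim)
        self.theta1 = nn.Linear(emb_dim, emb_dim)
        self.theta2 = nn.Linear(emb_dim, emb_dim)
        self.alpha = alpha

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        m1 = torch.tanh(self.alpha * self.theta1(self.e1(idx)))
        m2 = torch.tanh(self.alpha * self.theta2(self.e2(idx)))
        # Antisymmetric score: ReLU keeps only one direction per node pair.
        a = m1 @ m2.T - m2 @ m1.T
        return torch.relu(torch.tanh(self.alpha * a))

# Usage: adj = GraphLearner(num_nodes=100)(torch.arange(100))
```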
The temporal and graph convolution modules are interleaved so as to capture both temporal and spatial dependencies. In the temporal convolution (TC) module, residual connections and skip connections are used to avoid vanishing gradients. The module consists of dilated convolutions organized in an inception layer, with kernel sizes of 1×2, 1×3, 1×6, and 1×7 (the inception kernel set of MTGNN) to account for the natural temporal cycles in the data.
In the graph convolution (GC) module, the information of nodes and their neighbors is integrated through a process known as “mix-hop propagation,” which comprises two components: information propagation and information selection. The specific formulation is as follows:

$$H^{(k)} = \beta H_{in} + (1 - \beta)\tilde{A} H^{(k-1)}$$
$$H_{out} = \sum_{k=0}^{K} H^{(k)} W^{(k)}$$

where K is the propagation depth, $H_{in}$ is the input hidden state output by the previous layer, $H_{out}$ is the hidden output state of the current layer, and $\tilde{A} = \tilde{D}^{-1}(A + I)$ is the normalized adjacency matrix with self-loops. The proportion of the node’s own state that is retained is regulated by $\beta$, and neighbor information is propagated along the graph structure recursively via the first formula; the important node features are then selected by the second formula, $H_{out} = \sum_{k=0}^{K} H^{(k)} W^{(k)}$.
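A sketch of one mix-hop propagation layer under these equations could be written as follows in PyTorch; the depth K, the value of β, and the tensor layout (batch, nodes, channels) are assumptions.

```python
import torch
import torch.nn as nn

class MixProp(nn.Module):
    """Mix-hop propagation: K-step propagation plus learned selection."""
    def __init__(self, channels: int, k: int = 2, beta: float = 0.05):
        super().__init__()
        self.k, self.beta = k, beta
        # Information selection: one learned map over the concatenated hops.
        self.select = nn.Linear(channels * (k + 1), channels)

    def forward(self, h_in: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h_in: (batch, nodes, channels); adj: (nodes, nodes)
        a_tilde = adj + torch.eye(adj.size(0), device=adj.device)
        a_tilde = a_tilde / a_tilde.sum(dim=1, keepdim=True)  # D^-1 (A + I)
        h, outs = h_in, [h_in]
        for _ in range(self.k):
            # Information propagation: retain beta of the node's own state.
            h = self.beta * h_in + (1 - self.beta) * torch.einsum(
                "vw,bwc->bvc", a_tilde, h)
            outs.append(h)
        return self.select(torch.cat(outs, dim=-1))
```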
For the stock selection strategy optimization task, the model receives as input a three-dimensional tensor with dimensions (batch size, time step, feature dimension): each batch contains batch size stocks, each stock has time step days of data, and each day’s data contain feature dimension features. This input is processed through three distinct modules. The first is the CNN module, composed of two convolutional blocks; the input passes through a 1D convolutional layer, a mish activation function layer, and a Dropout layer in sequence to generate the output of the CNN module. The LSTM module contains a 2-layer LSTM network with a Dropout layer applied between the two layers; each LSTM unit processes the input data and transmits the hidden state to the subsequent layer, producing the LSTM module output.
For the GNN module, the input first passes through the graph learning layer, and the sparse graph adjacency matrix is computed from the data; this matrix serves as input for the subsequent parts. The data first enter a starting convolutional layer and subsequently pass through three stacked convolutional blocks. The first convolutional block contains a TC module, a GC module, a residual convolutional layer, and a skip-connection convolutional layer. The input is passed into the TC module, and a copy of the input is retained as the residual. In the TC module, the input is passed in parallel into a filter dilated convolutional layer and a gate dilated convolutional layer, with a tanh activation for the filter and a sigmoid activation for the gate, to extract the feature information from the input data. The outputs of the filter and the gate are multiplied, Dropout is applied, and the result is passed into the skip-connection convolutional layer; in each convolutional block, the output of the skip-connection convolutional layer is added to the previous skip features, so that features of different levels are fused and information loss during propagation is avoided. If the GC module is enabled, the TC output is concurrently passed to the GC module, and the graph structure information is processed through two mixprop graph convolution layers; otherwise, an ordinary residual convolution is applied to the output of the TC module. The output of the graph convolution (or of the residual convolution) is summed with the residual along the last dimension, realizing the residual connection. Finally, the result is normalized by a normalization layer. The output of the first block passes through an identical second block, and the resulting output is fed into the final convolution block, where all the skip-connection features are fused with the normalized output of the block. The fused features are then passed through the two ending convolution layers to obtain the output of the GNN module.
The outputs of the three modules are concatenated and subsequently passed through a 1D convolutional layer and a mish activation function. The result is then flattened and passed through a linear layer to yield the final output.
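As an illustration of this fusion step, a minimal PyTorch sketch follows; the channel counts, sequence length, and number of output classes (five, matching the return categories) are assumptions, and the three module outputs are taken as precomputed tensors.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Concatenate CNN/LSTM/GNN outputs, then Conv1d -> mish -> flatten -> linear."""
    def __init__(self, fused_ch: int, seq_len: int, n_classes: int = 5):
        super().__init__()
        self.conv = nn.Conv1d(fused_ch, fused_ch, kernel_size=1)
        self.act = nn.Mish()
        self.fc = nn.Linear(fused_ch * seq_len, n_classes)

    def forward(self, cnn_out, lstm_out, gnn_out):
        # Each input: (batch, channels_i, time); concatenate along channels.
        x = torch.cat([cnn_out, lstm_out, gnn_out], dim=1)
        x = self.act(self.conv(x))
        return self.fc(x.flatten(start_dim=1))

# Usage sketch: logits = FusionHead(fused_ch=96, seq_len=60)(c, l, g)
# where c, l, g each have shape (batch, 32, 60).
```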