1. Introduction
Many studies have found that some university students cannot complete their studies within four years because of academic difficulties or other subjective and objective factors [1]. The academic early warning mechanism is a measure adopted to address the risk of students being demoted or expelled after the credit system reform due to weak course foundations, insufficient self-control and other reasons [2]. Therefore, against the background of the credit system, how to help students prevent and overcome difficulties in the learning process, and how to carry out academic early warning and assistance effectively, has become an urgent problem for colleges and universities.
At present, academic early warning work falls into two categories. The first is institutional early warning, which is carried out according to defined triggering conditions, implementation processes and handling methods. In practice, however, it has several problems: little warning is given in advance, in-class and out-of-class efforts are hard to coordinate, and the decision to issue a warning is often made only from the results of course assessment. The second is academic performance prediction, which is the mainstream approach to studying students’ academic early warnings. Academic prediction mainly employs techniques such as statistical modeling, data mining, and comprehensive algorithms [3,4,5] to analyze students’ historical grades in depth, discover patterns, connections, and development trends, and thereby make academic predictions. This type of method aims to identify potential academic problems in advance through data analysis, providing a scientific basis for targeted support and intervention measures [6,7,8,9].
At present, the BP neural network, the SVM algorithm and other classification models such as LGBM are mainly used to predict whether an academic warning will occur.
The BP (back-propagation) neural network is trained by propagating errors backward through the network and has good generalization ability. Yang proposed a GA-BP neural network model for prediction [10], which used the grades of one course as the original dataset and optimized the parameters of the BP neural network with a genetic algorithm (GA). Ji applied the SVM-RFE algorithm to academic early warning prediction [11]. The grades of five subjects were selected as the original dataset, and an SVM was trained to obtain the feature importance scores. The RFE algorithm then deleted the least important features, and the SVM was retrained on the remaining features; this process was repeated. In addition, Duong, Tran et al. proposed using the LGBM model for academic early warning prediction [12]. The LGBM prediction model is an ensemble model based on decision trees and boosting techniques, which has a relatively fast running speed.
However, BP neural networks are prone to getting stuck in local optima during training and take a long time to train. Their network structure is difficult to select, and they are highly sensitive to data preprocessing. In complex cases with multi-disciplinary features, their accuracy may not be particularly high. The SVM algorithm is not suitable for large data samples [13,14], and it is rather sensitive to the selection of parameters and kernel functions [15]. If academic early warning is extended to all students in the future, the sample size will become very large, and the SVM model will perform less than ideally. The LGBM model, in turn, overfits relatively severely when the data is complex and noisy [16].
To address the shortcomings of the above models, this paper proposes a novel XGBoost-based prediction framework, namely the LTBoost model. The proposed model has two main innovations. First, its framework is novel: this is the first time that the combination of the BiLSTM model, the Enhanced Transformer model and the SSA-XGBoost model has been applied to an academic early warning system. Second, the Enhanced Transformer not only includes dropout (random deactivation) and weight decay mechanisms, but also introduces a novel global feature extraction method. The process of the model is as follows. First, the data is preprocessed. According to the difficulty level of the courses, the data is divided into two groups, each of which is input into a BiLSTM model. The BiLSTM captures local bidirectional temporal information through the parallel structure of a forward LSTM and a backward LSTM. Compared with the traditional RNN, the memory module of BiLSTM is better suited to long sequences, and compared with the traditional unidirectional LSTM, its bidirectional structure can use both preceding and following context. Then, the obtained results are input into the proposed Enhanced Transformer model, which perceives global information within the entire dataset. The enhanced features are obtained from the output of the model’s encoding layer and combined with the original data features to improve the prediction accuracy. The combination of these two models can effectively handle both the short-term and long-term information in the feature data. Finally, the parameter selection of XGBoost is optimized through the sparrow search algorithm (SSA) to complete the prediction task. Compared with a single XGBoost model, SSA-XGBoost can quickly find the training parameter values with the smallest error, significantly improving the prediction efficiency.
Based on the above work, the main contributions of this paper are summarized as follows:
- (1) By using the BiLSTM module to process the two groups of data separately, the model can effectively handle strongly time-dependent data and improve the accuracy of subsequent predictions.
- (2) By adopting the proposed Enhanced Transformer model, enhanced features are added to the original features to increase the information content of the data and improve the prediction accuracy.
- (3) The SSA-XGBoost prediction model can take into account the historical influence of both long-term and short-term data, reduce the risk of overfitting, and improve the accuracy of classification prediction.
2. LTBoost Model and Algorithm
At present, traditional binary classification models for academic early warning mainly have the following three problems:
- (1) The temporal relationship within the data is strong, and traditional architectures find it difficult to use the preceding and following context simultaneously, which affects the prediction accuracy.
- (2) Traditional architectures find it difficult to take into account the historical influence of both long-term and short-term data simultaneously, which reduces the accuracy of classification prediction.
- (3) With large and complex datasets, traditional models predict poorly, and their performance is sensitive to the selection of parameters.
Therefore, we propose the LTBoost model, which integrates the BiLSTM model and the Enhanced Transformer model with the SSA-XGBoost prediction model. The proposed model can effectively alleviate the above problems and improve the accuracy of the final academic early warning classification. Next, we discuss the principles of the three main modules in the LTBoost framework.
2.1. BiLSTM Module
The scores in the original data are ordered by semester. For the professional courses taken at the end of each semester, the difficulty increases over time, and later scores depend on the proficiency achieved earlier; that is, they are related to the professional scores obtained in earlier semesters. The overall dataset therefore shows a strong temporal relationship. The LSTM model performs well on such data. However, the traditional LSTM only uses the preceding information at runtime, which may lose information and affect the accuracy of the final prediction. This paper therefore adopts a bidirectional LSTM (BiLSTM) for the preliminary processing of the data, which can use both preceding and following context and thereby improve the prediction accuracy.
LSTM introduces three main gating mechanisms, namely the forget gate, the input gate, and the output gate. Each gate consists of a Sigmoid neural network layer and a pointwise multiplication operation. The Sigmoid layer outputs values in the range $[0, 1]$, which describe how much of each component can pass through: “0” means that nothing is allowed to pass through, and “1” means that everything is allowed to pass through.
The forget gate determines how much of the memory state from the previous moment is retained, that is, how much information is discarded.
$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$$
where $W_f$ and $b_f$ represent the weight and bias parameter of the forget gate, $f_t$ is the output of the forget gate, $x_t$ is the input at the current moment, $h_{t-1}$ is the hidden state at the previous moment, and $\sigma$ is the Sigmoid activation function.
The input gate determines how much new information should be remembered at the current moment.
$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right)$$
$$\tilde{C}_t = \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right)$$
where $W_i$ and $b_i$ represent the weight and bias parameter of the input gate, $i_t$ is the output of the input gate, $\tilde{C}_t$ is the candidate memory state, and $W_c$ and $b_c$ represent the weight and bias parameter of the candidate memory.
The memory state is then updated based on the results of the forget gate and the input gate.
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
where $C_t$ represents the memory state at the current moment, $C_{t-1}$ represents the memory state at the previous moment, and $\odot$ represents element-by-element multiplication.
The output gate controls how much information at the current moment should be output to the next moment.
$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right)$$
$$h_t = o_t \odot \tanh\left(C_t\right)$$
where $W_o$ and $b_o$ represent the weight and bias parameter of the output gate, $o_t$ is the output of the output gate, and $h_t$ is the hidden state at the current moment.
The LSTM algorithm flow is shown in Figure 1.
BiLSTM mainly consists of two independent LSTM layers running in parallel, as shown in Figure 2. During operation, one layer processes the data from front to back, while the other processes it from back to front. After both LSTM layers have run, their outputs are merged by concatenation [17].
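As an illustration of the gate computations and the bidirectional concatenation described above, the following is a minimal NumPy sketch. It is not the implementation used in this paper: the random initialization, the helper names (`init_params`, `lstm_step`, `lstm_run`, `bilstm`) and the toy dimensions are all assumptions chosen for readability.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(input_dim, hidden_dim, rng):
    """Hypothetical random weights and zero biases for the four LSTM transforms."""
    shape = (hidden_dim, hidden_dim + input_dim)
    return {name: (rng.normal(0.0, 0.1, shape), np.zeros(hidden_dim))
            for name in ("f", "i", "c", "o")}

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following the standard gate equations."""
    z = np.concatenate([h_prev, x_t])             # [h_{t-1}, x_t]
    f_t = sigmoid(p["f"][0] @ z + p["f"][1])      # forget gate
    i_t = sigmoid(p["i"][0] @ z + p["i"][1])      # input gate
    c_hat = np.tanh(p["c"][0] @ z + p["c"][1])    # candidate memory state
    c_t = f_t * c_prev + i_t * c_hat              # memory state update
    o_t = sigmoid(p["o"][0] @ z + p["o"][1])      # output gate
    h_t = o_t * np.tanh(c_t)                      # hidden state
    return h_t, c_t

def lstm_run(seq, p, hidden_dim):
    """Run an LSTM over a (time, features) sequence; return all hidden states."""
    h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
    outputs = []
    for x_t in seq:
        h, c = lstm_step(x_t, h, c, p)
        outputs.append(h)
    return np.stack(outputs)

def bilstm(seq, p_fwd, p_bwd, hidden_dim):
    """Concatenate a forward pass with a time-reversed backward pass."""
    fwd = lstm_run(seq, p_fwd, hidden_dim)
    bwd = lstm_run(seq[::-1], p_bwd, hidden_dim)[::-1]   # re-align to forward time order
    return np.concatenate([fwd, bwd], axis=1)

rng = np.random.default_rng(0)
seq = rng.random((6, 3))                          # toy input: 6 time steps, 3 features
hidden = 5
out = bilstm(seq, init_params(3, hidden, rng), init_params(3, hidden, rng), hidden)
print(out.shape)  # (6, 10)
```

Each time step ends up with a vector of twice the hidden size, because the forward and backward hidden states are placed side by side.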
In this paper, the original data is first normalized according to Equation (7), and the training set and test set are divided in a ratio of 7:3.
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{7}$$
where $x'$ represents the normalized feature value, $x$ represents the feature value before normalization, $x_{\min}$ represents the minimum value within the feature to which $x$ belongs (i.e., the same column in the dataset), and $x_{\max}$ represents the maximum value within the feature to which $x$ belongs.
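The column-wise normalization of Equation (7) and the 7:3 split can be sketched as follows. The score matrix is made-up illustrative data, the guard for constant columns is an added assumption, and the split simply takes the first 70% of the rows; the paper does not state how rows are ordered before splitting.

```python
import numpy as np

def min_max_normalize(X):
    """Scale every column of X to [0, 1] using its own minimum and maximum."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)   # assumption: avoid 0/0 on constant columns
    return (X - x_min) / span

def split_70_30(X, y):
    """Take the first 70% of the rows for training and the rest for testing."""
    n_train = int(round(0.7 * len(X)))
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]

# Hypothetical data: 10 students, 6 course scores each.
scores = np.arange(40, 100, dtype=float).reshape(10, 6)
labels = np.array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0])

X_norm = min_max_normalize(scores)
X_train, X_test, y_train, y_test = split_70_30(X_norm, labels)
print(X_train.shape, X_test.shape)  # (7, 6) (3, 6)
```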
The features are divided into two groups, professional courses and basic courses, which correspond to $X_{spec}$ and $X_{gen}$ in the notation of Algorithm 1. These two groups are input into separate BiLSTM models, yielding two outputs with 10 features each, which are combined into a dataset with 20 features ($h_{temporal}$ in Algorithm 1). The temporal features extracted by BiLSTM are then fed into the Enhanced Transformer module described next.
2.2. The Proposed Enhanced Transformer Module
The proposed Enhanced Transformer model further perceives the context information globally and produces enhanced features, which are combined with the original features for prediction to improve the accuracy of the results. The flow of the Enhanced Transformer model is shown in Figure 3.
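This study characterizes temporal features through positional encoding. The position encoding expression is
$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
where $PE$ is the position encoding matrix; $pos \in [0, n_s - 1]$ represents a specific position, and $n_s$ is the sample length of the input data; $i \in [0, d_{model}/2 - 1]$ represents a specific dimension; $2i$ is an even position in the vector and $2i + 1$ is an odd position in the vector; and $d_{model}$ is the total dimensionality of the input sequence. The time series vectors obtained from the position encoding are merged with the input $X$ to get $X'$.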
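A small NumPy sketch of sinusoidal positional encoding is given below for illustration. It assumes the standard sine/cosine form with base 10000 and an even `d_model`, and it assumes that "merging" means element-wise addition; neither detail is confirmed by the text above.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sine on even columns, cosine on odd columns (d_model must be even)."""
    pos = np.arange(seq_len)[:, None]              # positions 0 .. seq_len-1
    i = np.arange(d_model // 2)[None, :]           # dimension index i
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even positions 2i
    pe[:, 1::2] = np.cos(angle)                    # odd positions 2i+1
    return pe

pe = positional_encoding(seq_len=4, d_model=20)
x = np.zeros((4, 20))                              # placeholder input features
x_pe = x + pe                                      # assumption: merge by addition
print(pe.shape)  # (4, 20)
```

At position 0 every sine term is 0 and every cosine term is 1, which is a quick sanity check on the column layout.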
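The Enhanced Transformer module utilizes the multi-head self-attention mechanism, which prevents each position from attending excessively to itself. Take the position-encoded input $X'$ as input. First, the query matrix $Q$, the key matrix $K$ and the value matrix $V$ are constructed. The weights $W_i^Q$, $W_i^K$ and $W_i^V$ of each matrix for each channel are obtained, and $Q_i$, $K_i$ and $V_i$ of each channel are computed from these weights, where $i$ is the index of the head. Then, the information matrix $head_i$ of each head is obtained with the softmax function, that is, $head_1, head_2, \ldots, head_h$ in Figure 3. They are spliced together and multiplied by the output weight matrix to get the final output matrix $Z$:
$$Q_i = X' W_i^Q, \qquad K_i = X' W_i^K, \qquad V_i = X' W_i^V$$
$$head_i = \mathrm{softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i$$
$$Z = \mathrm{Concat}\left(head_1, head_2, \ldots, head_h\right) W^O$$
where $d_q$, $d_k$, and $d_v$ are the numbers of columns of $Q_i$, $K_i$ and $V_i$, respectively. Define $d_q = d_k = d_v = d_{model}/h$, where $h$ is the number of heads, and $W^O$ is the output weight matrix. Concat splices the matrices together, and $W^O \in \mathbb{R}^{h d_v \times d_{model}}$.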
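The multi-head self-attention computation can be sketched in NumPy as follows. This is an illustrative sketch of the standard scaled dot-product form, not the Enhanced Transformer itself: the random weights, the choice of four heads and the toy input size are assumptions, and dropout and weight decay are omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)        # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo):
    """X: (n, d_model); Wq, Wk, Wv: (h, d_model, d_k); Wo: (h * d_k, d_model)."""
    heads, weights = [], []
    for Wq_i, Wk_i, Wv_i in zip(Wq, Wk, Wv):       # one iteration per head
        Q, K, V = X @ Wq_i, X @ Wk_i, X @ Wv_i
        A = softmax(Q @ K.T / np.sqrt(K.shape[1])) # attention weights, each row sums to 1
        heads.append(A @ V)
        weights.append(A)
    Z = np.concatenate(heads, axis=1) @ Wo         # Concat(head_1..head_h) W^O
    return Z, np.stack(weights)

rng = np.random.default_rng(0)
n, d_model, h = 6, 20, 4                           # toy sizes (assumed)
d_k = d_model // h
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(h, d_model, d_k)) for _ in range(3))
Wo = rng.normal(scale=0.1, size=(h * d_k, d_model))

Z, attn = multi_head_self_attention(X, Wq, Wk, Wv, Wo)
print(Z.shape, attn.shape)  # (6, 20) (4, 6, 6)
```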
Then, to reduce the computational load of the Transformer model, weight decay and a Dropout layer were added before the linear layer in the decoding part of the traditional Transformer model. Finally, from the encoding results of the Enhanced Transformer module, 16 enhanced features were obtained based on the context information of the entire dataset. Compared with the original data, these features carry more concise information, which helps reduce the processing complexity of subsequent predictions. The original input data and the output matrix of the Enhanced Transformer module are transposed and then stacked horizontally to obtain the augmented dataset ($X_{aug}$ in Algorithm 1), which is used as the input dataset for the subsequent SSA-XGBoost regression prediction.
Adding weight decay and the Dropout layer to the decoding part of the Enhanced Transformer model effectively reduces the complexity of the model, prevents overfitting, and improves the generalization ability of the model.
2.3. The Proposed SSA-XGBoost Regression Prediction Module
The LTBoost model uses the SSA-XGBoost prediction module to predict academic warnings from the processed student performance data. XGBoost is an optimized distributed gradient boosting model based on decision trees. Compared with traditional prediction models, it is more flexible and efficient, and its regularization controls the complexity of the model, which makes it better suited to nonlinear systems. Furthermore, XGBoost can effectively handle high-dimensional time series features, and its predictions have good generalization, stability and accuracy. Because the performance of XGBoost depends on its parameters, optimizing the parameter selection is important for the model used in this paper.
This paper adopts the SSA algorithm to optimize the parameter selection of the XGBoost model, as shown in the flowchart in Figure 4. The SSA algorithm makes the search for the core parameters of the XGBoost model more efficient and stable, and improves the prediction accuracy.
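First, initialize the parameters of the XGBoost model (the learning rate learning_rate, the number of weak evaluators n_estimators, the proportion of samples taken during random sampling subsample, the maximum depth of the tree max_depth, and the random feature sampling ratio colsample_bytree). Each piece of data in the parameter input matrix $X$ is denoted as $x_{i,j}$, where $i \in [1, n]$, $j \in [1, d]$, and $x_{i,j}$ represents the $i$-th solution of the $j$-th parameter. There are a total of $d$ parameters and $n$ solutions for each parameter.
Each solution is assigned a value after initialization:
$$x_{i,j} = lb_j + rand \cdot \left(ub_j - lb_j\right)$$
where $x_{i,j}$ is each data point in the input matrix $X$, $ub_j$ is the upper position boundary, $lb_j$ is the lower position boundary, and $rand$ takes a value between 0 and 1.
Randomly generate the current optimal solution and update the value of the current optimal solution according to Equation (14).
$$x_{i,j}^{t+1} = \begin{cases} x_{i,j}^{t} \cdot \exp\left(\dfrac{-i}{\alpha \cdot iter_{\max}}\right), & R_2 < ST \\ x_{i,j}^{t} + Q \cdot L, & R_2 \ge ST \end{cases} \tag{14}$$
where $x_{i,j}^{t}$ is the $i$-th solution of the $j$-th parameter in the current iteration, $t$ is the number of iteration steps, $iter_{\max}$ is the maximum number of iterations, $\alpha \in (0, 1]$ is a random number, $R_2 \in [0, 1]$ is the warning value, $ST \in [0.5, 1]$ is the safety value, $Q$ is a random number that follows a normal distribution, and $L$ is a matrix of size $1 \times d$ in which each element is 1.
Update the values of the solutions within the local range of the current optimal solution:
$$x_{i,j}^{t+1} = \begin{cases} Q \cdot \exp\left(\dfrac{x_{worst}^{t} - x_{i,j}^{t}}{i^{2}}\right), & i > n/2 \\ x_{P}^{t+1} + \left|x_{i,j}^{t} - x_{P}^{t+1}\right| \cdot A^{+} \cdot L, & \text{otherwise} \end{cases}$$
where $x_{P}$ is the current optimal solution, $x_{worst}$ is the current global worst solution, and $A$ is a matrix of size $1 \times d$ in which each element is randomly assigned a value of 1 or −1, with $A^{+} = A^{T}\left(A A^{T}\right)^{-1}$.
Updating the values of local optimal solutions, solutions at the boundaries of the search space, and solutions with very poor fitness according to Equation (17) helps the search avoid becoming trapped in local optima.
$$x_{i,j}^{t+1} = \begin{cases} x_{best}^{t} + \beta \cdot \left|x_{i,j}^{t} - x_{best}^{t}\right|, & f_i > f_g \\ x_{i,j}^{t} + K \cdot \left(\dfrac{\left|x_{i,j}^{t} - x_{worst}^{t}\right|}{\left(f_i - f_w\right) + \varepsilon}\right), & f_i = f_g \end{cases} \tag{17}$$
where $x_{best}$ is the current global optimal solution, $\beta$ is the step size control parameter, which is a random number that follows the standard normal distribution, $K \in [-1, 1]$ is a random number, $f_i$ is the fitness value of $x_i$, $f_g$ is the fitness value of the current global optimal solution, $f_w$ is the fitness value of the current global worst solution, and $\varepsilon$ is a very small constant that prevents the denominator from being 0.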
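To make the three update rules concrete, the following is a compact sketch of a sparrow search loop in NumPy. It is a simplified illustration under several assumptions, not the procedure used in this paper: the population size, the producer and danger-aware fractions, the clipping to the bounds, and in particular the quadratic `toy_fitness` (a hypothetical stand-in for the validation error of an XGBoost model) are all chosen for the example.

```python
import numpy as np

def ssa_minimize(fitness, lb, ub, n=20, iters=30, producer_frac=0.2,
                 danger_frac=0.1, st=0.8, seed=0):
    """Minimal sparrow search sketch; returns (best_x, best_f, history of best_f)."""
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, dtype=float), np.asarray(ub, dtype=float)
    d, eps = lb.size, 1e-12

    def evaluate(P):
        return np.array([fitness(x) for x in P])

    X = lb + rng.random((n, d)) * (ub - lb)            # x = lb + rand * (ub - lb)
    fit = evaluate(X)
    best_x, best_f = X[fit.argmin()].copy(), float(fit.min())
    history = [best_f]
    n_prod = max(1, int(producer_frac * n))
    n_danger = max(1, int(danger_frac * n))

    for _ in range(iters):
        order = np.argsort(fit)                        # best solutions first
        X, fit = X[order], fit[order]
        worst = X[-1].copy()

        # Producers: scale the position down if R2 < ST, else take a random step Q * L.
        r2 = rng.random()
        for i in range(n_prod):
            if r2 < st:
                alpha = rng.random() + eps
                X[i] = X[i] * np.exp(-(i + 1) / (alpha * iters))
            else:
                X[i] = X[i] + rng.normal() * np.ones(d)
        X[:n_prod] = np.clip(X[:n_prod], lb, ub)
        x_p = X[evaluate(X[:n_prod]).argmin()].copy()  # best producer position

        # Scroungers: the worse half relocates, the others move around the best producer.
        for i in range(n_prod, n):
            if i + 1 > n / 2:
                X[i] = rng.normal() * np.exp((worst - X[i]) / (i + 1) ** 2)
            else:
                a_plus = rng.choice([-1.0, 1.0], size=d) / d   # A^T (A A^T)^-1, since A A^T = d
                X[i] = x_p + (np.abs(X[i] - x_p) @ a_plus) * np.ones(d)
        X = np.clip(X, lb, ub)
        fit = evaluate(X)
        if fit.min() < best_f:
            best_x, best_f = X[fit.argmin()].copy(), float(fit.min())

        # Danger-aware sparrows: move toward the best, or step relative to the worst.
        worst, f_w = X[fit.argmax()].copy(), fit.max()
        for i in rng.choice(n, size=n_danger, replace=False):
            if fit[i] > best_f:
                X[i] = best_x + rng.normal() * np.abs(X[i] - best_x)
            else:
                k = rng.uniform(-1.0, 1.0)
                X[i] = X[i] + k * np.abs(X[i] - worst) / ((fit[i] - f_w) + eps)
        X = np.clip(X, lb, ub)
        fit = evaluate(X)
        if fit.min() < best_f:
            best_x, best_f = X[fit.argmin()].copy(), float(fit.min())
        history.append(best_f)
    return best_x, best_f, history

# Hypothetical stand-in for the validation error over (learning_rate, subsample, max_depth).
target = np.array([0.1, 0.8, 7.0])

def toy_fitness(x):
    return float(np.sum((x - target) ** 2))

lower, upper = [0.01, 0.5, 1.0], [0.3, 1.0, 10.0]
best_x, best_f, history = ssa_minimize(toy_fitness, lower, upper)
print(best_x, best_f)
```

In an actual SSA-XGBoost setup the fitness function would train and validate a model for each candidate parameter vector, and integer parameters such as max_depth would need rounding; both are left out of this sketch.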
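The optimal XGBoost parameters obtained through the SSA algorithm are then brought into the XGBoost model for prediction. Define the model as an ensemble of decision trees; the model’s predicted value at the $t$-th tree is
$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)$$
where $x_i$ is the $i$-th sample, $N$ is the total number of samples, $i \in [1, N]$, and $f_t$ is the decision function of the $t$-th tree.
The objective function of the $t$-th decision tree can be denoted as
$$Obj^{(t)} = \sum_{i=1}^{N} l\left(y_i, \hat{y}_i^{(t)}\right) + \Omega(f_t) = \sum_{i=1}^{N} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$
where $l$ is the loss function, $\Omega(f_t)$ is the regularization term, which represents the model complexity of the $t$-th tree, $\hat{y}_i^{(t)}$ is the predicted value of the $i$-th sample $x_i$, and $\hat{y}_i^{(t-1)}$ represents the predicted value of the previous $t-1$ decision trees for the sample $x_i$.
Then, we perform a second-order Taylor expansion of the objective function.
$$Obj^{(t)} \approx \sum_{i=1}^{N} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t)$$
where $g_i = \partial l\left(y_i, \hat{y}_i^{(t-1)}\right) / \partial \hat{y}_i^{(t-1)}$ is the first-order derivative of the loss with respect to the previous prediction, and $h_i = \partial^{2} l\left(y_i, \hat{y}_i^{(t-1)}\right) / \partial \left(\hat{y}_i^{(t-1)}\right)^{2}$ is the corresponding second-order derivative. Removing the constant term $l\left(y_i, \hat{y}_i^{(t-1)}\right)$ gives:
$$Obj^{(t)} \approx \sum_{i=1}^{N} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t)$$
where we define the regularization term $\Omega(f_t)$ as Equation (22):
$$\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2} \tag{22}$$
Combine the first-order and second-order term coefficients:
$$Obj^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} \left(H_j + \lambda\right) w_j^{2} \right] + \gamma T, \qquad G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i$$
where $T$ is the total number of leaf nodes, $I_j$ represents the set of samples in the $j$-th leaf node, $G_j$ is the sum of the first-order partial derivatives contained in leaf node $j$, $H_j$ is the sum of the second-order partial derivatives contained in leaf node $j$, $w_j$ is the score of leaf node $j$, and $\gamma$ and $\lambda$ are the regularization coefficients. The optimal solution $w_j^{*}$ for leaf node $j$ is obtained by setting the derivative of the combined objective with respect to $w_j$ to zero.
$$w_j^{*} = -\frac{G_j}{H_j + \lambda}$$
The final objective function is obtained by substituting $w_j^{*}$ back into the combined objective.
$$Obj^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T$$
The objective function obtained through optimization and iteration can be used to optimally divide the decision tree. The tree that has been divided each time is taken as the next tree to be optimally divided. Finally, the leaf node scores of each decision tree are summed to give the final prediction value $\hat{y}_i$.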
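As a worked example of the leaf score and the resulting objective value, the sketch below evaluates both for a toy dataset. It assumes the squared loss $l = \frac{1}{2}(y - \hat{y})^2$, so that $g_i = \hat{y}_i - y_i$ and $h_i = 1$; the data, the two candidate tree structures and the values of the regularization coefficients are invented for illustration.

```python
import numpy as np

def leaf_weight(g, h, lam):
    """Optimal leaf score w* = -G / (H + lambda) for the samples in one leaf."""
    return -g.sum() / (h.sum() + lam)

def structure_score(g, h, leaves, lam, gamma):
    """Objective value -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T for a given leaf assignment."""
    score = 0.0
    for idx in leaves:
        G, H = g[idx].sum(), h[idx].sum()
        score -= 0.5 * G ** 2 / (H + lam)
    return score + gamma * len(leaves)

# Toy data: four samples, previous prediction 0.5 for all of them.
y = np.array([1.0, 1.0, 0.0, 0.0])
y_prev = np.full(4, 0.5)
g = y_prev - y            # first-order derivatives under the assumed squared loss
h = np.ones(4)            # second-order derivatives under the assumed squared loss
lam, gamma = 1.0, 0.0

one_leaf = [np.arange(4)]
two_leaves = [np.array([0, 1]), np.array([2, 3])]

w_left = leaf_weight(g[two_leaves[0]], h[two_leaves[0]], lam)
w_right = leaf_weight(g[two_leaves[1]], h[two_leaves[1]], lam)
print(w_left, w_right)
print(structure_score(g, h, one_leaf, lam, gamma),
      structure_score(g, h, two_leaves, lam, gamma))
```

Here the left leaf has $G = -1$, $H = 2$, giving $w^{*} = 1/3$, and the right leaf gives $w^{*} = -1/3$. The single-leaf tree scores 0 and the two-leaf tree scores $-1/3$, so this split lowers the objective.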
2.4. Algorithm Pseudocode and Complexity Analysis
To facilitate reproducibility, Algorithm 1 presents the complete training procedure of LTBoost.
Algorithm 1: LTBoost Training Procedure
Input: Historical course score matrix X, labels y
Output: Trained LTBoost model
1. Normalize X to [0, 1] range
2. Split features into two groups: X_gen (general courses) and X_spec (specialized courses)
3. Split data into training (70%) and test (30%) sets temporally
4. // BiLSTM stage
5. h_gen ← BiLSTM(X_gen) // forward + backward LSTM
6. h_spec ← BiLSTM(X_spec)
7. h_temporal ← Concat(h_gen, h_spec)
8. // Enhanced Transformer stage
9. h_global ← EnhancedTransformer(h_temporal) // multi-head self-attention
10. X_aug ← Concat(h_temporal, h_global) // augmented feature set
11. // SSA-XGBoost stage
12. Initialize XGBoost parameters (lr, n_estimators, subsample, max_depth, colsample)
13. θ* ← SSA_optimize(XGBoost, X_aug, y_train) // sparrow search
14. model ← train_XGBoost(X_aug, y_train, θ*)
15. Return model
Computational Complexity Analysis:
The overall time complexity of LTBoost is dominated by three modules:
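BiLSTM: $O(S \cdot L \cdot d_h^{2})$, where $S$ is sequence length, $L$ is number of layers, $d_h$ is hidden dimension.
Enhanced Transformer: $O(S^{2} \cdot d_{model})$ for self-attention, plus $O(S \cdot d_{model}^{2})$ for feed-forward layers.
SSA-XGBoost: $O(K \cdot d \cdot N \log N)$ for tree building ($K$: trees, $d$: features, $N$: samples), plus $O(I \cdot P \cdot D)$ for SSA optimization ($I$: iterations, $P$: population size, $D$: parameter dimension).
Overall complexity: $O(S L d_h^{2} + S^{2} d_{model} + S d_{model}^{2} + K d N \log N + I P D)$. In practice, the SSA overhead is negligible ($I$ ≤ 50, $P$ ≤ 30).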
3. Experiment and Result Analysis on Academic Early Warning and Prediction
The original data for the academic warning prediction comes from the academic administration system of the Aviation Engineering College of the Civil Aviation University of China and covers students of the 2018 and 2019 cohorts. The original data is messy and the courses selected by different students vary, making it difficult to extract logical information. Therefore, we only selected students from the same major and used their compulsory course grades as samples to predict their academic situations. A dataset consisting of 19 features was obtained, and all the grades were arranged in chronological order from the first year to the fourth year.
The first few columns of the data mostly contain the grades of basic courses such as “Mechanical Drawing”, “Theoretical Mechanics”, and “Mechanical Materials”. The later columns contain the grades of more challenging specialized courses. The later courses build partly on the knowledge accumulated in the earlier courses, so the earlier grades influence the later grades to some extent. Moreover, the academic situation is predicted based on the overall performance of the students in the following academic year. If the grades of the earlier courses are not satisfactory, higher grades are required in the later courses to avoid a warning. Thus, the timing of the courses should also be taken into account when making predictions, and the model proposed in this paper is designed with this in mind. The warning label $y$ takes the value 0 or 1: $y = 1$ indicates that the student has received an academic warning, and $y = 0$ indicates that the student has not received an academic warning. The dataset consists of 518 samples. The ratio of samples labeled as “requiring alert” to those labeled as “not requiring alert” is 8:251, that is, 16 warned samples versus 502 samples without a warning.
Then, we normalized the academic early warning prediction data, which alleviates the problem of gradient descent optimization being hindered by large differences in the ranges of feature values. This reduces the model’s sensitivity to specific feature scales and enhances its generalization ability across different data scales [18]. The data was then divided into a training set and a test set in a ratio of 7:3, and the features were split into two groups according to the compulsory professional courses and the general basic courses.
The parameters in the SSA-XGBoost module mainly include the learning rate (learning_rate), the number of weak evaluators (n_estimators), the proportion of samples taken during random sampling (subsample), the maximum depth of the tree (max_depth) and the random feature sampling ratio (colsample_bytree). According to previous empirical studies [19], the values of subsample and colsample_bytree are in the range of 0.5 to 1, and max_depth is generally in the range of 1 to 10. We then refer to similar empirical studies on predicting nonlinear data [20] and obtain the optimal values with the help of the SSA algorithm.
Ultimately, we optimized the XGBoost model using the SSA algorithm and obtained the parameter values shown in Table 1.
This paper employs LSTM, CNN, XGBoost, and the LSTM + CNN model for a comparative experiment. For these other models, the best parameter values were selected by comparing the errors obtained with different values. This paper selects four evaluation metrics: Mean Absolute Error (MAE), Mean Square Error (MSE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE). The four metrics are defined as follows.
$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|$$
$$MSE = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^{2}$$
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^{2}}$$
$$MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{\hat{y}_i - y_i}{y_i} \right|$$
where $\hat{y}_i$ is the predicted value for the $i$-th sample, $y_i$ is the corresponding true value, and $n$ is the number of samples in the test set.
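The four error metrics can be computed with a few lines of NumPy, as in the sketch below. The example values are invented, and the true values are deliberately nonzero because MAPE divides by the true value; how zero-valued labels are handled in the experiments is not stated in the text.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and MAPE (in percent) for two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mse = np.mean(err ** 2)
    return {
        "MAE": float(np.mean(np.abs(err))),
        "MSE": float(mse),
        "RMSE": float(np.sqrt(mse)),
        "MAPE": float(100.0 * np.mean(np.abs(err / y_true))),  # assumes no zero in y_true
    }

metrics = regression_metrics([1.0, 2.0, 4.0], [1.5, 2.0, 3.0])
print(metrics)
```

For these values the errors are 0.5, 0 and −1, so MAE is 0.5, MSE is 1.25/3, and MAPE is 25%.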
The errors of each evaluation index obtained during the optimization process are shown in the following table.
It can be seen from Table 2, Table 3 and Table 4 that when colsample_bytree is set to 0.4, the MSE is reduced by 4.2% and 8.6% compared with 0.6, 0.5 and 0.8, the MAE is reduced by 1.9%, and the MAPE is reduced by 0.1%. When subsample is set to 0.8, the MSE is reduced by 5.9%, 14.8% and 4% compared with 0.5, 0.7 and 0.9, respectively; the MAE is reduced by 5.4%, 7.0% and 1.9%, respectively; and the MAPE is reduced by 0.1%. When max_depth is set to 7, the MSE is reduced by 4.6%, 1.9% and 24.6% compared with 5, 3 and 1, respectively, and the MAE is reduced by 5.4%, 10.3% and 34.0%, respectively. Therefore, we set colsample_bytree to 0.4, subsample to 0.8, and max_depth to 7.
For the LSTM model, optimization is mainly achieved by adjusting the number of hidden layers. The results of the model optimization process are shown in the following table.
According to the results in Table 5, when the number of hidden layers is four, the MSE is reduced by 4.5% and 5.2% compared with six, five and eight, the RMSE is reduced by 2.6%, and the MAE is reduced by 1.9%. Therefore, we set the number of hidden layers to four. Finally, we conducted hyperparameter optimization on the CNN model. This paper studied three widely used activation functions: Sigmoid, Tanh, and ReLU. The results predicted by the model under the three activation functions are shown in the following figure.
Based on the results in Figure 5, we select the ReLU function as the activation function of the CNN model. In the CNN + LSTM model, the MSE obtained with the ReLU activation function is also the smallest. Based on the above conclusions, the values of all the hyperparameters for model optimization can be determined.
The number of training steps for the five classification prediction models is set to 150. The samples (rows) of the dataset are divided into the training set and the test set in a 7:3 ratio for the experiment. Because this is a binary classification problem, a prediction result of less than 0.5 is defined as 0, meaning that no academic warning occurs; otherwise, it is defined as 1, meaning that an academic warning occurs.
To verify the effectiveness of the BiLSTM module, the Enhanced Transformer model and SSA-XGBoost, this study conducted ablation experiments.
Table 6 shows that the MSE of the LTBoost proposed in this paper is 27.9% lower than that of BiLSTM, 7.7% lower than that of BiLSTM + SSA-XGBoost, and 4.8% lower than that of the Enhanced Transformer + SSA-XGBoost. The MAE of LTBoost is 41.3% lower than that of BiLSTM, 36% lower than that of BiLSTM + SSA-XGBoost, and 12.8% lower than that of the Enhanced Transformer + SSA-XGBoost. The RMSE of LTBoost is 13% lower than that of BiLSTM, 3.7% lower than that of BiLSTM + SSA-XGBoost, and 2.4% lower than that of the Enhanced Transformer + SSA-XGBoost. The MAPE of LTBoost is 29.2% lower than that of BiLSTM, 11% lower than that of BiLSTM + SSA-XGBoost, and 1.1% lower than that of the Enhanced Transformer + SSA-XGBoost.
This article compares LTBoost with the commonly used CNN, LSTM, XGBoost and CNN + LSTM, and the results are shown in the table below:
Table 7 shows that the MSE of the LTBoost proposed in this paper is 30.8%, 28.8%, 27.9%, and 29.8% lower than that of CNN, LSTM, XGBoost, and CNN + LSTM, respectively; the MAE of LTBoost is 39.1%, 38.5%, 38.5%, and 40.2% lower than that of CNN, LSTM, XGBoost, and CNN + LSTM, respectively; the RMSE of LTBoost is 14.4%, 13.3%, 13.1%, and 30.7% lower than that of CNN, LSTM, XGBoost, and CNN + LSTM, respectively; and the MAPE of LTBoost is 42.4%, 34%, 30.7%, and 40.4% lower than that of CNN, LSTM, XGBoost, and CNN + LSTM, respectively.
Accuracy is adopted as the evaluation metric, as shown in Equation (31). To present the predictions more intuitively, this paper also uses a confusion matrix as an evaluation tool.
$$Accuracy = \frac{TP + TN}{n} \tag{31}$$
where $TP$ represents the number of samples whose actual value is 1 and whose predicted value is also 1, $TN$ represents the number of samples whose actual value is 0 and whose predicted value is also 0, and $n$ represents the number of samples in the test set.
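The 0.5 threshold, the confusion-matrix counts and the accuracy of Equation (31) can be sketched as follows. The scores and labels are made-up values for illustration only, and the function names are hypothetical.

```python
import numpy as np

def classify(scores, threshold=0.5):
    """Map continuous predictions to labels: below the threshold is 0, otherwise 1."""
    return (np.asarray(scores, dtype=float) >= threshold).astype(int)

def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 means an academic warning."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    tp, tn, _, _ = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)

scores = [0.1, 0.7, 0.4, 0.9, 0.3, 0.2]     # hypothetical model outputs
y_true = [0, 1, 0, 1, 1, 0]                 # hypothetical true labels
y_pred = classify(scores)

print(confusion_counts(y_true, y_pred))     # (2, 3, 0, 1)
print(accuracy(y_true, y_pred))
```

In this toy example one warned sample (score 0.3) is predicted as 0, which is the kind of missed warning that the confusion matrix exposes and a single accuracy figure hides.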
The confusion matrices of LTBoost and the four comparison models are shown in the following figure:
It can be seen from the confusion matrices in Figure 6 and Figure 7 that LTBoost misclassified only one sample, predicting a sample originally “1” as “0”, while CNN, LSTM and CNN + LSTM each predicted two samples originally “1” as “0”, and the XGBoost model predicted one sample originally “0” as “1” and one sample originally “1” as “0”. It can be seen from Table 8 that the accuracy of the LTBoost model is 0.7% higher than that of the other four comparison models. LTBoost therefore has the best prediction effect.
In order to make the experimental results more convincing, we also incorporated 5-fold cross-validation in the experiment.
As can be seen from Table 9, after 5-fold cross-validation, the average RMSE of the LTBoost model is 12.06%, 12.06%, 9.05% and 14.73% lower than that of the CNN model, LSTM model, XGBoost model and CNN + LSTM model, respectively. This further indicates that LTBoost has the best prediction effect.
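A 5-fold cross-validation loop of the kind used here can be sketched with plain NumPy. The shuffled, unstratified folds and the `mean_baseline` model are assumptions made for the example; the paper does not state how the folds were formed, and with a class ratio as skewed as 8:251 a stratified split may be preferable in practice.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k shuffled, non-overlapping folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

def cross_validated_rmse(fit_predict, X, y, k=5):
    """Average the test RMSE of fit_predict(X_train, y_train, X_test) over k folds."""
    rmses = []
    for train, test in k_fold_indices(len(X), k):
        pred = fit_predict(X[train], y[train], X[test])
        rmses.append(np.sqrt(np.mean((pred - y[test]) ** 2)))
    return float(np.mean(rmses))

def mean_baseline(X_train, y_train, X_test):
    """Hypothetical stand-in model: always predict the training mean."""
    return np.full(len(X_test), y_train.mean())

# Placeholder data with the same shape and class counts as the dataset described above.
X = np.zeros((518, 19))
y = np.zeros(518)
y[:16] = 1.0

fold_sizes = [len(test) for _, test in k_fold_indices(len(X))]
print(fold_sizes)  # [104, 104, 104, 103, 103]
print(cross_validated_rmse(mean_baseline, X, y))
```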
The academic early warning data has a strong temporal correlation. The earlier features in the time series are mostly general education courses, and their values may affect the later features, which correspond to professional courses. As the prediction results above show, the LSTM, CNN and CNN + LSTM models cannot capture the preceding and following context simultaneously, and their prediction performance is comparatively poor, indicating that underfitting has occurred. The XGBoost model may also underfit, because data from much earlier semesters has only a weak connection with the prediction at the current moment.
The LTBoost model incorporates the BiLSTM module and the Enhanced Transformer module. The BiLSTM module uses context information to extract temporal patterns, which improves the prediction accuracy. In addition, the Enhanced Transformer module enriches the feature information available for prediction, bringing the prediction results closer to the true values and further improving the accuracy of predictions.