A Semi-Supervised Stacked Autoencoder Using the Pseudo Label for Classification Tasks

The efficiency and cognitive limitations of manual sample labeling result in large numbers of unlabeled training samples in practical applications. Making full use of both labeled and unlabeled samples is the key to solving the semi-supervised problem. However, as a supervised algorithm, the stacked autoencoder (SAE) only considers labeled samples and is difficult to apply to semi-supervised problems. Thus, by introducing the pseudo-labeling method into the SAE, a novel pseudo label-based semi-supervised stacked autoencoder (PL-SSAE) is proposed to address semi-supervised classification tasks. The PL-SSAE first performs unsupervised pre-training on all samples with the autoencoder (AE) to initialize the network parameters. Then, by iteratively fine-tuning the network parameters on the labeled samples, the unlabeled samples are classified and their pseudo labels are generated. Finally, the pseudo-labeled samples are used to construct a regularization term and fine-tune the network parameters to complete the training of the PL-SSAE. Different from the traditional SAE, the PL-SSAE uses all samples in pre-training and the unlabeled samples with pseudo labels in fine-tuning to fully exploit the feature and category information of the unlabeled samples. Empirical evaluations on various benchmark datasets show that the semi-supervised performance of the PL-SSAE is more competitive than that of the SAE, the stacked sparse autoencoder (SSAE), the semi-supervised stacked autoencoder (Semi-SAE) and the semi-supervised stacked sparse autoencoder (Semi-SSAE).


Introduction
Deep learning has been a focus of machine learning research since it was proposed by Hinton et al. [1]. As a typical deep learning algorithm, the stacked autoencoder (SAE) [2] extracts hierarchical abstract features from samples with the autoencoder (AE) and then maps the abstract features to the output with a classifier or regression algorithm. Compared with traditional neural networks, the multilayer structure of the SAE provides a strong feature extraction capability, avoiding the limitations of traditional machine learning algorithms in manual feature selection [3]. Meanwhile, the greedy layer-wise training of the SAE determines the network parameters layer by layer and accelerates the convergence speed [4]. By virtue of this excellent performance, the SAE has been applied to mechanical fault diagnosis [5,6], disease association prediction [7,8] and network intrusion detection [9,10].
The SAE has been extensively studied, and many improvements have been introduced into it. Vincent et al. [11] combined the SAE with a local denoising criterion and proposed the stacked denoising autoencoder (SDAE). Different from the SAE, the SDAE employs noise-corrupted samples to reconstruct noise-free samples, which enhances the robustness of the abstract features. To obtain a sparse feature representation, Ng et al. [12] integrated a sparsity constraint into the SAE and proposed the stacked sparse autoencoder (SSAE). The SSAE reduces the activation of the hidden nodes and uses few network nodes to extract representative abstract features. Masci et al. [13] proposed the stacked convolutional autoencoder (SCAE) by replacing the fully connected layers with convolutional and pooling layers to preserve the spatial information of the training images. By introducing the attention mechanism into the SAE, Tang et al. [14] constructed the stacked attention autoencoder (SAAE) to improve the feature extraction capability. Tawfik et al. [15] utilized the SAE to extract unsupervised features and merge multimodal medical images. In addition, many other methods [16-19] have been proposed for the development and application of the SAE.
However, manually labeling large numbers of samples is impractical due to limited knowledge and efficiency. In many fields, such as speech emotion recognition [20], medical image classification [21] and remote sensing image detection [22], the raw training samples are usually only partially labeled, while the majority of the samples are unlabeled. The supervised learning of the SAE requires sample labels to train the network and cannot exploit the feature and category information contained in unlabeled samples, making it difficult to improve its generalization performance on semi-supervised classification tasks. To tackle this problem, some studies in recent years have combined the SAE with semi-supervised learning. For the classification of partially labeled network traffic samples, Aouedi et al. [23] proposed the semi-supervised stacked autoencoder (Semi-SAE) to realize semi-supervised learning with the SAE. This method performs unsupervised feature extraction on all samples in the pre-training stage and fine-tunes the network parameters based on the classification loss of the labeled samples. By introducing the sparsity criterion into the Semi-SAE, Xiao et al. [24] proposed the semi-supervised stacked sparse autoencoder (Semi-SSAE). The Kullback-Leibler (KL) divergence regularization term added to the loss function improves the sparsity of the network parameters, and the Semi-SSAE has been applied to cancer prediction. These improved SAE algorithms use only part of the information from the unlabeled samples, in the feature extraction stage, and have a limited generalization performance on semi-supervised classification tasks.
The pseudo label [25] is a simple and efficient method for implementing semi-supervised learning. It utilizes labeled samples to predict the class of unlabeled samples and integrates labeled and pseudo-labeled samples to train the network. Semi-supervised learning methods based on the pseudo label have gradually been applied to automatic speech recognition [26] and image semantic segmentation [27]. To overcome the limitations of the traditional supervised SAE and to improve the generalization performance, the pseudo label-based semi-supervised stacked autoencoder (PL-SSAE) is proposed by combining the SAE with the pseudo label. The PL-SSAE first stacks AEs to extract the feature information in all samples through layer-wise pre-training. Then, supervised classification and iterative fine-tuning on the labeled samples are used for the class prediction of the unlabeled samples. Finally, the pseudo-label regularization term is constructed, and the labeled and pseudo-labeled samples are integrated to complete the training of the network. Different from the SAE and Semi-SAE, the PL-SSAE is able to exploit both the feature information of the unlabeled samples for feature extraction and their category information for classification and fine-tuning, aiming to improve its semi-supervised learning performance. To the best of our knowledge, the PL-SSAE is the first attempt to introduce the pseudo label into the SAE, and it extends the implementation methods of the semi-supervised SAE.
The research contributions of this study can be summarized as follows:

• A new semi-supervised SAE named the PL-SSAE is proposed. By integrating the pseudo label with the SAE, the pseudo labels of the unlabeled samples are generated, and the category information in the unlabeled samples is effectively exploited to improve the generalization performance of the PL-SSAE. The experimental results on various benchmark datasets show that the semi-supervised classification performance of the PL-SSAE outperforms that of the SAE, SSAE, Semi-SAE and Semi-SSAE.

• The pseudo-label regularization term is constructed. The pseudo-label regularization term represents the classification loss of the pseudo-labeled samples, and it is added to the loss function to control the loss balance between the labeled and pseudo-labeled samples and to prevent over-fitting.

The rest of this study is organized as follows. In Section 2, a brief introduction to the AE and SAE is given. In Section 3, the network structure and training process of the proposed PL-SSAE are detailed. In Section 4, the evaluation implementation and the results on benchmark datasets are presented. In Section 5, the conclusion of this study is summarized.

Autoencoder
The AE is an unsupervised algorithm and consists of an encoder and a decoder. The encoder maps the input to an abstract representation, and the decoder maps the abstract representation to the output. The network structure of the AE is shown in Figure 1.
For the input samples X = {x_i}_{i=1}^N, the AE encodes the samples using a linear mapping and a non-linear activation function:

H = g(W_e X + b_e)   (1)

where W_e is the weight matrix between the input layer and the hidden layer, b_e is the bias of the hidden layer and g(·) is the activation function. The decoder completes the decoding of the abstract feature to obtain the reconstructed samples:

\hat{X} = g(W_d H + b_d)   (2)

where W_d is the weight matrix between the hidden layer and the output layer and b_d is the bias of the output layer. The AE requires an optimization algorithm to fine-tune the network parameters. The reconstruction error is minimized to learn representative abstract features in the samples. The loss function of the AE is formulated as follows:

J_{AE} = \frac{1}{N} \sum_{i=1}^{N} \| x_i - \hat{x}_i \|^2   (3)
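The encoding, decoding and reconstruction loss of Equations (1)-(3) can be sketched numerically. The following is a minimal NumPy illustration, not the paper's PyTorch implementation; the sigmoid activation and the random weight initialization are assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
g = lambda z: 1.0 / (1.0 + np.exp(-z))        # assumed sigmoid activation g(.)

N, d, L = 5, 4, 3                             # samples, input dim, hidden dim
X = rng.random((N, d))
W_e, b_e = rng.normal(0, 0.1, (d, L)), np.zeros(L)   # encoder parameters
W_d, b_d = rng.normal(0, 0.1, (L, d)), np.zeros(d)   # decoder parameters

H = g(X @ W_e + b_e)                          # Equation (1): abstract representation
X_hat = g(H @ W_d + b_d)                      # Equation (2): reconstructed samples
J_AE = ((X - X_hat) ** 2).sum(axis=1).mean()  # Equation (3): mean reconstruction error
```

In practice, an optimizer then adjusts W_e, b_e, W_d and b_d to minimize J_AE.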

Stacked Autoencoder
The SAE is a supervised algorithm and consists of stacked AEs and a classifier. The AEs extract hierarchical features layer by layer, and the classifier maps the final abstract feature to the output. The SAE usually has a symmetric structure that includes encoding and decoding. However, the decoding process is often removed, and the final feature representation is used for classification and regression tasks. Suppose that the training samples are {X, Y} = {(x_i, y_i)}_{i=1}^N, the network structure is d-L_1-L_2-...-L_k-m and the activation function is g(·). The network structure of the SAE is shown in Figure 2.
In pre-training, the output H_i of the ith hidden layer is the input of the (i+1)th AE, and the input weights W_{i+1} and bias b_{i+1} of the (i+1)th hidden layer can be obtained from the trained AE. H_k is the final extracted feature, and it is used as the input of the classifier to compute the output weights W and complete the classification mapping.

In fine-tuning, the SAE calculates the classification error of the training samples and backpropagates the error to optimize the network parameters using the gradient descent algorithm. When the cross-entropy error is used, the network loss function of the SAE is expressed as follows:

J_{SAE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{m} p_{ic} \log \hat{y}_{ic}   (4)

where \hat{y}_{ic} is the predicted probability that the ith sample belongs to class c, and p_{ic} is the sign function. If the true label of the ith sample is c, p_{ic} = 1; otherwise, p_{ic} = 0.
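As a concrete check of the cross-entropy loss of Equation (4), the small NumPy sketch below evaluates it for two hypothetical samples whose predicted class probabilities are already given:

```python
import numpy as np

def cross_entropy(y_prob, y_true):
    """Equation (4): J = -(1/N) * sum_i sum_c p_ic * log(yhat_ic).
    Since p_ic is 1 only for the true class of sample i, the inner
    sum reduces to the log-probability of the true class."""
    N = len(y_true)
    return -np.log(y_prob[np.arange(N), y_true]).mean()

# Hypothetical softmax outputs for 2 samples over 3 classes.
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
y_true = np.array([0, 1])
loss = cross_entropy(y_prob, y_true)   # -(log 0.7 + log 0.8) / 2
```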

Pseudo Label
The traditional SAE can extract abstract features from labeled samples and complete prediction or classification. However, the supervised learning of the SAE is only applicable to labeled samples, and the unlabeled samples in the training data cannot be effectively utilized. The SAE is unable to use the feature and category information of the unlabeled samples, and this limits its generalization performance on semi-supervised tasks. Therefore, this study proposes a new semi-supervised SAE, the PL-SSAE, by introducing the pseudo-labeling method into the SAE. The PL-SSAE uses unlabeled samples for feature extraction and classification by generating pseudo labels and adding a regularization loss. The PL-SSAE makes full use of the feature and category information contained in unlabeled samples to improve the generalization performance of the SAE on semi-supervised problems. Compared with the SAE, the innovations of the PL-SSAE are the pre-training on the unlabeled samples, the generation of the pseudo labels and the construction of the pseudo-label regularization.
As a new approach to semi-supervised learning, the pseudo-labeling method employs a network trained on labeled samples to predict unlabeled samples. Based on the clustering hypothesis, the most probable predictions are utilized as the pseudo labels of the unlabeled samples, and the network is retrained with the pseudo-labeled samples. Thus, the application of the pseudo-labeling method requires three steps. The first step is to train the network with the labeled samples. The second step is to predict the class of the unlabeled samples and generate the pseudo labels. The third step is to retrain the network with the labeled and pseudo-labeled samples. The pseudo-labeling method covers both the labeling of the unlabeled samples and the semi-supervised training process of the network. Compared with other semi-supervised learning methods, the pseudo-labeling method can effectively exploit the category information contained in the unlabeled samples and improve the semi-supervised prediction and classification performance.
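The three steps can be made concrete with a toy example. The sketch below uses a hypothetical bias-free softmax-regression classifier in NumPy rather than the PL-SSAE itself; it trains on a small labeled subset, generates argmax pseudo labels for the rest, and retrains on both:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-class data: two well-separated Gaussian blobs.
X0 = rng.normal(loc=-2.0, scale=0.5, size=(100, 2))
X1 = rng.normal(loc=+2.0, scale=0.5, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

# Only 10% of the samples keep their labels.
idx = rng.permutation(200)
Xl, yl, Xu = X[idx[:20]], y[idx[:20]], X[idx[20:]]

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def one_hot(y, m=2):
    P = np.zeros((len(y), m))
    P[np.arange(len(y)), y] = 1.0
    return P

def fit(W, Xl, Pl, Xu=None, Pu=None, lam=0.0, lr=0.5, epochs=300):
    """Gradient descent on the cross-entropy loss; the optional second
    term is the pseudo-label loss weighted by lam."""
    for _ in range(epochs):
        grad = Xl.T @ (softmax(Xl @ W) - Pl) / len(Xl)
        if Xu is not None:
            grad = grad + lam * Xu.T @ (softmax(Xu @ W) - Pu) / len(Xu)
        W = W - lr * grad
    return W

# Step 1: train the classifier on the labeled subset.
W = fit(np.zeros((2, 2)), Xl, one_hot(yl))
# Step 2: predict the unlabeled samples and take the argmax as pseudo labels.
pseudo = softmax(Xu @ W).argmax(axis=1)
# Step 3: retrain on labeled plus pseudo-labeled samples.
W = fit(W, Xl, one_hot(yl), Xu, one_hot(pseudo), lam=0.5)

acc = (softmax(X @ W).argmax(axis=1) == y).mean()
```

With well-separated blobs, the pseudo labels are almost all correct, so retraining on them sharpens the decision boundary rather than corrupting it.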

Network Structure
According to the requirements of the pseudo-labeling method and the SAE, the PL-SSAE divides the first step of the pseudo-labeling method into unsupervised pre-training and supervised fine-tuning. Suppose that the labeled samples are {X_l, Y_l}, the unlabeled samples are X_u, the network structure is d-L_1-L_2-...-L_k-m and the activation function is g(·). The network framework of the PL-SSAE is shown in Figure 3. The PL-SSAE consists of four stages: unsupervised pre-training, supervised fine-tuning, pseudo-label generation and semi-supervised fine-tuning.
In the unsupervised pre-training, similar to the SAE, the PL-SSAE trains the AEs to assign the network parameters layer by layer. However, unlike the SAE, the PL-SSAE requires both labeled and unlabeled samples in pre-training to fully exploit the feature information contained in the unlabeled samples. Meanwhile, using all samples at this stage avoids the repeated pre-training of pseudo-labeled samples and reduces the computational complexity. The relationship between the output of the ith hidden layer and the output of the (i+1)th hidden layer is expressed as follows:

H_{i+1} = g(W_{i+1} H_i + b_{i+1}), \quad H_0 = X   (5)

where W_i and b_i are the input weights and bias of the ith hidden layer, respectively, and X = {X_l, X_u} is the set of labeled and unlabeled samples. Through the greedy layer-wise pre-training, the PL-SSAE obtains the connection weights and biases of all hidden layers to achieve the assignment of the network parameters.

In the supervised fine-tuning, the PL-SSAE calculates the classification loss of the labeled samples and optimizes the parameters of the pre-trained network. For the labeled samples X_l, the PL-SSAE obtains their predicted labels \hat{Y}_l through feature extraction and classification. The classification loss between the predicted labels and the true labels is calculated by Formula (4), and the connection weights {W_i}_{i=1}^k and biases {b_i}_{i=1}^k of each hidden layer are adjusted by the stochastic gradient descent algorithm to determine the mapping function. The mapping function from the samples to the labels is formulated as follows:

\hat{Y} = f(X) = \mathrm{softmax}(W H_k)   (6)

where H_k is the final feature extracted from X by Equation (5) and W is the output weight matrix of the classifier.

In the pseudo-label generation, the PL-SSAE predicts the labels and determines the pseudo labels of the unlabeled samples with the supervised trained network. For the unlabeled samples {x_i}_{i=1}^{N_u}, their prediction probabilities on the different classes {\hat{y}_i^j}_{j=1}^m are calculated through forward propagation and label mapping. The label with the highest prediction probability is taken as the pseudo label of each unlabeled sample by the following formula:

y_i^u = \arg\max_{j \in \{1,\dots,m\}} \hat{y}_i^j   (7)

In the semi-supervised fine-tuning, the PL-SSAE inputs the labeled samples {X_l, Y_l} and the pseudo-labeled samples {X_u, Y_u} into the network and computes the classification loss to optimize the network parameters. Since the pseudo labels are not necessarily the true labels of the unlabeled samples, the PL-SSAE introduces a regularization parameter λ to keep the loss balance between the labeled and pseudo-labeled samples. Using the cross-entropy function as the measure of the classification loss, the classification loss of the labeled samples J_l, the classification loss of the pseudo-labeled samples J_u and the total loss of the network J_{PL-SSAE} are expressed as follows:

J_l = -\frac{1}{N_l} \sum_{i=1}^{N_l} \sum_{c=1}^{m} p_{ic}^l \log \hat{y}_{ic}^l   (8)

J_u = -\frac{1}{N_u} \sum_{j=1}^{N_u} \sum_{c=1}^{m} p_{jc}^u \log \hat{y}_{jc}^u   (9)

J_{PL-SSAE} = J_l + \lambda J_u   (10)

where N_l and N_u are the numbers of labeled and pseudo-labeled samples, respectively; \hat{y}_{ic}^l and \hat{y}_{jc}^u are the predicted probabilities that the ith labeled sample and the jth pseudo-labeled sample belong to class c; and p_{ic}^l and p_{jc}^u are both sign functions. If the true label of the ith labeled sample is c, p_{ic}^l = 1; otherwise, p_{ic}^l = 0. If the pseudo label of the jth pseudo-labeled sample is c, p_{jc}^u = 1; otherwise, p_{jc}^u = 0. The classification loss of the pseudo-labeled samples serves as the regularization term in the loss function to prevent over-fitting. By optimizing the network parameters, the network loss is gradually reduced, and the PL-SSAE classifies the labeled and pseudo-labeled samples more accurately.
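To illustrate how the regularization parameter weighs the two losses, the NumPy fragment below evaluates Equations (8)-(10) on hypothetical predicted probabilities for two labeled and three pseudo-labeled samples; the probability values are invented for the example:

```python
import numpy as np

def ce(P, probs):
    """Cross-entropy between one-hot targets P and predicted probabilities."""
    return -(P * np.log(probs)).sum(axis=1).mean()

# Hypothetical predictions for 2 labeled samples (true one-hot labels P_l)
# and 3 pseudo-labeled samples (argmax pseudo labels P_u).
probs_l = np.array([[0.9, 0.1], [0.2, 0.8]])
P_l     = np.array([[1.0, 0.0], [0.0, 1.0]])
probs_u = np.array([[0.6, 0.4], [0.3, 0.7], [0.8, 0.2]])
P_u     = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])

lam = 0.5                 # regularization parameter lambda
J_l = ce(P_l, probs_l)    # Equation (8): labeled loss
J_u = ce(P_u, probs_u)    # Equation (9): pseudo-label loss
J = J_l + lam * J_u       # Equation (10): total loss
```

Setting lam = 0 recovers purely supervised fine-tuning, while larger values let the pseudo-labeled samples dominate the gradient.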

Training Process
According to the network framework, the training process of the PL-SSAE consists of four stages: unsupervised pre-training, supervised fine-tuning, pseudo-label generation and semi-supervised fine-tuning. In the unsupervised pre-training stage, the network parameters are initialized by greedy layer-wise training on the labeled and unlabeled samples. In the supervised fine-tuning stage, the classification loss of the labeled samples is calculated, and the network parameters are optimized by the stochastic gradient descent algorithm. In the pseudo-label generation stage, the trained network predicts the class of the unlabeled samples and assigns pseudo labels to them. In the semi-supervised fine-tuning stage, the classification loss of the labeled and pseudo-labeled samples is computed to adjust the network parameters and complete the network training. Algorithm 1 presents the training details of the PL-SSAE.

Algorithm 1. The training process of the PL-SSAE.
Input: The labeled samples {X_l, Y_l}, the unlabeled samples X_u, the numbers of hidden nodes {L_i}_{i=1}^k, the regularization parameter λ, the mini-batch size s, the number of iterations t, the learning rate α and the activation function g(·)
Output: The mapping function f : R^d → R^m
The unsupervised pre-training
1: Let X = {X_l, X_u} be the set of all training samples
2: for i = 1 to k do
3:   if i = 1, take X as the input and output of the first AE
4:   else
5:   take H_{i−1} as the input and output of the ith AE
6:   Randomly initialize the network parameters of the ith AE
7:   for j = 1 to t do
8:     Obtain mini-batch samples {x_r}_{r=1}^s from the input samples
9:     Compute the hidden output H_i of the AE by Equation (1)
10:    Calculate the reconstructed samples {\hat{x}_r}_{r=1}^s by Equation (2)
11:    Compute the reconstruction loss of the AE by Equation (3)
12:    Update the network parameters based on the stochastic gradient descent algorithm
13:  end for
14:  Assign the network parameters {W_i, b_i} of the ith AE to the ith hidden layer
15:  Calculate the output H_i of the ith hidden layer by Equation (5)
16: end for
The supervised fine-tuning
17: Input the labeled samples {X_l, Y_l} into the network
18: for j = 1 to t do
19:  Obtain mini-batch samples {x_r}_{r=1}^s from the input samples
20:  Predict the labels {\hat{y}_r}_{r=1}^s of the mini-batch samples by Equation (6)
21:  Calculate the classification loss by Equation (4)
22:  Update the network parameters {W_i, b_i}_{i=1}^k and W based on the stochastic gradient descent algorithm
23: end for
The pseudo-label generation
24: Input the unlabeled samples X_u into the network
25: Compute the class prediction of the unlabeled samples by Equation (6)
26: Generate the pseudo labels Y_u of the unlabeled samples by Equation (7)
The semi-supervised fine-tuning
27: Input the labeled samples {X_l, Y_l} and the pseudo-labeled samples {X_u, Y_u} into the network
28: for j = 1 to t do
29:  Obtain mini-batch samples {x_r}_{r=1}^s from the input samples
30:  Compute the class prediction of the input samples by Equation (6)
31:  Calculate the total classification loss J_{PL-SSAE} by Equations (8)-(10)
32:  Update the network parameters {W_i, b_i}_{i=1}^k and W based on the stochastic gradient descent algorithm
33: end for
34: return the mapping function f : R^d → R^m
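The unsupervised pre-training stage (steps 1-16 of Algorithm 1) can be sketched as a greedy layer-wise loop. The NumPy fragment below trains one assumed sigmoid AE per hidden layer by plain full-batch gradient descent and feeds each layer's codes to the next AE; it is a simplified stand-in for the mini-batch SGD used in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_ae(X, n_hidden, lr=0.1, epochs=200):
    """Train one AE on X; return the encoder parameters and the codes."""
    d, N = X.shape[1], len(X)
    We = rng.normal(0, 0.1, (d, n_hidden)); be = np.zeros(n_hidden)
    Wd = rng.normal(0, 0.1, (n_hidden, d)); bd = np.zeros(d)
    for _ in range(epochs):
        H = sigmoid(X @ We + be)            # Equation (1)
        Xr = sigmoid(H @ Wd + bd)           # Equation (2)
        # Backprop of the (halved) mean squared reconstruction error, Equation (3).
        dXr = (Xr - X) * Xr * (1 - Xr) / N
        dH = (dXr @ Wd.T) * H * (1 - H)
        Wd -= lr * H.T @ dXr; bd -= lr * dXr.sum(0)
        We -= lr * X.T @ dH;  be -= lr * dH.sum(0)
    return We, be, sigmoid(X @ We + be)

# Greedy layer-wise pre-training on all (labeled + unlabeled) samples.
X = rng.random((50, 8))
params, H = [], X
for L in (6, 4):                 # assumed hidden layer sizes L1, L2
    We, be, H = train_ae(H, L)   # the ith layer's codes feed the (i+1)th AE
    params.append((We, be))
```

After this loop, `params` holds the initial {W_i, b_i} of each hidden layer and `H` is the final feature, ready for the supervised fine-tuning stage.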

Experiments
To verify the semi-supervised classification performance of the proposed PL-SSAE, the following evaluations were designed and carried out. The benchmark datasets used in the evaluations are Rectangles, Convex, USPS [28], MNIST [29] and Fashion-MNIST [30]. The datasets are taken from the UCI Machine Learning Repository [31] and have been normalized to [0, 1]. Details of the benchmark datasets are shown in Table 1. All evaluations were carried out in PyTorch 1.9, running on a desktop with a 3.6 GHz Intel 12700K CPU, an Nvidia RTX 3090 graphics card, 32 GB of RAM and a 2 TB hard disk. To avoid uncertainty and ensure a fair comparison, all reported results are the averages of 20 repeated experiments, and the same network structure is utilized for the different algorithms. The network structure of each algorithm used in Experiments 2 and 3 is shown in Table 2. The experimental details of Experiment 1 are as follows: The dataset is MNIST, the batch size is 100, the number of iterations is 100, the learning rate is 0.01 and the activation function is the ReLU function. Suppose that the parameter p represents the percentage of labeled samples in the training data. When changing the regularization parameter and the percentage of labeled samples, the network structure is 784-300-200-100-10, the range of the regularization parameter is λ ∈ {0, 0.1, 0.2, ..., 1} and the range of the label percentage is p ∈ {5, 10, 15, ..., 50}. When changing the number of hidden nodes, the network structure is 784-L_1-L_2-10, the ranges of L_1 and L_2 are L_1, L_2 ∈ {100, 200, ..., 1000}, the regularization parameter is λ = 0.5 and the label percentage is p = 20.
The experimental details of Experiment 2 are as follows: The datasets are Convex, USPS, MNIST and Fashion-MNIST. The batch size is 100, the number of iterations is 100, the learning rate is 0.01, the activation function is the ReLU function and the range of the label percentage is p ∈ {5, 10, 15, ..., 50}. The sparsity parameter of the SSAE and Semi-SSAE is ρ = 0.05, and the regularization parameter of the PL-SSAE is λ = 0.5.
The experimental details of Experiment 3 are as follows: The batch size is 100, the number of iterations is 100, the learning rate is 0.01, the activation function is the ReLU function and the range of the label percentage is p ∈ {5, 10, 15, 20}. The sparsity parameter of the SSAE and Semi-SSAE is ρ = 0.05, and the regularization parameter of the PL-SSAE is λ = 0.5. For multiclass classification tasks, the precision, F1-measure and G-mean are averaged over the different classes.

Influence of Different Hyperparameters
As predetermined parameters of the network, the hyperparameters affect the semi-supervised learning and classification performance of the PL-SSAE. The regularization parameter, the percentage of labeled samples and the number of hidden nodes are important hyperparameters of the PL-SSAE. The regularization parameter controls the balance between the empirical loss and the regularization loss. The percentage of labeled samples determines the numbers of labeled and pseudo-labeled samples. The number of hidden nodes controls the structural complexity and fitting ability of the network. To analyze the specific influence of the different hyperparameters, a variable regularization parameter, a variable percentage of labeled samples and a variable number of hidden nodes are utilized to observe the change in the accuracy of the PL-SSAE. The generalization performance of the PL-SSAE with different regularization parameters and label percentages is shown in Figure 4. The generalization performance and training time of the PL-SSAE with different numbers of hidden nodes are shown in Figure 5.
As shown in Figure 4, the semi-supervised classification performance of the PL-SSAE varies with the regularization parameter and the percentage of labeled samples. When the label percentage p is fixed, the classification accuracy of the PL-SSAE first increases and then decreases as the regularization parameter λ increases. When the regularization parameter λ is fixed, the classification accuracy increases as the label percentage p increases. This is because the regularization parameter λ controls the importance of the pseudo-label loss in the loss function. A proper regularization parameter λ allows the PL-SSAE to exploit the feature and category information contained in unlabeled samples to improve its semi-supervised learning. However, an excessively large λ will cause the PL-SSAE to ignore the labeled samples, and the difference between the pseudo labels and the true labels will lead to under-fitting. Therefore, it is important to choose an appropriate regularization parameter for different samples. However, the trial-and-error method used for regularization parameter selection in the PL-SSAE is time-consuming and inefficient. Meanwhile, the labeled samples are the prior knowledge of the network. With an increase in the label percentage p, the number of labeled samples in the training data increases, and more category information improves the generalization performance of the network.
As shown in Figure 5, the classification accuracy and training time of the PL-SSAE vary with the number of hidden nodes. As the number of hidden nodes increases, the generalization performance of the PL-SSAE first increases and then decreases. The reason is that the hidden nodes control the function approximation ability of the network. As the number of hidden nodes increases, the generated pseudo labels become closer to the true labels, and more of the category information contained in the pseudo-labeled samples improves the semi-supervised learning of the PL-SSAE. However, too many hidden nodes will lead to over-fitting of the network, and the difference between the training and testing samples will cause the classification accuracy to decrease. In addition, the training time of the PL-SSAE increases with the number of hidden nodes. This is because the number of hidden nodes is positively correlated with the computational complexity of the network. When the computational power is fixed, an increase in the computational complexity leads to an increase in the training time.

Comparison of Semi-Supervised Classification
The semi-supervised classification performance is a direct reflection of the ability to learn from unlabeled training samples. To evaluate the semi-supervised classification performance of the different algorithms, it is necessary to adopt different percentages of labeled samples, record the resulting accuracy on the testing samples and plot the accuracy curves. The experiment in this section focuses on comparing the PL-SSAE with the SAE, SSAE, Semi-SAE and Semi-SSAE. The variation in the classification accuracy of each algorithm on the datasets with different label percentages is shown in Figure 6.
As shown in Figure 6, the semi-supervised classification performance of the PL-SSAE outperforms that of the SAE, SSAE, Semi-SAE and Semi-SSAE on the different datasets. As the label percentage increases, the number of labeled training samples increases. Thus, more label information is exploited to learn the function mapping, and the generalization performance of each algorithm gradually increases. The classification accuracy of the PL-SSAE is higher than that of the other algorithms at the different label percentages. The reason is that the PL-SSAE is an effective semi-supervised algorithm. Compared with the supervised SAE and SSAE, the PL-SSAE uses the feature and category information of the unlabeled samples to make the learned mapping function closer to the real mapping. Compared with the Semi-SAE and Semi-SSAE, the PL-SSAE not only utilizes the unlabeled samples for feature extraction but also exploits the pseudo-label information for the classification mapping. The advantage of the PL-SSAE in semi-supervised classification becomes more apparent when the percentage of labeled samples is small. However, when there are sufficient labeled samples, the PL-SSAE tends to have no performance advantage, and the inconsistency between the pseudo labels and the true labels will reduce its generalization performance.

Comparison of Comprehensive Performance
To test the comprehensive performance of the PL-SSAE, all benchmark datasets mentioned above are used to compare the PL-SSAE with the SAE, SSAE, Semi-SAE and Semi-SSAE. Different metrics, such as accuracy, precision, F1-measure and G-mean, of each algorithm with different label percentages are recorded to evaluate the semi-supervised performance. The training and testing times of each algorithm are recorded to compare the computational complexity. The classification accuracy, precision, F1-measure, G-mean, training time and testing time of each algorithm are shown in Tables 3-8, respectively (the numbers in bold indicate the best results). Since the experimental results are the averages of repeated experiments, the standard deviation of each result is listed after the average to reflect the performance stability of the algorithm. As shown in Tables 3-6, the comprehensive performance of the PL-SSAE is better than that of the SAE, SSAE, Semi-SAE and Semi-SSAE. For each dataset, the PL-SSAE achieves higher classification accuracy, precision, F1-measure and G-mean than the other algorithms at different label percentages. The reason is that the SAE and SSAE do not use unlabeled samples in the training process, and the Semi-SAE and Semi-SSAE use unlabeled samples only in the feature extraction process. The PL-SSAE introduces the pseudo label and makes appropriate use of the labeled samples to generate the pseudo labels of the unlabeled samples. The category information contained in the pseudo-labeled samples guides the feature extraction and class mapping of the network, which improves the semi-supervised learning and classification performance of the PL-SSAE. Moreover, the PL-SSAE integrates the pseudo-label regularization into the loss function. The balance between the classification losses of the labeled and pseudo-labeled samples avoids over-fitting and improves the generalization performance.
As shown in Tables 7 and 8, the training time of the PL-SSAE is higher than that of the SAE, SSAE, Semi-SAE and Semi-SSAE, while the testing time of each algorithm is the same. The PL-SSAE requires additional fine-tuning on the pseudo-labeled samples; as a result, its computational complexity and training time are roughly twice those of the other algorithms. However, given the improvement in generalization performance, the increase in training time is worthwhile. As for testing speed, the testing time depends on the sample size and network structure; therefore, different algorithms with the same testing samples and network structure have the same testing speed.

Conclusions
To overcome the limitations of the traditional SAE with unlabeled samples, this study integrates the pseudo label into the SAE and proposes a new semi-supervised SAE called the PL-SSAE. The PL-SSAE assigns pseudo labels to the unlabeled samples using the network trained on the labeled samples and adds a pseudo-label regularization term to the loss function. Different from the SAE, the PL-SSAE exploits the feature and category information contained in the unlabeled samples to guide the feature extraction and classification of the network. Evaluations on various datasets show that the PL-SSAE outperforms the SAE, SSAE, Semi-SAE and Semi-SSAE.
However, the hyperparameters of the PL-SSAE in this study are determined by a time-consuming trial-and-error method. Thus, it is worth combining the PL-SSAE with the particle swarm optimization algorithm [32] or the ant colony algorithm [33] to achieve automatic hyperparameter optimization. In addition, the PL-SSAE determines the pseudo labels only by taking the maximum of the prediction probabilities, which tends to introduce noise. Therefore, more effective methods for generating reliable pseudo labels need to be investigated.

Figure 1. Network structure of the AE.

The network structure of the SAE is shown in Figure 2.

Figure 2. Network structure of the SAE.
where ŷ_ic is the predicted probability that the i-th sample belongs to class c, and p_ic is the indicator function: if the true label of the i-th sample is c, then p_ic = 1; otherwise, p_ic = 0.

The SAE needs pre-training and fine-tuning to train the network. The pre-training stage determines the initial network parameters and extracts abstract features through the greedy layer-wise training of the AE. The fine-tuning stage computes the classification error and optimizes the network parameters. In pre-training, the output H_i of the i-th hidden layer is the input of the AE, and the input weights W_{i+1} and bias b_{i+1} of the (i + 1)-th hidden layer are obtained from the trained AE. H_k is the final extracted feature, and it is used as the input of the classifier to compute the output weights W and complete the classification mapping. In fine-tuning, the SAE calculates the classification error of the training samples and backpropagates the error to optimize the network parameters using the gradient descent algorithm. When the cross-entropy error is used, the network loss function of the SAE is expressed as follows:

E = -(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} p_ic ln ŷ_ic

Algorithm 1: Training process of the PL-SSAE.
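A minimal sketch of the fine-tuning objective implied by Algorithm 1, assuming an additive pseudo-label regularization term weighted by λ (the function names here are illustrative, not from the paper):

```python
import numpy as np

def cross_entropy(y_hat, labels):
    # Mean negative log-probability of the correct class.
    n = len(labels)
    return -np.log(y_hat[np.arange(n), labels]).mean()

def pl_ssae_loss(y_hat_l, y_l, y_hat_u, y_pseudo, lam=0.5):
    """Fine-tuning loss sketch: classification loss on the labeled samples
    plus a lambda-weighted pseudo-label term on the unlabeled samples."""
    return cross_entropy(y_hat_l, y_l) + lam * cross_entropy(y_hat_u, y_pseudo)

# One labeled sample (true class 0) and one pseudo-labeled sample (pseudo class 1).
y_hat_l = np.array([[0.8, 0.2]])
y_hat_u = np.array([[0.3, 0.7]])
loss = pl_ssae_loss(y_hat_l, np.array([0]), y_hat_u, np.array([1]), lam=0.5)
```

The weight λ balances the two losses; λ = 0.5 matches the regularization parameter used in the experiments, and λ = 0 recovers purely supervised fine-tuning.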

Experiment 1: Influence of different hyperparameters. Observe the accuracy change in the PL-SSAE with a variable regularization parameter, variable percentage of labeled samples and variable number of hidden nodes, then analyze their influence on the classification performance of the PL-SSAE.
Experiment 2: Comparison of semi-supervised classification. Record the classification accuracy of the SAE, SSAE, Semi-SAE, Semi-SSAE and PL-SSAE with different percentages of labeled samples and compare the semi-supervised learning capability of the different algorithms.
Experiment 3: Comparison of comprehensive performance. Observe the accuracy, precision, F1-measure, G-mean, training time and testing time of the SAE, SSAE, Semi-SAE, Semi-SSAE and PL-SSAE to compare their generalization performance and computational complexity.
The regularization parameter of the PL-SSAE is λ = 0.5. For multiclass classification tasks, the precision, F1-measure and G-mean are the averages over the different classes.

Figure 4. The influence of the regularization parameter and label percentage on the generalization performance.

Figure 5. The influence of the hidden nodes on (a) accuracy and (b) training time.

Table 1. The datasets used in the experiments.
Results in bold are better than those of the other algorithms.