The Image Classification Method with CNN-XGBoost Model Based on Adaptive Particle Swarm Optimization

Abstract: CNN is particularly effective in extracting spatial features. However, the single-layer classifier constructed from the activation function in CNN is easily disturbed by image noise, which reduces classification accuracy. To solve this problem, the advanced ensemble model XGBoost is used to overcome the deficiency of a single classifier when classifying image features. To further distinguish the extracted image features, a CNN-XGBoost image classification model optimized by APSO is proposed, in which APSO optimizes the hyper-parameters of the overall architecture to promote the fusion of the two-stage model. The model is composed of two parts: the feature extractor CNN, which automatically extracts spatial features from images, and the feature classifier XGBoost, which classifies the features extracted after convolution. During parameter optimization, to overcome the tendency of the traditional PSO algorithm to fall into a local optimum, the improved APSO guides the particles through the search space with two different strategies, which improves the diversity of the particle population and prevents the algorithm from becoming trapped in local optima. The results on the image sets show that the proposed model achieves better image classification results. Moreover, the APSO-XGBoost model performs well on credit data, which indicates that the model is also well suited to credit scoring.


Introduction
Image classification, a core research topic in image processing, has broad application prospects in areas such as object recognition, content understanding and image matching. Support vector machine (SVM) [1], k-nearest neighbor (KNN) [2] and decision tree (DT) [3] are typical machine learning methods applied in this field, and these studies demonstrate the effectiveness and reliability of machine learning for image classification. In essence, image classification can be viewed as the composition of feature extraction and feature classification. First, the model extracts significant features so that the subsequent classifier can distinguish them better. Second, the classifier accepts the extracted features and identifies them. Feature extraction is therefore an important part of an image classification system: the quality of the extracted features directly affects classification performance. Earlier classification methods did not fully extract the information in image features; once neural networks (NN) were applied to image classification, it quickly became an important research direction in this field. Theoretically, an NN can approximate any complex function and can effectively solve the problem of image feature extraction. Besides image classification, neural networks have made continuous breakthroughs in target detection [4,5], face recognition [6,7] and other fields. Among them, CNN is an efficient neural network model whose convolution kernels play an important role in feature extraction. Image features are extracted automatically by convolution, and the hierarchical structure of CNN can learn high-quality features at each layer.
Although CNN is considered one of the most powerful and effective feature extraction mechanisms, its traditional classifier layer cannot fully exploit the extracted features; as a single classifier, it does not perform well in the face of diverse and complex data features. According to the "No Free Lunch" theorem [8], the prediction accuracy of a single classifier is greatly limited across data with different structures and characteristics. Ensemble learning combines multiple classifiers that process different hypotheses to construct a better hypothesis and obtain excellent predictions. Dietterich [9] explained the success of ensemble learning from three perspectives: statistical, computational and representational. In addition, the bias-variance decomposition has been used to analyze the effectiveness of ensemble learning [10]. Kearns and Valiant [11] showed that weak classifiers can be combined to produce high-precision estimates as long as sufficient data are available. These studies indicate that ensemble learning has a better learning ability than a single classifier. Furthermore, Chen [12] proposed an advanced gradient boosting algorithm, the extreme gradient boosting tree (XGBoost), which has obtained good results in Kaggle data competitions. XGBoost has been widely used in image classification [7,13] and performs well. Ren et al. [14] proposed an image classification method based on CNN and XGBoost. In this model, CNN is used to obtain features from the input, and XGBoost acts as a recognizer to provide more accurate output. The experimental results on MNIST and CIFAR-10 show that this method performs better than other methods, which verifies the effectiveness of combining CNN and XGBoost for image classification.
Good model performance depends on proper hyper-parameter settings. The hyper-parameters directly affect both the structure and the performance of a model, so it is particularly important to tune them appropriately. Generally, models are tuned by hand on the basis of experience, which consumes a great deal of time and computing resources. Hyper-parameter optimization has been used to overcome the shortcomings of manual tuning. Most hyper-parameter optimization is performed in a continuous search space. Particle swarm optimization (PSO), originally proposed by Kennedy and Eberhart [15], is a computational intelligence technique. The original PSO algorithm was mainly designed for optimization in a continuous space, because the quantities describing the particle state and its laws of motion are continuous real numbers. Song and Rama [16] proposed an XGBoost model combined with an improved PSO algorithm to determine the relationship between tensile strength and plasticity and their influencing factors. The experimental results prove the effectiveness and reliability of the method. Le et al. [17] proposed PSO-XGBoost, a model for building thermal load forecasting and control in which PSO optimizes the XGBoost model used as the predictor. The experimental results on building survey data show that the proposed model is the most robust of the compared models (XGBoost, SVM, RF, GP and CART) in terms of the mean absolute percentage error (MAPE), variance accounted for (VAF) and other indicators. These studies prove the effectiveness of PSO in improving the performance of the XGBoost learning algorithm. Therefore, PSO is well suited to hyper-parameter optimization. PSO finds the optimal solution through iteration and has a fast convergence speed. However, its disadvantage is that the particles easily fall into a local optimum, thereby causing premature convergence. In response to this problem, we propose the adaptive PSO (APSO). It uses the idea of clustering to adaptively divide the particle swarm into different populations and guides the populations by applying different update strategies. This enhances the diversity of the particles and helps them escape from local optima. APSO is therefore better suited to parameter optimization, and it improves the prediction accuracy of the model.
Based on the above, we propose a CNN-XGBoost model based on APSO optimization for image classification. CNN is used as a feature extractor to automatically obtain features from the input, and the feature recognizer XGBoost receives the image features and produces the results; the parameter optimizer APSO is applied to optimize the structure of the model so that it matches the features, which allows the model to obtain accurate results.
The contributions of this paper are as follows: Firstly, a novel two-stage fusion image classification model, CNN-XGBoost based on APSO, is proposed. It ensures that CNN extracts image features fully and uses XGBoost to distinguish the features effectively, so as to ensure high image classification accuracy as a whole.
Secondly, a bidirectional optimization structure is adopted: both CNN and XGBoost are optimized by APSO at the same time. On the one hand, optimizing the CNN to extract deep features makes the extracted features more suitable for the decision trees of XGBoost; on the other hand, optimizing XGBoost makes the structure of the model match the extracted features, so that the image features are better understood. Bidirectional optimization maintains the characteristics of the two parts while allowing them to combine more closely, so that the image features are fully extracted and used for classification.
Thirdly, the PSO algorithm is improved based on adaptive subgroup division. Two different learning strategies are adopted to update different types of particles, which enhances the diversity of the particle population and prevents the algorithm from falling into local optima. This improves the adaptive processing capability of the model for image features and increases classification accuracy.
The rest of the paper is organized as follows: Section 1 explains the related work on the methods used. Section 2 introduces the principle of the CNN-XGBoost model based on APSO. Section 3 describes the experimental setup. Section 4 reports the experimental analysis results. Section 5 describes the supplementary experiment in detail. Finally, Section 6 concludes the paper and discusses future work.

Materials and Methods
In this section, we introduce the related content and principles about CNN, XGBoost and parameter optimization.

CNN
CNN was first proposed by Professor Yann LeCun et al. for the recognition and classification of handwritten digit images [18]. The two most important processes in CNN are convolution and down-sampling. Convolution extracts features from the data, while down-sampling reduces the dimension of the data. Compared with other neural networks, CNN has the characteristics of Local Connectivity, Weight Sharing and Pooling.
Local Connectivity is inspired by the spatial structure of images: pixels that are close together are strongly correlated, while the correlation between distant pixels is weak. Local Connectivity is achieved by convolution operations. Each neural unit processes only one part of the image, and the results of the parts are then combined. Local Connectivity is equivalent to constructing a number of spatially localized filters that capture salient features of the input. Convolution thus reduces the computation and the training difficulty of the model.
Weight Sharing is based on a reasonable assumption: "if a batch feature is valid for computation at one space location, it should be valid for computation at other locations". Because a picture has its own inherent characteristics, the statistical characteristics of one region should be roughly the same as those of others. Weight Sharing is realized by having each convolution filter (convolution kernel) apply the same weight matrix at every location. It reduces the number of parameters and the difficulty of calculation.
Pooling is a feature mapping process. In max pooling, the input is divided into a set of non-overlapping rectangles, and the output is the maximum value of each subregion. The pooling layer operates independently on each depth slice of the input and adjusts its spatial size. It gradually reduces the size of the representation to reduce the number of parameters, thereby reducing the memory footprint and the computation and controlling overfitting.
A typical CNN consists of alternating convolution and sub-sampling layers, followed by fully connected layers near the final output layer. It usually adjusts all the filter kernels by the back-propagation algorithm [19], which is based on stochastic gradient descent, to reduce the gap between the network output and the training labels. Overall, the convolution layer obtains local features by connecting to local receptive fields. The pooling layer is a feature mapping layer that performs the pooling operation and completes a secondary feature extraction. Each convolution layer is followed by a pooling layer, and this two-step feature extraction structure gives CNN a strong tolerance to distortion in the input images.
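To make local connectivity, weight sharing and pooling concrete, the following is a minimal sketch in plain Python. The function names `conv2d_valid` and `max_pool2d`, the toy image and the 2 x 2 kernel are illustrative assumptions and are not taken from the paper; as in most CNN implementations, the "convolution" is computed as a cross-correlation (the kernel is not flipped).

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Local connectivity: each output value sees only a kh x kw patch.
            # Weight sharing: the same kernel weights are reused at every (r, c).
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out


def max_pool2d(fmap, size=2):
    """Take the maximum over non-overlapping size x size windows."""
    out = []
    for r in range(0, len(fmap) - size + 1, size):
        row = []
        for c in range(0, len(fmap[0]) - size + 1, size):
            row.append(max(fmap[r + i][c + j]
                           for i in range(size) for j in range(size)))
        out.append(row)
    return out


# Toy 5 x 5 "image" and a 2 x 2 diagonal-difference kernel (illustrative values).
image = [[1, 2, 0, 1, 3],
         [0, 1, 2, 3, 1],
         [1, 0, 1, 2, 2],
         [2, 1, 0, 1, 0],
         [1, 3, 2, 0, 1]]
kernel = [[1, 0],
          [0, -1]]

features = conv2d_valid(image, kernel)  # 4 x 4 feature map
pooled = max_pool2d(features)           # 2 x 2 map after pooling
print(features)
print(pooled)
```

The kernel has only four weights regardless of the image size, which is the parameter saving that weight sharing provides; pooling then halves each spatial dimension.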

XGBoost
XGBoost, developed by Chen and Guestrin [20], is a powerful methodology for regression as well as classification. It has been used in many winning solutions in Kaggle machine learning competitions. XGBoost, based on the gradient boosting framework, repeatedly adds new decision trees that fit the residuals of the previous iterations, which improves the efficiency and performance of the learners. Unlike the gradient boosting proposed by Friedman [21], XGBoost uses a Taylor expansion to approximate the loss function, and the model achieves a better tradeoff between bias and variance, usually using fewer decision trees to obtain a higher accuracy. Details of XGBoost are described below.
Suppose a given sample set has n samples and m features; it can be expressed as $D = \{(x_i, y_i)\}$ $(|D| = n,\ x_i \in \mathbb{R}^m,\ y_i \in \mathbb{R})$, where $x_i$ is the feature vector and $y_i$ is the true value. The algorithm sums the results of K trees as the final predicted value, which is expressed as

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F \tag{1}$$

F is the set of decision trees, as follows:

$$F = \{ f(x) = w_{q(x)} \} \quad (q: \mathbb{R}^m \to \{1, \dots, T\},\ w \in \mathbb{R}^T)$$

where $f(x)$ is one of the trees, and $w_{q(x)}$ is the weight of the leaf node. T is the number of leaf nodes, and q represents the structure of each tree, which maps a sample to the corresponding leaf node. Therefore, the predicted value of XGBoost is the sum of the leaf node values of each tree. The goal of the model is to learn these K trees, so we minimize the following objective function:

$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \tag{2}$$

l is the loss that measures the difference between the estimated value $\hat{y}_i$ and the true value $y_i$; common loss functions include the logarithmic loss function, square loss function and exponential loss function. The regularization term Ω penalizes the complexity of the decision tree, which can prevent overfitting. Ω is expressed as follows:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2 \tag{3}$$

In the regular term, γ is a hyper-parameter that controls the complexity of the model, and T is the number of leaf nodes. λ is the penalty coefficient for the leaf weight w, which is usually constant. γ and λ determine the complexity of the model and are usually given empirically. During training, a new tree is added to fit the residuals of the previous round. Therefore, when the model has t trees, it is expressed as follows:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i) \tag{4}$$

Substituting (4) into the objective function (2) yields the function

$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \tag{5}$$

Then, XGBoost carries out the Taylor expansion of the objective function, takes the first three terms, removes the higher-order infinitesimal terms, and finally transforms the objective function into

$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \tag{6}$$

where $g_i$ and $h_i$ are the first and second derivatives of the loss function with respect to $\hat{y}_i^{(t-1)}$, respectively.
The term $l(y_i, \hat{y}_i^{(t-1)})$, the loss between the previous prediction score $\hat{y}_i^{(t-1)}$ and $y_i$, is constant with respect to $f_t$ and does not affect the optimization of the objective function, so it is removed.
The iteration of the tree model is transformed into the iteration of the leaf nodes, and the calculated optimal leaf node score is

$$w_j^{*} = -\frac{G_j}{H_j + \lambda} \tag{7}$$

where $G_j = \sum_{i \in I_j} g_i$, $H_j = \sum_{i \in I_j} h_i$, and $I_j = \{ i \mid q(x_i) = j \}$ is the set of samples assigned to leaf j. By substituting the optimal value into the objective function, the final objective function is obtained:

$$Obj = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \tag{8}$$

Overall, XGBoost adds a regularization term to the standard objective function to reduce model complexity. The first and second derivatives are applied to fit the residual error. The method also supports column sampling, which reduces both overfitting and computation. These improvements, however, introduce more hyper-parameters than the gradient boosting decision tree (GBDT), and it is difficult to tune them reasonably. A reasonable setting requires not only the prior knowledge of researchers and their experience in parameter tuning but also a great deal of time. Hyper-parameter optimization is an effective solution to this problem.
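As a worked illustration of the derivation, the sketch below evaluates the standard XGBoost closed forms for a fixed tree structure: the optimal leaf weight $-G_j/(H_j+\lambda)$ and the structure score $-\frac{1}{2}\sum_j G_j^2/(H_j+\lambda) + \gamma T$. The helper names, the toy labels and the choice of squared loss (for which $g_i = \hat{y}_i - y_i$ and $h_i = 1$) are assumptions made for the example, not values from the paper.

```python
def leaf_weight(G, H, lam):
    """Optimal weight of one leaf given its summed gradients G and Hessians H."""
    return -G / (H + lam)


def structure_score(leaves, lam, gamma):
    """Objective value of a fixed tree; leaves is a list of (G_j, H_j) pairs."""
    return -0.5 * sum(G * G / (H + lam) for G, H in leaves) + gamma * len(leaves)


# Toy data: previous-round predictions and true values (illustrative only).
y_true = [1.0, 1.2, 3.0, 3.4]
y_prev = [2.0, 2.0, 2.0, 2.0]

# Squared loss l = 1/2 (y_hat - y)^2 gives g = y_hat - y and h = 1.
g = [p - y for p, y in zip(y_prev, y_true)]
h = [1.0] * len(y_true)

# A two-leaf tree: samples 0-1 fall in the first leaf, samples 2-3 in the second.
leaves = [(sum(g[:2]), sum(h[:2])), (sum(g[2:]), sum(h[2:]))]

weights = [leaf_weight(G, H, lam=1.0) for G, H in leaves]
score = structure_score(leaves, lam=1.0, gamma=0.1)
print(weights, score)
```

The first leaf receives a negative weight (the previous predictions were too high there) and the second a positive one; a lower structure score indicates a better tree structure.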

Parameter Optimization
The parameter is one of the most significant concepts in machine learning, and training a model essentially means finding the appropriate parameters to achieve better results. Parameters are divided into model parameters and hyper-parameters. The model parameters are obtained by learning from the distribution of the training data, without the need for human experience. A hyper-parameter is a higher-level property of the model, such as its complexity or its ability to learn. A good set of hyper-parameters improves the performance of learning models, so tuning them is important. However, hyper-parameter tuning is subjective and relies on empirical judgement and trial-and-error approaches [22]. Hyper-parameter optimization algorithms overcome the dependence of manual search on experience and trial and error. Common hyper-parameter optimization algorithms include grid search, random search and Bayesian optimization. Below is a brief introduction to them.
Grid search (GS): Within a specified range for each hyper-parameter, GS steps through the candidate values one at a time, trains the model for each combination, and selects the combination that performs best on validation. However, GS cannot widely explore a hyper-parameter space, because its computational complexity grows exponentially with the number of hyper-parameters. Therefore, GS is not suitable for the optimization of models with many dimensions [23].
Random search (RS): RS draws a certain number of hyper-parameter sets from a specified distribution by randomly sampling within a search range. The theoretical basis is that if the set of random sample points is large enough, the global optimal value or an approximation of it will be found.
Bayesian optimization [24] tunes hyper-parameters by repeatedly adding sample points of the objective function to update its posterior distribution; the calculation is based on a Gaussian process and takes into account the information from the previous hyper-parameters, then adjusts the hyper-parameters to gradually improve the joint posterior distribution. Bayesian hyper-parameter optimization assumes that there is a real distribution, and the noise of the hyper-parameter is mapped to a specific target function. Xia et al. [25] applied the tree-structured Parzen estimator (TPE), a Bayesian optimization method, to tune the hyper-parameters of XGBoost; the results show that the model outperforms other models according to the evaluation measures. Guo et al. [26] used an improved gradient boosting machine (GBM) combined with advanced feature selection and Bayesian hyper-parameter optimization to establish a fitness evaluation model. The experimental results show that the model has higher evaluation accuracy than other models. Putatunda et al. [27] used the Bayesian optimization tool Hyperopt, RS and GS to adjust the hyper-parameters of the XGBoost algorithm on six real data sets and compared the performance of these three hyper-parameter optimization techniques. The results show that Bayesian optimization performs better in precision and time than GS and RS. However, Bayesian optimization is established on the basis of an independent prior distribution and the idealized hypothesis that properties are independent of each other. This condition is difficult to attain in practical applications; the number of properties may be large, or the correlation between the properties may be high, thereby causing performance degradation.
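The difference between grid search and random search can be sketched in a few lines of Python. The function `validation_loss` is a hypothetical stand-in for training a model and scoring it on validation data, and the hyper-parameter names and ranges are illustrative assumptions only.

```python
import itertools
import random


def validation_loss(params):
    """Hypothetical stand-in for 'train the model, return validation loss'."""
    return (params["max_depth"] - 6) ** 2 + 10 * (params["subsample"] - 0.8) ** 2


def grid_search(grid):
    """Evaluate every combination on the grid; cost is the product of grid sizes."""
    names = list(grid)
    candidates = [dict(zip(names, values))
                  for values in itertools.product(*grid.values())]
    return min(candidates, key=validation_loss), len(candidates)


def random_search(ranges, n_trials, seed=0):
    """Sample n_trials points uniformly from the search ranges."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        cand = {"max_depth": rng.randint(*ranges["max_depth"]),
                "subsample": rng.uniform(*ranges["subsample"])}
        if best is None or validation_loss(cand) < validation_loss(best):
            best = cand
    return best


grid = {"max_depth": [2, 4, 6, 8], "subsample": [0.5, 0.6, 0.7, 0.8, 0.9]}
best_grid, n_evals = grid_search(grid)  # 4 * 5 = 20 evaluations
best_random = random_search({"max_depth": (2, 8), "subsample": (0.5, 0.9)},
                            n_trials=20)
print(best_grid, n_evals, best_random)
```

With k candidate values for each of d hyper-parameters, the grid contains k^d combinations, which is the exponential growth mentioned above; random search keeps the budget fixed at `n_trials` regardless of d.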

PSO
PSO simulates a bird in a flock by designing a massless particle that has only two properties: the speed, which represents how fast it moves, and the position, which guides the direction in which it moves. Each particle searches for the optimal solution in the search space individually and stores the best one it has found as its current individual extremum. The best of the individual extrema of all the particles is taken as the current global optimal solution, according to which the whole particle swarm adjusts its speed and position. The process of PSO is as follows: First, initialize the particle swarm; then, evaluate the particles, calculate the fitness values, search for the individual extrema and find the global optimal solution. Finally, modify the speed and position of the particles.
The standard PSO algorithm is as follows: Suppose that in a d-dimensional search space, there is a population of m particles represented as $\{x_1, x_2, \dots, x_m\}$, where the particles are expressed by the position $x_i = [x_i^1, x_i^2, \dots, x_i^d]$, the speed $v_i = [v_i^1, v_i^2, \dots, v_i^d]$, the individual optimal position $p_i = [p_i^1, p_i^2, \dots, p_i^d]$, and the global optimal position $p_g = [p_g^1, p_g^2, \dots, p_g^d]$. Then, the speed and position of the particle can be updated at time t + 1 by the following formula:

$$v_i^{t+1} = \omega v_i^{t} + c_1 r_1 \left(p_i - x_i^{t}\right) + c_2 r_2 \left(p_g - x_i^{t}\right), \qquad x_i^{t+1} = x_i^{t} + v_i^{t+1} \tag{9}$$

ω is the inertia weight that maintains an effective balance between global exploration and local exploration; $c_1$ and $c_2$ are the learning factors, and $r_1$ and $r_2$ are random numbers drawn from a uniform distribution, which together adjust the step length in the direction of the particle's own best position and the direction of the global best position, respectively. To avoid a blind search by the particle, its speed and position are generally limited to $[-V_{max}, V_{max}]$ and $[-X_{max}, X_{max}]$.
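The standard PSO loop described in this section can be sketched as follows. This is a minimal illustration under assumed settings: the parameter values (`w`, `c1`, `c2`, swarm size, bounds) and the sphere test function are choices made for the example and are not the settings used in the paper.

```python
import random


def pso(fitness, dim, n_particles=20, n_iter=100, w=0.7, c1=1.5, c2=1.5,
        x_max=5.0, v_max=1.0, seed=0):
    """Minimize `fitness` with the standard global-best PSO update."""
    rng = random.Random(seed)
    xs = [[rng.uniform(-x_max, x_max) for _ in range(dim)]
          for _ in range(n_particles)]
    vs = [[rng.uniform(-v_max, v_max) for _ in range(dim)]
          for _ in range(n_particles)]
    pbest = [x[:] for x in xs]                 # individual best positions
    pbest_val = [fitness(x) for x in xs]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # global best position

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                v = (w * vs[i][d]
                     + c1 * r1 * (pbest[i][d] - xs[i][d])
                     + c2 * r2 * (gbest[d] - xs[i][d]))
                # Clamp speed and position to avoid a blind search.
                vs[i][d] = max(-v_max, min(v_max, v))
                xs[i][d] = max(-x_max, min(x_max, xs[i][d] + vs[i][d]))
            val = fitness(xs[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = xs[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = xs[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Simple convex test function with its minimum at the origin."""
    return sum(v * v for v in x)


best_x, best_val = pso(sphere, dim=3)
print(best_x, best_val)
```

Because every particle is attracted to the same global best, the swarm can collapse around one point early, which is the premature-convergence issue that motivates the adaptive variant.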

CNN-XGBoost Based on APSO Model
In this section, the image classification model CNN-XGBoost based on APSO is presented in detail (see Figure 1). The model is divided into two parts: the feature extractor and the feature classifier. First, the feature extractor CNN extracts features from the image data set. Second, XGBoost is trained on these features and classifies them. Then, according to the fitness value obtained from the model, the improved optimizer APSO is used to optimize the overall framework of CNN and XGBoost. Finally, when the termination conditions are met, the optimal hyper-parameter values obtained by APSO are used to establish the image classification model. The process is described below: Section 2.1 describes the image feature extractor. Section 2.2 describes the feature classifier XGBoost. Section 2.3 introduces the hyper-parameter optimizer APSO. Finally, the overall framework of the model is summarized.

Feature Extractor
CNN has once again led the development of artificial intelligence since the AlexNet proposed by Krizhevsky et al. [28] greatly improved the accuracy of image recognition. While the performance of such algorithms has improved, the calculation time grows longer and longer as the model depth increases. This paper aims to explore a better combination of the feature extraction ability of neural networks and the classification ability of decision trees, so we design a lightweight model. To limit the computational cost, the CNN for image classification consists of 7 layers: 1 input layer, 3 convolution layers, 2 pooling layers and 1 fully connected layer. The configuration of the feature extractor architecture is shown in Table 1.
The steps of the feature extractor are as follows:
1. Initialize the parameters of the network.
2. After the convolution calculation, obtain the feature data through each activation layer and pooling layer.
3. Flatten the feature maps into a one-dimensional vector, which is processed by a fully connected layer.
4. Assemble the vectors into a new training data set, which is used for prediction by the subsequent classifier.

Feature Classifier
In this section, XGBoost is trained on the features extracted by the CNN, which is itself trained with the BP algorithm, to obtain a tree structure suitable for feature classification.
The tree structure directly affects the classification performance of XGBoost. The objective function in formula (8) is also known as the structure score; its value represents the quality of the tree structure. To minimize the value of the objective function, one of the key tasks of tree learning is to find the best node partition. Each candidate split is evaluated using the scores of the instance sets in the left and right nodes after the split. Based on this evaluation, the optimal node to split is found, and finally, a tree with the optimal structure for classifying the data is obtained.
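The evaluation of candidate splits can be illustrated with the standard XGBoost split gain, $\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$, which is the reduction in structure score obtained by splitting a node. The sketch below scans one feature exactly; the function names and the toy gradients are assumptions for illustration, and real implementations add approximate and sparsity-aware variants.

```python
def leaf_score(G, H, lam):
    """Contribution G^2 / (H + lambda) of one node to the structure score."""
    return G * G / (H + lam)


def best_split(feature, g, h, lam=1.0, gamma=0.0):
    """Exact greedy scan over one feature; returns (threshold, gain)."""
    order = sorted(range(len(feature)), key=lambda i: feature[i])
    G_total, H_total = sum(g), sum(h)
    G_left = H_left = 0.0
    best_gain, best_threshold = float("-inf"), None
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        G_left += g[i]
        H_left += h[i]
        if feature[i] == feature[nxt]:
            continue  # cannot split between identical feature values
        G_right, H_right = G_total - G_left, H_total - H_left
        gain = 0.5 * (leaf_score(G_left, H_left, lam)
                      + leaf_score(G_right, H_right, lam)
                      - leaf_score(G_total, H_total, lam)) - gamma
        if gain > best_gain:
            best_gain = gain
            best_threshold = 0.5 * (feature[i] + feature[nxt])
    return best_threshold, best_gain


# Toy example: one feature, with gradients that change sign between 2.0 and 3.0.
feature = [1.0, 2.0, 3.0, 4.0]
g = [1.0, 0.8, -1.0, -1.4]
h = [1.0, 1.0, 1.0, 1.0]
threshold, gain = best_split(feature, g, h, lam=1.0, gamma=0.1)
print(threshold, gain)
```

The scan picks the threshold that separates the positive-gradient samples from the negative-gradient ones, since that split lowers the structure score the most.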

Adaptive PSO
In this section, we introduce PSO improved by adaptive learning strategies. During the search, the population is adaptively divided into subgroups according to the particle distribution. In each subgroup, we use two different learning strategies to guide the search directions of two different types of particles. The search process stops when a global optimal value is found or a termination condition is met.
Relevant studies have shown that population diversity is the key to avoiding the premature convergence of PSO; the core guiding principle of the algorithm is clustering [29]. According to the distribution of the particles, the fast search clustering method [30] is adopted to adaptively divide the population into several subgroups. This method automatically discovers the cluster centres of a data set. The basic principle is that a cluster centre has two basic features: the first is that it is surrounded by points with lower local density, and the second is that it lies at a relatively large distance from any point with a higher local density. Therefore, for a population of N particles $S = \{x_i\}_{i=1}^{N}$, two properties $\rho_i$ and $\delta_i$ are defined for each particle. $\rho_i$, the local density of particle i, is given by Equation (10). Equation (11) gives the expression of the distance $\delta_i$, the minimum distance from particle i to any other particle that has a higher local density, where $d_{ij}$ denotes the distance between particles i and j:

$$\delta_i = \min_{j:\, \rho_j > \rho_i} d_{ij} \tag{11}$$

For the particle with the maximum local density ρ, $\delta_i = \max_{j} d_{ij}$.
According to Equation (10), if the density of particle $x_i$ is the maximum, $\delta_i$ is much larger than the distance δ of its nearest particles. Therefore, the centre of a subgroup is a particle that has an unusually large distance δ as well as a relatively high density. In other words, the particles with larger ρ and δ values are selected as cluster centres. Following [30], the quantity $\gamma_i = \rho_i \delta_i$ is used to filter out particles that may become cluster centres. We arrange the $\gamma_i$ values in descending order and then use the truncation distance to select the cluster centres from this order. Because the γ values of the top-ranked particles tend to increase sharply compared with those of the other particles, they are clearly distinguished from the γ value of the next particle. Referring to [30], R is set to be between 0.1 and 0.2. Through a parameter sensitivity analysis, we found that the value of the distribution parameter has no effect on the performance of the particle swarm algorithm; the default value in this article is 2. Once the cluster centres are obtained, each remaining particle $x_j$ is placed in the subgroup of the particle that has a higher ρ than $x_j$ and is closest to $x_j$.
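The selection of cluster centres from ρ, δ and γ = ρ·δ can be sketched as follows. The exact form of the local density is not reproduced here, so the Gaussian-kernel density used below is an assumption (one of the variants commonly used with density-peak clustering); the points, the cut-off `d_c` and the choice of taking the top two γ values are also illustrative.

```python
import math


def density_peaks(points, d_c):
    """Return (rho, delta, gamma) for each point, density-peak style."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Assumption: Gaussian-kernel local density with cut-off distance d_c.
    rho = [sum(math.exp(-(dist[i][j] / d_c) ** 2) for j in range(n) if j != i)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # The densest point has no denser neighbour: use its maximum distance.
        delta.append(min(higher) if higher else max(dist[i]))
    gamma = [r * d for r, d in zip(rho, delta)]
    return rho, delta, gamma


# Two well-separated groups of particles; indices 4 and 9 sit in the middle.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1), (5.05, 5.05)]
rho, delta, gamma = density_peaks(points, d_c=0.2)

# Rank by gamma; the largest values mark the candidate cluster centres.
centres = sorted(range(len(points)), key=lambda i: gamma[i], reverse=True)[:2]
print(centres)
```

The two middle points combine a high density with a large distance to any denser point, so their γ values stand far apart from those of the remaining particles.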
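Based on the result of the subgroup division, the particles of each subgroup are divided into ordinary particles and locally optimal particles. Under the primary guidance of the locally optimal particle, the ordinary particles exert their local search ability, and the update formula is given as (12):

$$v_i^d \leftarrow \omega v_i^d + c_1\, rand_1^d \left(pbest_i^d - x_i^d\right) + c_2\, rand_2^d \left(cgbest_c^d - x_i^d\right), \qquad x_i^d \leftarrow x_i^d + v_i^d \tag{12}$$

where ω is the inertia weight, $c_1$ and $c_2$ are the learning factors, $rand_1^d$ and $rand_2^d$ are uniformly distributed random numbers in the interval [0, 1], $pbest_i^d$ is the best position of particle i, and $cgbest_c^d$ is the current best position of the particles in subgroup c. To enhance the exchange of information between subgroups, the locally optimal particles are mainly updated by integrating the information of each subgroup. The update formula is as follows (see (13)), where C is the number of subgroups:

$$v_i^d \leftarrow \omega v_i^d + c_1\, rand_1^d \left(pbest_i^d - x_i^d\right) + c_2\, rand_2^d \left(\frac{1}{C} \sum_{c=1}^{C} cgbest_c^d - x_i^d\right), \qquad x_i^d \leftarrow x_i^d + v_i^d \tag{13}$$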
The locally optimal particles search for local optima, but more importantly, they are used as the medium for information exchange between subgroups to modify the direction of the population search and further improve the population diversity. Unlike a learning strategy confined to a single subgroup, which causes too many particles to gather locally, this learning strategy integrates the information of the locally optimal particles from different subgroups to obtain more information and help avoid local optima. On the other hand, learning from too much information may make the direction of the update too fuzzy, which may counteract the convergence of the particles. Considering that the locally optimal particles have the maximum probability of finding the optimal solution in their subgroups, their information provides valuable guidance towards the optimal solution. Therefore, the average of the $cgbest_c^d$ of all subgroups is used to guide the update of the locally optimal particles (see (13)). This approach improves the transmission of the optimized information between the subgroups, further increases the population diversity, and prevents the particles from falling into local optima.
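The two learning strategies can be sketched side by side: an ordinary particle is pulled towards the best particle of its own subgroup, whereas a locally optimal particle is pulled towards the average of the best particles of all subgroups. The function names and parameter values below are assumptions for illustration, and the code follows the description in this section rather than a reference implementation.

```python
import random


def velocity_update(x, v, pbest, guide, w, c1, c2, rng):
    """One PSO step towards the particle's own best and a guiding position."""
    new_v = []
    for xd, vd, pd, gd in zip(x, v, pbest, guide):
        r1, r2 = rng.random(), rng.random()
        new_v.append(w * vd + c1 * r1 * (pd - xd) + c2 * r2 * (gd - xd))
    new_x = [xd + vd for xd, vd in zip(x, new_v)]
    return new_x, new_v


def update_ordinary(x, v, pbest, cgbest, w=0.7, c1=1.5, c2=1.5, rng=random):
    """Ordinary particle: guided by the best particle of its own subgroup."""
    return velocity_update(x, v, pbest, cgbest, w, c1, c2, rng)


def update_local_best(x, v, pbest, all_cgbest, w=0.7, c1=1.5, c2=1.5, rng=random):
    """Locally optimal particle: guided by the mean of all subgroup bests."""
    n_groups = len(all_cgbest)
    mean_guide = [sum(cg[d] for cg in all_cgbest) / n_groups
                  for d in range(len(x))]
    return velocity_update(x, v, pbest, mean_guide, w, c1, c2, rng)


rng = random.Random(0)
# Two subgroups with bests at (2, 0) and (0, 4); their mean is (1, 2).
x_new, v_new = update_local_best([0.0, 0.0], [0.0, 0.0], [0.0, 0.0],
                                 [[2.0, 0.0], [0.0, 4.0]], rng=rng)
print(x_new, v_new)
```

Averaging the subgroup bests gives the locally optimal particle a direction that no single subgroup dictates, which is how information is exchanged between subgroups.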

Training the Model
First, the hyper-parameters of CNN (including the learning rate, batch_size and dropout) and the hyper-parameters of the XGBoost model (including the maximum tree depth, subsample ratio, column subsample ratio, minimum child weight, maximum delta step and gamma-delta) are the optimization targets, and the position of each particle is randomly initialized in the hyper-parameter search space. Second, the particles are adaptively divided into populations. This step is achieved by calculating the local density of the particles and the distances to the particles whose local density is higher. According to the values determined by the position of a particle, we assign the hyper-parameters of the model and feed the verification data into the model for prediction. Finally, the loss function on the verification data set is used as the fitness function of the particles. In its simplified form, credit scoring is a two-category problem. If the labels of the positive and negative samples of credit data are defined as +1/−1, the logistic loss function is defined as

$$L(y, p) = \log\left(1 + e^{-yp}\right) \tag{14}$$

where p is the predictive value, and y represents the actual value. In this paper, because our model labels are 0 and 1, the logistic loss is as follows:

$$L(y, p) = -\left[\, y \log p + (1 - y) \log(1 - p) \,\right] \tag{15}$$

The particles are divided into ordinary particles and optimal particles in accordance with the fitness value. The corresponding update strategy updates the information of each type of particle, and the algorithm checks whether the termination condition is reached; if so, we obtain the optimized value. If not, based on the positions of the particles, the model divides the population again, calculates the fitness values, and updates the position information of each particle until the termination condition is reached. Finally, the optimal hyper-parameters are used to construct the model, and training and prediction are carried out on the data.
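The two forms of the logistic loss describe the same quantity under different label conventions. The short check below assumes that, in the +1/−1 form, the predictive value is a raw score and that, in the 0/1 form, it is the probability obtained by passing that score through a sigmoid; the function names and the example values are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logloss_pm1(y, score):
    """Logistic loss for labels in {-1, +1}; `score` is a raw model output."""
    return math.log(1.0 + math.exp(-y * score))


def logloss_01(y, p):
    """Logistic loss for labels in {0, 1}; `p` is a predicted probability."""
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def fitness(labels, probs):
    """Mean 0/1 logistic loss on a validation set (lower is better)."""
    return sum(logloss_01(y, p) for y, p in zip(labels, probs)) / len(labels)


score = 0.7
for y01 in (0, 1):
    y_pm1 = 2 * y01 - 1  # map {0, 1} to {-1, +1}
    print(y01, logloss_pm1(y_pm1, score), logloss_01(y01, sigmoid(score)))

val_fitness = fitness([1, 0, 1], [0.9, 0.2, 0.6])
print(val_fitness)
```

Both columns printed by the loop agree, so either convention can serve as the particle fitness as long as the labels and the predictive value are interpreted consistently.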
The algorithm steps are as follows:
1. Divide the data set into training data for training the model and verification data for prediction. Initialize the adaptive PSO algorithm. Divide the particle swarm into subgroups according to Equations (10) and (11).
2. Take the logistic loss function as the fitness function, and calculate the fitness value of each particle according to (15): build the model with the hyper-parameters determined by the current particle, train it and predict on the data sets, and update the fitness value with the resulting loss.
3. Determine the individual best position pbest of each particle and the locally optimal particle cgbest of each subgroup according to the result of the population division and the fitness values of the particles.
4. According to (12) and (13), update the positions of the ordinary particles and the locally optimal particles, respectively.
5. Judge whether to terminate. If the termination condition on the number of iterations is met, return the optimal values of the hyper-parameters; otherwise, return to step 2.
6. Use the optimal hyper-parameters to build the CNN-XGBoost model and calculate the evaluation indexes.
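One practical detail of the steps above is how a particle position is turned into a hyper-parameter setting before the fitness is computed. The sketch below assumes that positions are normalized to [0, 1] in every dimension; the `SEARCH_SPACE` ranges are hypothetical and are not the ranges used in the paper. The XGBoost entries use that library's parameter names (`max_depth`, `subsample`, `colsample_bytree`, `min_child_weight`, `max_delta_step`, `gamma`), while the CNN entries are plain labels.

```python
# Hypothetical search ranges: (lower bound, upper bound, type).
SEARCH_SPACE = {
    "learning_rate": (1e-4, 1e-1, float),    # CNN
    "batch_size": (16, 256, int),            # CNN
    "dropout": (0.0, 0.5, float),            # CNN
    "max_depth": (2, 10, int),               # XGBoost
    "subsample": (0.5, 1.0, float),          # XGBoost
    "colsample_bytree": (0.5, 1.0, float),   # XGBoost
    "min_child_weight": (1, 10, int),        # XGBoost
    "max_delta_step": (0, 10, int),          # XGBoost
    "gamma": (0.0, 5.0, float),              # XGBoost
}


def decode(position):
    """Map a particle position in [0, 1]^9 to a hyper-parameter dictionary."""
    params = {}
    for u, (name, (lo, hi, kind)) in zip(position, SEARCH_SPACE.items()):
        u = min(1.0, max(0.0, u))  # keep the particle inside the search space
        value = lo + u * (hi - lo)
        params[name] = int(round(value)) if kind is int else value
    return params


params = decode([0.5] * len(SEARCH_SPACE))
print(params)
```

In step 2, a decoded dictionary of this kind would be split into the CNN and XGBoost settings, the two-stage model would be trained, and the validation loss would be returned as the fitness of that particle.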

Experimental Setup
In this section, we evaluate the performance of the CNN-XGBoost model experimentally. First, the image sets are introduced. Second, we describe the structure and parameter settings of the CNN. Finally, the experimental results are analyzed.

Image Sets
In the image classification experiment, experiments were carried out on three data sets: MNIST [31], CIFAR-10 [32] and CIFAR-100. These three data sets are widely used to study the performance of image classification methods. MNIST is a labeled data set of handwritten digits 0 to 9. The images in the CIFAR-10 data set contain ten categories of natural objects. In CIFAR-10, there are significant differences in the positions and proportions of objects within categories, as well as in the colors and textures between categories. The CIFAR-100 data set is similar to CIFAR-10; it has 100 classes, each containing 500 training images and 100 test images. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (its class) and a "coarse" label (its superclass) (see Table 2).

Results
Average accuracy (ACC) is one of the most widely used evaluation indexes for classification. It represents the overall performance of the model and reflects its overall classification ability. To test the performance of the image classification model proposed in this paper, we evaluate it on the above three data sets. All methods are trained on the original training data set. The classification accuracy results are shown in Table 4.
We first compared our model with advanced methods on the MNIST dataset. The comparison includes three combination methods (DLSVM [33], SAE-CNN and CNN-SVM), two control methods (CNN and PSO-CNN-XGBoost, the CNN-XGBoost model optimized by ordinary PSO) and two high-performance methods (CIDBM [34] and PCAnet [35]). It can be seen from the table that our model performs better than the other methods on MNIST. The comparison with CNN reflects the superiority of the two-stage model. Compared with the combination methods, our model shows obvious advantages. The reason is that the ensemble classifier XGBoost exploits image features better than other classifiers, and its stronger classification performance makes it more compatible with CNN. Compared with the control model PSO-CNN-XGBoost, our model shows a significant improvement. This shows that APSO optimizes the hyper-parameters of the overall framework to promote the integration of the two parts, which enables the model to adaptively adjust the framework to grasp image features and improves classification accuracy. Our performance is also better than CDBM and PCAnet, further showing the good performance of the model.
We also compared the image classification methods on the more complex CIFAR-10 dataset. The compared models include the two combined models mentioned above, the two control models and four high-precision models: DLSVM, Maxout Networks [36], NIN [37] and ML-DNN [38]. The table shows that our model has the highest accuracy on CIFAR-10, reaching 91.98%, but its advantage over the most advanced model is small: our test result is only 0.10 percentage points higher than that of ML-DNN. Our model is significantly improved compared with the control PSO-CNN-XGBoost. This shows that the improvement from APSO parameter optimization is obvious: the bidirectional optimization makes the two parts more closely integrated. The optimization mechanism optimizes the two parts of the model as a whole, integrates the learning objectives, and makes the whole model more suitable for image classification tasks. Our model also has a large lead over the combined method CNN-SVM, which further demonstrates that the combination of CNN and XGBoost is more powerful in image classification. In general, our model is also competitive on the more complex image data set CIFAR-10.

To explore the performance of our model on more complex data sets and further illustrate its generality, it is compared with other representative methods on CIFAR-100. It can be seen from Table 4 that PSO-CNN-XGBoost is slightly better than NIN on ACC. The CNN-XGBoost model under APSO tuning surpasses the other methods and achieves the highest accuracy, which indicates that APSO's adaptive strategy guides the particle search well and, to a certain extent, avoids falling into local optima. APSO enables the framework to extract and use image features fully, which improves the classification accuracy. In general, our model architecture can meet the needs of more complex data sets and performs well. The model has good generality.

Additional Experiments
To further explore the classification performance of the model, we use the XGBoost model optimized by APSO to build a credit scoring model. First, to eliminate the errors caused by data that have self-variation or large differences in values, we preprocessed the original credit data. Then, we carried out feature engineering, which aims to extract the features from the original data that are maximally useful. In the final step, the model is built with the selected features and the hyper-parameters tuned by adaptive PSO, and the test data are used to evaluate the trained models.
The model is divided into three parts: data preprocessing, feature engineering and model training. First, the data preprocessing involves standardizing the data sets and marking missing values. Second, the feature engineering is based on the feature importance scores obtained from the model with its initial hyper-parameters. According to the rank of the feature importance, redundant features are removed. Finally, the model is built with the selected features and the hyper-parameters tuned by APSO. The flow chart is shown in Figure 2. The process is described in detail below.

Data Sets
In this section, the performance of the model is verified on credit data sets. Two credit data sets, German and Australian, from the UCI machine learning repository are used. In addition to these data sets, P2P credit data from two platforms (Lending Club in the US and We.com in China) were also used to verify the effectiveness of our model in providing decision support for P2P lending businesses and to verify the generalization of the model (see Table 5). Data preprocessing is divided into two steps: data standardization, namely, 0-1 scaling, and missing value processing. Although tree-based algorithms are not affected by scaling, feature normalization can greatly improve the accuracy of classifiers, especially those based on distance or edge calculations. Therefore, the standardization of data sets in data preprocessing makes the model more accurate and persuasive. The training set is described as D = {X, Y}, where X = {x_1, x_2, ..., x_m} represents an m-dimensional feature space, Y = {0, 1} represents the target value, Y = 0 represents a poor application, and Y = 1 represents a good application. If x is a certain feature, it is calculated by 0-1 scaling as follows:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)} \tag{16}$$

where x' expresses the standardized value. Credit data often have missing values. XGBoost comes with its own sparsity-aware split-finding algorithm, which can learn the best way to deal with missing values and is more suitable for modeling than traditional methods of dealing with missing values. If there are outliers and noise in the data, standardization can indirectly reduce the influence of outliers, and centralization can deal with extreme values.
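The 0-1 scaling step can be illustrated with a short Python sketch. The function name is hypothetical; leaving missing entries as `None` (so that a learner such as XGBoost can handle them later) and mapping a constant feature to 0 are assumptions made for this example.

```python
def min_max_scale(values):
    """Scale one feature column to [0, 1]; missing entries (None) are left untouched."""
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    if hi == lo:
        # Constant feature: every present value maps to 0 (assumed convention).
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([2, 4, None, 6])
```

Each feature column is scaled independently, using the minimum and maximum of its observed values.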

Feature Selection
First, the relative feature importance scores are calculated with the initial hyper-parameters, and the redundant features are discarded by the feature selection algorithm. Feature selection matters because it eliminates redundant features, highlights effective features, improves the calculation speed and removes the influence of adverse features on the prediction results. As an example, Figure 3 shows the relative feature importance of the German credit dataset under the XGBoost model.

Feature Engineering
Feature engineering selects important features and removes irrelevant ones to build a model. It can greatly alleviate the curse of dimensionality, improve the operational efficiency, reduce the difficulty of learning tasks, make the model simpler and reduce the computational complexity.
By calculating the importance of the features, features that are more favorable to the model are selected. As is common, we choose the gain as the feature importance measure. Following the importance ranking, XGBoost adopts the stochastic fractal search (SFS) algorithm and adds features into the data set one by one to form subsets.
Under the default hyper-parameters of XGBoost, the subset that minimizes the logistic loss after 10-fold cross-validation is selected as the feature subset.
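The subset construction described above (adding features one by one in order of importance and keeping the subset with the lowest loss) can be sketched as follows. This is only an illustration: `evaluate_loss` is a hypothetical callback that would return the 10-fold cross-validated logistic loss of the model trained on the given features, and the feature names and loss values in the toy example are made up.

```python
def select_feature_subset(ranked_features, evaluate_loss):
    """Grow the feature set in importance order and keep the subset with the lowest loss."""
    best_subset, best_loss = [], float("inf")
    subset = []
    for name in ranked_features:  # most important feature first
        subset.append(name)
        loss = evaluate_loss(subset)
        if loss < best_loss:
            best_subset, best_loss = subset[:], loss
    return best_subset, best_loss

# Toy stand-in for the cross-validated loss, indexed by subset size.
toy_losses = {1: 0.60, 2: 0.52, 3: 0.55}
chosen, chosen_loss = select_feature_subset(
    ["f_a", "f_b", "f_c"], lambda features: toy_losses[len(features)]
)
```

In the toy example, the third feature increases the loss, so only the two most important features are retained.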

Training the Model
To make the hyper-parameters fit the training data set as well as possible, we use cross-validation on the data set. We tested several cross-validation methods and, based on the experimental results, decided to use 10-fold cross-validation to divide the data sets. Apart from the feature extractor, the steps are the same as for the image classification model: First, the hyper-parameters of the XGBoost model are the optimization targets, and the position of each particle is randomly initialized in the hyper-parameter search space. Second, the particles are divided into adaptive populations. This step is achieved by calculating the local density of each particle and its distance to the particles whose local density is higher. The hyper-parameters of the XGBoost model are then assigned according to the position of each particle, and the validation data are fed into the model for prediction. Finally, the loss function on the validation data set serves as the fitness function of the particles.

Baseline Models
To verify the performance of the model, we divided the baseline models into three groups: the traditional machine learning group, the ensemble learning group, and the XGBoost group. The traditional machine learning group is DT, LR, NN, SVM, and RF; the ensemble learning group is AdaBoost, AdaBoost-NN, Bagging-DT, Bagging-NN, and GBDT; the XGBoost group is XGBoost-GS, XGBoost-RS, XGBoost-TPE, PSO-XGBoost and APSO-XGBoost. The baseline models are described in Appendix A.

The Evaluation Scale
When the label of a sample is 0, the loan application is judged to be in default, indicating that the borrower failed to pay off the loan in time and has poor credit; when the sample label is 1, the loan application is non-default, indicating that the borrower fulfilled the repayment agreement and has good credit. In credit scoring, the average accuracy is one of the most popular evaluation indices and represents the overall performance of the model. To better explore the ability of the model to distinguish between non-default and default applications, the type I and type II errors in the confusion matrix are often used to evaluate the prediction performance of the models in detail. A type I error is a default loan application being wrongly classified as non-default. Conversely, a type II error is a non-default application being misclassified as default. TP and TN in the confusion matrix (see Table 6) represent the numbers of correctly classified good borrowers and bad borrowers, respectively. FP and FN represent the numbers of misclassified loan applications. The formulas are defined as follows:

The average accuracy (ACC):

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{17}$$

The type I errors:

$$\text{Type I error} = \frac{FP}{FP + TN} \tag{18}$$

The type II errors:

$$\text{Type II error} = \frac{FN}{TP + FN} \tag{19}$$

The Brier score (BS) measures the accuracy of the predicted probability and reflects the calibration of the predictions. The BS ranges from 0 to 1, where 0 corresponds to perfect probabilistic predictions and 1 to the poorest. The BS is defined as follows:

$$\mathrm{BS} = \frac{1}{N} \sum_{i=1}^{N} (p_i - y_i)^2 \tag{20}$$

where N is the number of samples, and p_i and y_i denote the probability prediction and the true label of sample i, respectively. The F1-score takes into account both the precision and the recall of classification models. It is the harmonic mean of these two indicators, and it ranges from 0 to 1:

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{21}$$

where precision is the proportion of true positive samples among the predicted positive cases; it is defined as

$$\mathrm{precision} = \frac{TP}{TP + FP} \tag{22}$$

and recall is the proportion of correctly predicted positive cases among all actual positive cases; it is defined as

$$\mathrm{recall} = \frac{TP}{TP + FN} \tag{23}$$
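The evaluation measures described above can be computed directly from the confusion-matrix counts, as in the following sketch. It assumes, as in the text, that the positive class is the good (non-default) application, so a type I error is a false positive; the function names are illustrative, and zero denominators are not handled.

```python
def credit_metrics(tp, tn, fp, fn):
    """ACC, type I/II error rates, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "acc": (tp + tn) / (tp + tn + fp + fn),
        "type1": fp / (fp + tn),  # default applications accepted as non-default
        "type2": fn / (tp + fn),  # non-default applications rejected as default
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

def brier_score(probs, labels):
    """Mean squared difference between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

metrics = credit_metrics(tp=60, tn=20, fp=10, fn=10)
```

With these counts, the accuracy is 0.8, while the type I error rate (10 of 30 defaults accepted) is much higher than the type II error rate (10 of 70 good applications rejected), which is the kind of imbalance that accuracy alone hides.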

Comparisons among Hyper-Parameter Optimization Methods
To demonstrate the performance of the tuning strategies, Figure 4 shows the convergence curves of the average loss function of the parameter optimization methods on the four credit data sets. The ordinate represents the average minimum value of the loss function; the abscissa represents the number of iterations. It can be seen from the figure that GS and RS converge more slowly. TPE converges faster, and its result is better than those of GS and RS. The loss value of PSO is slightly lower than that of TPE, and PSO converges much faster. PSO and APSO have the fastest convergence speed among the parameter optimization methods. However, PSO converges prematurely, which leads to an unsatisfactory final error rate and a failure to find the global optimal value. APSO achieves the best convergence performance among all methods. Its loss continues to decline after some iterations, indicating that the optimization mechanism helps subgroups increase diversity, prevents local particles from clustering, and helps particles to find the local optimal value.

Discussion
The ACC is one of the most widely used and intuitive indicators. It indicates the overall prediction ability of the model. On the German data set (see Table 7), the APSO-XGBoost model obtained the best ACC, 77.48%, which is 1.37% and 1.16% higher than the best-performing single classifier LR and the best ensemble classifier GBDT in their groups, respectively. The APSO-XGBoost model improves clearly over PSO-XGBoost on every indicator. The reason is that when APSO optimizes the hyper-parameters of XGBoost, the particle swarm easily jumps out of local optima and continues to find better values. Hyper-parameters with a better fitness value give XGBoost a better tree structure, which makes the prediction accuracy of the model higher. Although the single classifier NN and Bagging-NN perform best in terms of type I error, our model has better results on the other three indicators. APSO-XGBoost has a better F1-score than the other models, indicating that it still predicts more accurately than the other models under data imbalance.
For the Australian data set (see Table 8), the ACC of APSO-XGBoost is 1.15% higher than that of the best-performing single classifier LR and 0.51% higher than that of the best-performing ensemble classifier RF. Because the numbers of positive and negative samples in the data set are very close, the error distribution of most models is balanced in general, and only a few models are uneven. In credit scoring, a high type I error rate guides institutions to adopt a more stringent mechanism to identify more applicants who have a possible default risk and reject them to reduce the risk of bad debt. The XGBoost group performs best in terms of the balance of the error distribution. This shows that the XGBoost classifier is well optimized. The type I error rate of our model ranks second only to that of the single classifier LR model, 11.80%. However, the error distribution of LR is uneven, and its type II error rate is the worst. The model evaluates more good applicants as bad, resulting in a higher type II error rate. In contrast, the error distribution of our method is more balanced than that of LR because the adaptive particle swarm can optimize multiple hyper-parameters at the same time, which ensures that the tree structure of the model controls both the false positive rate and the false negative rate. In terms of the BS, our model achieves the best score, 0.086.

For the P2P-LC data set (see Table 9), the performance of the XGBoost optimization group members was similar. The ACC of our method improved by 0.82% compared to that of the second-ranked XGBoost-RS, and it is 4.46% and 2.08% higher than those of the best-performing single classifier LR and ensemble classifier GBDT, respectively. PSO-XGBoost shows little improvement over XGBoost-TPE. The improved APSO-XGBoost achieved good performance in terms of the type I error rate. In terms of type II errors, the best score is in the XGBoost optimization group. Although LR is best, its type I error rate is high. This shows that the error distribution is unbalanced. As far as the BS is concerned, APSO-XGBoost has the best performance, indicating that the model is more accurate in its prediction probability than the other models.
On the unbalanced P2P-We data set (see Table 10), the ACC of APSO-XGBoost reached the best value, 84.72%, which was 11.46% higher than the ACC of the DT, which performed best in the single classifier group, and 1.43% higher than that of the best classifier, GBDT, in the ensemble learning group. The XGBoost group generally had a smaller type I error rate than the other models, indicating that models based on XGBoost are less affected by an unbalanced distribution than the other classifier models. PSO-XGBoost shows a small improvement over XGBoost optimized by the other hyper-parameter optimization methods.

The proposed model not only fully extracts image features but also uses XGBoost to overcome the shortcomings of a single classifier and effectively distinguish features, while APSO optimizes the hyper-parameters of the overall architecture. APSO uses two different learning strategies to update the information of particles, enhance the diversity of the particle population and prevent the algorithm from falling into local optima. Thereby, the adaptive processing capability of the model for image features was improved, and the classification accuracy increased. APSO optimizes both CNN and XGBoost. On the one hand, CNN is optimized to extract deep features so that the extracted features are more suitable for the decision-tree-based XGBoost. On the other hand, XGBoost is optimized so that the structure of the model better matches the extracted features and thus makes better use of the image features. This bidirectional optimization structure allows the features of the image to be fully extracted and fully used for classification. Our model has the best results on the image data sets compared to other models, which shows the effectiveness of the model. In addition, the experimental results on the additional data sets show that the proposed APSO-XGBoost credit scoring model also achieved good results on credit data, indicating that the model has strong generalization ability.
Future work may modify the proposed credit scoring model, optimizing deeper models with the novel hyper-parameter optimization methods.
where $d_{ij}$ is the Euclidean distance between particles $x_i$ and $x_j$, and $d_c$ is the truncation distance. The truncation distance is $d_c = d_{R \cdot M}$, where R represents the proportion and M indicates that the matrix $d_{ij}$ contains $M = \frac{1}{2}N(N - 1)$ values, where N represents the number of particles. It can be seen that $d_c$ is the distance corresponding to the $(R \cdot M)$-th value of $d_{ij}$.
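The truncation distance defined above can be illustrated with a small Python sketch. Sorting the pairwise distances in ascending order and rounding R·M up to the next integer index are assumptions, since the text does not state either convention; the function name is illustrative.

```python
import math
from itertools import combinations

def truncation_distance(points, ratio):
    """Return d_c, the (ratio * M)-th smallest of the M = N(N-1)/2 pairwise distances."""
    dists = sorted(math.dist(a, b) for a, b in combinations(points, 2))
    m = len(dists)  # M = N(N-1)/2 distinct particle pairs
    k = min(max(math.ceil(ratio * m), 1), m)  # 1-based rank, kept within [1, M]
    return dists[k - 1]

particles = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (6.0, 0.0)]
d_c = truncation_distance(particles, ratio=0.5)
```

For the four particles in the example, the six pairwise distances are 1, 2, 3, 3, 5 and 6, so a proportion of R = 0.5 selects the third smallest value.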

Figure 3. Relative feature importance on German credit dataset.

Funding: This research was funded by the National Natural Science Foundation of China under Grant number 61972227, Grant number 61873117 and Grant number U1609218; in part by the Natural Science Foundation of Shandong Province under Grant number ZR201808160102 and Grant number ZR2019MF051; in part by the Primary Research and Development Plan of Shandong Province under Grant number GG201710090122, Grant number 2017GGX10109, and Grant number 2018GGX101013; and in part by the Fostering Project of Dominant Discipline and Talent Team of Shandong Province Higher Education Institutions.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.

Table 1. Configuration of feature extractor architecture.

Table 2. Description of the image data set.

Table 3. Search space set of CNN and XGBoost by APSO.

Table 5. Description of the experimental data sets.

Table 7. Results of the measured performance of models on the German data set.

Table 8. Results of the measured performance of models on the Australian data set.