An Ensemble Extreme Learning Machine for Data Stream Classification

Extreme learning machine (ELM) is a single hidden layer feedforward neural network (SLFN). Because ELM is fast, it is widely applied in data stream classification tasks. In this paper, a new ensemble extreme learning machine is presented. Different from traditional ELM methods, it embeds a concept drift detection method: an online sequential learning strategy handles gradual concept drift, and classifier updating handles abrupt concept drift, so both kinds of drift can be handled. The experimental results show that the new ELM algorithm not only improves classification accuracy but also adapts to a new concept in a short time.


Introduction
With the explosive growth of the Internet and the rapid development of the information society, many fields generate large numbers of data streams, such as medical diagnosis, online shopping, traffic flow detection and satellite remote sensing. Different from conventional static data, data streams are often characterized by infinite quantity, rapid arrival and concept drift, which makes data stream mining very challenging [1][2][3]. Since data stream classification was put forward, it has attracted much attention from scholars and many results have been achieved [4][5][6][7][8][9]. Up to now, these results fall into three groups: statistical analysis models, decision tree models and neural network models.
Among statistical analysis models, Brzezinski et al. proposed an online learning algorithm called OAUE [10], which uses the mean square error to determine the weights of the classification models. When the detection period is reached, classifiers are updated by a replacement strategy to cope with concept drift. Farid et al. proposed a weighted ensemble classification algorithm [11] in which a clustering algorithm is introduced to detect concept drift. If a data point does not belong to any existing class, the class corresponding to this point is considered a possible new concept, which is then confirmed by the data statistics in the nodes. Bifet et al. proposed an adaptive window algorithm called HWF-ADWIN [12]. It uses the Hoeffding inequality [13], applied to the attributes with the largest and second-largest information gain, to split nodes when training a classifier; when the accuracy of the classifier changes significantly, concept drift is considered to have happened. Xu et al. proposed a data stream classification method based on the Kappa coefficient [14]; during classification, the algorithm calculates the Kappa coefficient of each block and detects changes of concept in the data stream from these coefficients. When the concept of the data stream changes, the system eliminates the classifiers that do not meet the requirements according to the existing knowledge. Compared with the baseline algorithms, this algorithm not only obtains a higher accuracy but also reduces the time cost to a certain extent.
Decision tree models are very common in data stream classification tasks and there have been many publications. Domingos and Hulten et al. proposed a series of algorithms based on the Hoeffding tree called VFDT and CVFDT [15,16]; Wu and Li et al. proposed semi-random decision tree algorithms [17,18]; Brzezinski et al. proposed a red-black tree structure algorithm to improve the efficiency of finding and removing outdated nodes for imbalanced data stream classification [19]. Rutkowski et al. developed a McDiarmid Tree algorithm based on the McDiarmid inequality, in which the threshold on the difference between the largest and second-largest information gain is determined by the McDiarmid bound [20].
With the growing popularity of neural networks, many scholars have applied them to data stream classification tasks. Many algorithms have been proposed for imbalanced data stream classification [21], telecommunication fraud detection [22], spatiotemporal event streams [23] and so on. However, statistical analysis models, decision tree models and neural network models either need to scan the data several times or have many parameters that need tuning. These drawbacks limit the wider use of these models in data stream environments.
Extreme learning machine is a single hidden layer feedforward neural network; the input weights and biases of the hidden layer are randomly generated and the output weights are determined automatically from the input data [24][25][26][27][28]. ELM does not need to adjust parameters repeatedly and trains much faster than traditional neural networks [29], so it is very suitable for data stream classification tasks. Liang et al. proposed an ELM algorithm based on an online sequential learning mechanism called OS-ELM [30], which extends ELM to the field of data stream classification. After OS-ELM was proposed, many scholars proposed improved versions of it. Gu et al. proposed a timeliness online sequential extreme learning machine (TOSELM) for the timeliness problem [31]; it adopts batch processing and a weighting mechanism to give TOSELM good stability and prediction ability. Shao et al. proposed a regularized extreme learning machine with online sequential learning called OS-RELM [32]. OS-RELM combines OS-ELM and RELM [33]; it minimizes the error rate and the norm of the weights at the same time, so that OS-RELM has good generalization performance. Zhao et al. proposed FOS-ELM, which has a forgetting mechanism, for time-sensitive stock data [34]. FOS-ELM only uses the latest data to update the model, so outdated data do not take part in updating the weights of the output layer. Bilal et al. proposed an ensemble online sequential extreme learning machine for imbalanced classification [35]; each OS-ELM focuses on the minority class data and is trained with a balanced subset of the data stream. For distributed multi-agent systems, Vanli et al. proposed an online nonlinear extreme learning machine [36]; it uses an optimization method to minimize empirical risk and structural risk. Singh et al. applied OS-ELM to an intrusion detection system [37]; before processing the data, it applies feature selection to eliminate redundant or unrelated attributes.
OS-ELM and the variants above provide a number of ways to solve the problem of data stream classification. However, most of them lack a concept drift detection mechanism; they perform well on data streams without concept drift or with slowly changing concepts, but cannot cope with rapid changes of concept. In this paper, an ensemble extreme learning machine with concept drift detection (CELM) is proposed. CELM uses manifold learning to reduce the dimensions of the data and introduces a concept drift detection mechanism that overcomes this shortcoming of OS-ELM. The contributions of this paper are as follows:

• An ensemble extreme learning machine algorithm is presented. In the data stream environment, the performance of ensemble classifiers is better than that of a single classifier [38], so CELM employs an ensemble learning method to improve the performance of ELMs.

• Because data stream classification has strict real-time requirements and high-dimensional data tend to reduce the efficiency of the algorithm, CELM introduces a manifold learning method to reduce the dimension of the data, which reduces the time consumption of CELM.

• Concept drift detection is incorporated into the training process of ELM classifiers. The change of data stream is divided into three categories: normal condition, warning level and concept drift. Different from the traditional ELMs, CELM not only can detect gradual concept drift, but also can handle abrupt concept drift.

The rest of this paper is organized as follows: Section 2 reviews the background knowledge of data stream classification and ELM. Section 3 states the details of ELM, and then elaborates the dimension-reduction method based on manifold learning and the principles of CELM. In Section 4, CELM is compared with other algorithms and the experimental results are discussed. Finally, Section 5 concludes the research and gives future directions.

Background Knowledge
In this section, we give a brief introduction about data stream classification and extreme learning machine and explain their basic principles.

Data Stream Classification
Let $S = \{d_1, d_2, \dots, d_t, \dots\}$ be a data stream generated by a system, and $d_t = (x_t, y_t)$ a datum at moment $t$, where $x_t = (x_{t1}, x_{t2}, \dots, x_{tm})$, $m$ is the number of features of $d_t$ and $y_t$ is the class label. Data stream classification generally adopts a sliding window mechanism: several data make up a dataset called a data block, denoted $B_i$, where $B_i = \{d_{i1}, d_{i2}, \dots, d_{in}\}$ and $n$ is the size of the data block. At every moment, only one or several data blocks are allowed to enter the sliding window. After one data block is processed, a new data block can be loaded into the sliding window.
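The block mechanism described above can be illustrated with a minimal Python sketch. The function name `data_blocks`, the toy stream and the choice to drop a trailing partial block are assumptions made for illustration; they are not part of the original method description.

```python
from typing import Iterable, Iterator, List, Tuple

# A datum is a pair (feature vector x_t, class label y_t).
Datum = Tuple[List[float], int]


def data_blocks(stream: Iterable[Datum], n: int) -> Iterator[List[Datum]]:
    """Yield consecutive, non-overlapping data blocks B_i of n data each."""
    if n <= 0:
        raise ValueError("block size n must be positive")
    block: List[Datum] = []
    for datum in stream:
        block.append(datum)
        if len(block) == n:
            yield block
            block = []
    # Assumption of this sketch: a trailing partial block is dropped.


# Hypothetical toy stream of 7 data with m = 2 features each.
toy_stream = [([float(t), float(t) / 2], t % 2) for t in range(7)]
blocks = list(data_blocks(toy_stream, n=3))
```

Because the function is a generator, only one block is held in memory at a time, which matches the constraint that a new block is loaded only after the previous one has been processed.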
Suppose that, during a period ∆t, the error rate of the classifier system in the sliding window stays at a low level; then the concept of the data stream is said to be stable in this period and P (error where error is the current error rate of the classifier system, best is the classification error rate of the best-performing classifier for the data stream and α is a significance level. Let the classification model of the data stream be M, trained on the data blocks in the sliding window at moment t; after ∆t, the classification model changes to N. If M ≠ N, concept drift has happened in the data stream. If ∆t is short, the drift is called abrupt concept drift; otherwise, it is called gradual concept drift [14].

Extreme Learning Machine
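Extreme learning machine is a single hidden layer feedforward neural network. The input weights and biases are randomly generated, while the output weights are determined analytically. Compared with traditional methods such as the BP neural network [39], ELM is faster [40,41]. The structure of ELM is shown in Figure 1. For $N$ arbitrary distinct samples $\{(x_i, t_i)\}_{i=1}^{N}$ with $x_i \in \mathbb{R}^n$ and $t_i \in \mathbb{R}^m$, an ELM with $L$ hidden nodes and activation function $g(\cdot)$ is modeled as

$$ o_i = \sum_{j=1}^{L} \beta_j \, g(w_j \cdot x_i + b_j), \quad i = 1, 2, \dots, N, \tag{1} $$

where $w_j = [w_{j1}, w_{j2}, \dots, w_{jn}]^T$ is the vector of weights connecting the $j$th hidden node with the input nodes, $\beta_j = [\beta_{j1}, \beta_{j2}, \dots, \beta_{jm}]^T$ is the vector of weights connecting the $j$th hidden node with the output nodes, and $b_j$ is the bias of the $j$th hidden node. According to the theory [24], ELM can approximate these $N$ samples with zero error, i.e., $\sum_{i=1}^{N} \| o_i - t_i \| = 0$. Thus, the output of ELM can be expressed compactly as

$$ H\beta = T, \tag{2} $$

where $H$ is the output matrix of the hidden layer and $T$ is the target matrix of the output layer:

$$ H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_L \cdot x_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_L \cdot x_N + b_L) \end{bmatrix}_{N \times L}, \quad T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}_{N \times m}. \tag{3} $$

The output weight matrix $\beta$ can be estimated as

$$ \beta = H^{\dagger} T, \tag{4} $$

where $H^{\dagger}$ is the Moore-Penrose generalized inverse of the hidden layer output matrix $H$. It can be computed by the orthogonal projection method, the orthogonalization method or singular value decomposition (SVD) [42]. To improve the generalization performance of ELM, regularization is introduced and the optimization problem of ELM is as follows:

$$ \min_{\beta} \; \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\sum_{i=1}^{N} \|\xi_i\|^2 \quad \text{s.t.} \quad h(x_i)\beta = t_i^T - \xi_i^T, \quad i = 1, 2, \dots, N, \tag{5} $$

where $h(x_i)$ is the $i$th row of $H$, $C$ is a penalty factor, and $\xi_i$ is the training error of the $i$th sample; the penalty term is used to reduce over-fitting. According to the KKT conditions [26], if $L < N$, $\beta$ is

$$ \beta = \left( \frac{I}{C} + H^T H \right)^{-1} H^T T. \tag{6} $$

Thus, the output of ELM is

$$ f(x) = h(x)\beta = h(x)\left( \frac{I}{C} + H^T H \right)^{-1} H^T T. \tag{7} $$

If $L \geq N$, $\beta$ is

$$ \beta = H^T \left( \frac{I}{C} + H H^T \right)^{-1} T. \tag{8} $$

Thus, the final output of ELM is

$$ f(x) = h(x)\beta = h(x) H^T \left( \frac{I}{C} + H H^T \right)^{-1} T. \tag{9} $$

The classification label of ELM is

$$ \text{label}(x) = \arg\max_{1 \leq i \leq m} f_i(x), \tag{10} $$

where $f(x) = [f_1(x), f_2(x), \dots, f_m(x)]$. From the above descriptions, the steps of ELM are summarized as follows (Algorithm 1) [24,25]:

Algorithm 1 ELM.
Input: The training dataset X; the number of hidden nodes L; the activation function g(·);
Output: ELM classifier.
Step 1: Randomly generate the input weights $w_j$ and biases $b_j$, $j = 1, 2, \dots, L$;
Step 2: Calculate the output matrix of the hidden layer H for dataset X;
Step 3: Obtain the output weights β according to Equation (6) or Equation (8);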
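The three steps of Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' MATLAB implementation: the function names, the uniform range [-1, 1] for the random weights and the toy two-class data are hypothetical, and the closed forms used for β are the standard regularized ELM solutions for the cases L < N and L ≥ N.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_elm(X, T, L, C=1000.0, rng=None):
    """Train a regularized ELM. X: (N, n) inputs, T: (N, m) one-hot targets."""
    rng = np.random.default_rng(rng)
    N, n = X.shape
    # Step 1: random input weights and biases (the range is an assumption).
    W = rng.uniform(-1.0, 1.0, size=(n, L))
    b = rng.uniform(-1.0, 1.0, size=L)
    # Step 2: hidden layer output matrix H, shape (N, L).
    H = sigmoid(X @ W + b)
    # Step 3: output weights, choosing the smaller linear system to solve.
    if L < N:
        beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)
    else:
        beta = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)
    return W, b, beta


def predict_elm(model, X):
    """Return the index of the largest output for each row of X."""
    W, b, beta = model
    return np.argmax(sigmoid(X @ W + b) @ beta, axis=1)


# Hypothetical toy data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(50, 2)),
               rng.normal(2.0, 0.5, size=(50, 2))])
y = np.repeat([0, 1], 50)
T = np.eye(2)[y]  # one-hot targets
model = train_elm(X, T, L=20, rng=1)
accuracy = np.mean(predict_elm(model, X) == y)
```

Solving the linear system with `np.linalg.solve` instead of forming an explicit inverse is a numerical convenience of this sketch; the result is the same β.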

The Basic Principles of CELM
In this section, we introduce the dimension-reduction method which is used to reduce the dimension of the data at first, and then explain the details of concept drift detection mechanism and classification steps of CELM.

The Method of Dimensionality Reduction for Data Stream
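Dimensionality reduction is important for data stream classification. It reduces the dimension of the data and improves the efficiency of the algorithm. In this paper, the LLE method [43] is used to handle the data stream. Let a data block be $B = \{x_1, x_2, \dots, x_n\}$. LLE (https://cs.nyu.edu/~roweis/lle/) finds $k$ neighborhood points of $x_i$ to reconstruct $x_i$. The objective function of the optimization problem is as follows:

$$ \min_{W} \sum_{i=1}^{n} \Big\| x_i - \sum_{j} w_{ij} x_j \Big\|_2^2 \quad \text{s.t.} \quad \sum_{j} w_{ij} = 1, \tag{11} $$

where $w_{ij}$ is the weight of the neighborhood sample $x_j$. If $x_j$ is not in the neighborhood of $x_i$, $w_{ij} = 0$. From Equation (11), it follows:

$$ \sum_{i=1}^{n} \Big\| x_i - \sum_{j} w_{ij} x_j \Big\|_2^2 = \sum_{i=1}^{n} \Big\| \sum_{j} w_{ij} (x_i - x_j) \Big\|_2^2 = \sum_{i=1}^{n} w_i^T Z_i w_i, \tag{12} $$

where $w_i = [w_{i1}, w_{i2}, \dots, w_{ik}]^T$ collects the weights of the $k$ neighbors of $x_i$ and $Z_i$ is the $k \times k$ matrix with entries $(Z_i)_{jl} = (x_i - x_j)^T (x_i - x_l)$, so it will have

$$ \min_{W} \sum_{i=1}^{n} w_i^T Z_i w_i \quad \text{s.t.} \quad w_i^T 1_k = 1, \tag{13} $$

where $1_k$ is a vector in which all elements are 1. The optimization function of Equation (13) can be expressed as

$$ L(W) = \sum_{i=1}^{n} \Big[ w_i^T Z_i w_i + \lambda_i \big( w_i^T 1_k - 1 \big) \Big]. \tag{14} $$

From Equations (13) and (14), it will obtain

$$ w_i = \frac{Z_i^{-1} 1_k}{1_k^T Z_i^{-1} 1_k}. \tag{15} $$

For the low-dimensional representation $Y = [y_1, y_2, \dots, y_n] \in \mathbb{R}^{d \times n}$ of the data block, the objective of dimension reduction is to minimize the following loss function:

$$ \min_{Y} \sum_{i=1}^{n} \Big\| y_i - \sum_{j} w_{ij} y_j \Big\|_2^2. \tag{16} $$

Equation (16) can be changed as

$$ \sum_{i=1}^{n} \Big\| y_i - \sum_{j} w_{ij} y_j \Big\|_2^2 = \big\| Y (I - W)^T \big\|_F^2 = \operatorname{tr}\big( Y (I - W)^T (I - W) Y^T \big), \tag{17} $$

where $W = [w_{ij}]_{n \times n}$. Let $M = (I - W)^T (I - W)$, so the objective function of the optimization problem is

$$ \min_{Y} \operatorname{tr}\big( Y M Y^T \big) \quad \text{s.t.} \quad Y Y^T = I. \tag{18} $$

Construct the following Lagrange function

$$ L(Y) = \operatorname{tr}\big( Y M Y^T \big) - \operatorname{tr}\big( \Lambda ( Y Y^T - I ) \big), \tag{19} $$

where $\Lambda$ is a diagonal matrix of Lagrange multipliers. Setting the partial derivative of $L(Y)$ with respect to $Y$ to zero gives

$$ M Y^T = Y^T \Lambda. \tag{20} $$

Equation (20) means that the rows of $Y$ are eigenvectors of $M$. To get $d$-dimensional data, it is only necessary to find the $d + 1$ eigenvectors corresponding to the $d + 1$ smallest eigenvalues of the matrix $M$, and $Y = \{y_2, y_3, \dots, y_{d+1}\}$, where $y_j$ here denotes the $j$th of these eigenvectors. The dimension-reduction algorithm of CELM is as follows (Algorithm 2).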

Algorithm 2 Dimension-reduction of data stream.
Input: Data stream S; the size of data block $B_i$: winsize; k and d;
Calculate the d + 1 eigenvectors of the matrix M;
Get the low-dimensional matrix Y;
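The dimension-reduction step can be illustrated with a compact NumPy sketch of standard LLE applied to one data block. It is a sketch under stated assumptions rather than the authors' code: the function name `lle`, the brute-force neighbor search and the small regularization term added to each local Gram matrix (used to keep it invertible) are choices made here for illustration.

```python
import numpy as np


def lle(X, k, d, reg=1e-3):
    """Embed the rows of X (n, D) into d dimensions with standard LLE."""
    n = X.shape[0]
    W = np.zeros((n, n))
    # Brute-force pairwise squared distances, adequate for one small block.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    for i in range(n):
        order = np.argsort(sq[i])
        nbrs = order[order != i][:k]          # k nearest neighbors of x_i
        G = X[nbrs] - X[i]                    # neighbors centered on x_i
        Z = G @ G.T                           # local Gram matrix, (k, k)
        # Assumption: ridge term so that Z is invertible when k > D.
        Z += reg * (np.trace(Z) + 1e-12) * np.eye(k)
        w = np.linalg.solve(Z, np.ones(k))
        W[i, nbrs] = w / w.sum()              # weights of x_i sum to one
    I = np.eye(n)
    M = (I - W).T @ (I - W)
    # eigh returns eigenvalues in ascending order; skip the smallest one.
    _, vecs = np.linalg.eigh(M)
    return vecs[:, 1:d + 1]


# Hypothetical block of 30 samples with 5 features, reduced to d = 2.
X_block = np.random.default_rng(0).normal(size=(30, 5))
Y_low = lle(X_block, k=5, d=2)
```

Here the embedding is returned with samples as rows, i.e. as an (n, d) array, which is the layout a downstream classifier would typically expect.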

The Data Stream Classification and Concept Drift Detection of CELM
A data stream is different from traditional static data: concept drift often happens, so concept drift detection must be included in the training process. For a data block $B_i$, the error rate of the classifiers is $p_i$, a random variable obeying the Bernoulli distribution, so the standard deviation is

$$ s_i = \sqrt{\frac{p_i (1 - p_i)}{i}}, \tag{21} $$

where $i$ is the number of samples [44,45]. In this paper, CELM utilizes $p_i$ and $s_i$ to detect concept drift. The change of data stream is divided into three types: stable, warning level and concept drift.
If $p_i + s_i < p_{\min} + 2s_{\min}$ and $p_i < \varepsilon$, where $\varepsilon$ is a threshold, it suggests that the error rate of the classifier system is at a low level and the concept of the data stream is stable. Thus, the classifiers are suitable for the classification task of the current data stream and they do not need any adjustment.
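If $p_i + s_i \geq p_{\min} + 2s_{\min}$ and $p_i < \varepsilon$, it suggests that the error rate of the classifier system is still at a low level, but the performance of the classifiers fluctuates considerably. The classifiers give a warning and CELM uses the online sequential learning mechanism [30] to update each classifier. At the initial time, let the data block be $B_0 = \{(x_i, t_i)\}_{i=1}^{N_0}$, so the output matrix of the hidden layer $H_0$ and the initial target matrix $T_0$ are

$$ H_0 = \big[ g(w_j \cdot x_i + b_j) \big]_{N_0 \times L}, \quad T_0 = [t_1, t_2, \dots, t_{N_0}]^T. \tag{22} $$

The initial output weight of ELM $\beta^{(0)}$ is

$$ \beta^{(0)} = K_0^{-1} H_0^T T_0, \tag{23} $$

where $K_0 = H_0^T H_0$. After the $(k+1)$th data block comes into the sliding window, the data block is $B_{k+1} = \{(x_i, t_i)\}_{i = (\sum_{j=0}^{k} N_j) + 1}^{\sum_{j=0}^{k+1} N_j}$, with target matrix $T_{k+1}$. The output matrix of the hidden layer $H_{k+1}$ is

$$ H_{k+1} = \big[ g(w_j \cdot x_i + b_j) \big]_{N_{k+1} \times L}, \quad x_i \in B_{k+1}. \tag{24} $$

$K_{k+1}$ and $\beta^{(k+1)}$ are updated as

$$ K_{k+1} = K_k + H_{k+1}^T H_{k+1}, \tag{25} $$

$$ \beta^{(k+1)} = \beta^{(k)} + K_{k+1}^{-1} H_{k+1}^T \big( T_{k+1} - H_{k+1} \beta^{(k)} \big). \tag{26} $$

Calculating the output weight matrix $\beta$ requires a matrix inversion, and computing the pseudo inverse is expensive, so the Woodbury formula is often used to reduce the computation [37]:

$$ K_{k+1}^{-1} = K_k^{-1} - K_k^{-1} H_{k+1}^T \big( I + H_{k+1} K_k^{-1} H_{k+1}^T \big)^{-1} H_{k+1} K_k^{-1}. \tag{27} $$

With the online sequential learning mechanism, when the change of concept in the data stream is small, CELM can update the classifiers to adapt to the change, which is also effective for gradual concept drift.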
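The online sequential update described above can be sketched in NumPy as follows. This follows the standard OS-ELM recursion with the Woodbury identity; the function names are hypothetical, and the small ridge term I/C added to the initial matrix is an assumption of this sketch that keeps the inverse well defined.

```python
import numpy as np


def oselm_init(H0, T0, C=1000.0):
    """Initial phase: return P_0 = K_0^{-1} and beta^(0) for the first block."""
    L = H0.shape[1]
    # Assumption: K_0 = H_0^T H_0 + I/C (ridge term added for invertibility).
    P = np.linalg.inv(H0.T @ H0 + np.eye(L) / C)
    beta = P @ H0.T @ T0
    return P, beta


def oselm_update(P, beta, H, T):
    """Sequential phase: update P = K^{-1} and beta with a new block (H, T)."""
    # Woodbury identity: only a (block size x block size) system is solved.
    S = np.eye(H.shape[0]) + H @ P @ H.T
    P_new = P - P @ H.T @ np.linalg.solve(S, H @ P)
    beta_new = beta + P_new @ H.T @ (T - H @ beta)
    return P_new, beta_new


# Hypothetical hidden-layer outputs and targets for two blocks.
rng = np.random.default_rng(0)
H0, T0 = rng.normal(size=(20, 5)), rng.normal(size=(20, 2))
H1, T1 = rng.normal(size=(8, 5)), rng.normal(size=(8, 2))

P0, beta0 = oselm_init(H0, T0)
P1, beta1 = oselm_update(P0, beta0, H1, T1)

# Reference: batch solution on both blocks at once.
H_all, T_all = np.vstack([H0, H1]), np.vstack([T0, T1])
beta_batch = np.linalg.solve(H_all.T @ H_all + np.eye(5) / 1000.0,
                             H_all.T @ T_all)
```

The sequential result `beta1` agrees with the batch solution `beta_batch` up to rounding, which is why the classifier does not need to be retrained from scratch when a new block arrives.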
If $p_i + s_i \geq p_{\min} + 2s_{\min}$ or $p_i \geq \varepsilon$, it indicates that the change of the data stream is too large or the performance of the classifiers is poor. The classification model no longer fits the current data stream, so all classifiers must be deleted and a new set of classifiers trained. The steps of CELM are summarized in Algorithm 3. From these steps, it can be seen that, when the change of the data stream is small, CELM uses the online sequential learning mechanism to update the classifiers, so the classifiers reuse the previous model and do not need to be retrained again and again; in other words, the method also gives a way to handle gradual concept drift. In addition, the dimension-reduction algorithm, which preprocesses the data blocks, and the advantages of ELM help CELM keep a good performance and a fast speed.
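The three-way decision described above can be written as a small Python function. This is an illustrative sketch: the function name and the string labels are hypothetical, the values p_min and s_min are taken as given inputs because their bookkeeping is not spelled out in the text, and the branch order follows the if / else-if chain of Algorithm 3, under which the drift branch is reached exactly when p_i ≥ ε.

```python
import math


def drift_state(p_i, n_samples, p_min, s_min, eps=0.3):
    """Classify the state of the stream as 'stable', 'warning' or 'drift'.

    p_i is the current error rate, n_samples the number of samples used to
    estimate it, and eps the error-rate threshold (0.3 in the experiments).
    """
    # Standard deviation of a Bernoulli error rate.
    s_i = math.sqrt(p_i * (1.0 - p_i) / n_samples)
    low_error = p_i < eps
    if low_error and p_i + s_i < p_min + 2.0 * s_min:
        return "stable"    # keep the classifiers unchanged
    if low_error:
        return "warning"   # update the classifiers online
    return "drift"         # delete and retrain the classifiers


states = [
    drift_state(0.10, 100, p_min=0.10, s_min=0.03),
    drift_state(0.20, 100, p_min=0.10, s_min=0.03),
    drift_state(0.40, 100, p_min=0.10, s_min=0.03),
]
```

In the first call, p_i + s_i = 0.13 is below p_min + 2 s_min = 0.16, so the stream is treated as stable; in the second, p_i + s_i = 0.24 exceeds that bound while the error rate stays under ε, which triggers the warning level; the third exceeds ε.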

Experiments and Data Analysis
In this section, experiments and data analysis are carried out to test the performance of CELM. OS-ELM [30], SEA [46], AE [47] and M_ID4 [48] are used as comparison algorithms. All algorithms were executed on the MATLAB 2017a platform under Windows 7, with an Intel quad-core 3.30 GHz CPU and 8 GB of memory. Ten artificial and real datasets are used in the experiments. The base classifier of SEA, AE and M_ID4 is a decision tree and the number of sub-classifiers is set to 5. For CELM, the parameter C = 1000, the neighbourhood size k = 5 and the threshold ε = 0.3. For M_ID4, the threshold θ = 0.01 and the decay factor b = 0.5. The activation function of CELM and OS-ELM is the sigmoid function.

Datasets
First, we give a brief introduction to the datasets. All artificial datasets are generated with the MOA platform [49]. Among the artificial datasets, only the hyperplane dataset is explained here; descriptions of the other datasets can be found on the UCI website (http://archive.ics.uci.edu/ml/datasets.html) and in the help handbook. The basic information of the datasets is shown in Table 1.
Hyperplane is a gradual concept drift dataset. In a $d$-dimensional space, a hyperplane is defined as $\sum_{i=1}^{d} w_i x_i = w_0$, where $x_i \in [0, 1]$, $w_i \in [-10, 10]$ and $w_0 = \frac{1}{2} \sum_{i=1}^{d} w_i$. If $\sum_{i=1}^{d} w_i x_i \geq w_0$, the point is labeled as positive; if $\sum_{i=1}^{d} w_i x_i < w_0$, the point is labeled as negative.
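The labeling rule of the hyperplane dataset can be reproduced with a short NumPy sketch. This is not the MOA generator itself; the function name `hyperplane_block` and the example weights are hypothetical and serve only to illustrate the definition.

```python
import numpy as np


def hyperplane_block(n, w, rng=None):
    """Draw n points in [0, 1]^d and label them with the hyperplane rule."""
    rng = np.random.default_rng(rng)
    w = np.asarray(w, dtype=float)
    X = rng.uniform(0.0, 1.0, size=(n, w.size))
    w0 = 0.5 * w.sum()                  # w_0 = (1/2) * sum_i w_i
    y = (X @ w >= w0).astype(int)       # 1 = positive, 0 = negative
    return X, y


# Hypothetical weights in [-10, 10] for d = 3.
X_hp, y_hp = hyperplane_block(200, w=[1.0, -2.0, 3.0], rng=0)
```

Changing the weights w slightly from one block to the next moves the decision boundary slowly, which is one way a gradual drift of the concept can arise in such data.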

The Comparison Results of CELM and Comparison Algorithms on the Test Datasets
To test the performance of CELM and the comparison algorithms, the algorithms are executed on 10 datasets. The test results are shown in Tables 2 and 3. Tables 2 and 3 show that CELM gets the best results on four datasets; SEA, AE and OS-ELM each get the best results on two datasets; and M_ID4 gets only one best result. In addition, the average accuracy of CELM is the best of all. In terms of time consumption, OS-ELM is the fastest and CELM the second fastest, but the accuracies of CELM are much higher than those of OS-ELM. Thus, it can be concluded that CELM performs better than the other algorithms in most conditions. On the Ozone dataset, CELM and OS-ELM get the same highest accuracy because Ozone has no abrupt concept drift and CELM degenerates into OS-ELM; in other words, there is no difference between CELM and OS-ELM when the dataset has no abrupt concept drift. Figure 2a-j shows how the accuracies of CELM and OS-ELM change with different winsize values. On the voice, waveform, letter, occupancy and protein datasets, CELM is much better than OS-ELM; the classification performance of OS-ELM is at a low level because there are many abrupt concept drifts in those datasets. This suggests that OS-ELM is not suited to data streams with abrupt concept drift, whereas CELM has an obvious advantage in handling them. On the other datasets, the test results of OS-ELM are better than those of CELM. However, the curves show that there is no big difference in accuracy between OS-ELM and CELM on those datasets, and both get good results because the change of concept in those datasets is small. Therefore, it can be concluded that CELM can cope with both gradual and abrupt concept drift, while OS-ELM can only cope with gradual concept drift; thus, CELM is better than OS-ELM.

The Effect of the Values of d on the Performance of CELM
To test the effect of d on the performance of CELM, this paper executes CELM with different values of d, which is a parameter of Algorithm 2. The activation function of CELM is the sigmoid function; the size of the sliding window is 90; and the number of hidden nodes is 30.
Table 4 shows the dimension reduction of the datasets tested with CELM. From the analysis of Table 2, it is known that the performance of CELM is the best. CELM reduces the dimensions of most datasets. In other words, the manifold learning dimensionality reduction method in CELM is effective. Figure 3 presents the results of CELM on the experimental datasets with different d values. d is an important parameter of the dimension-reduction algorithm presented as Algorithm 2. The data lose more information if d is small and keep many redundant features if d is large. The performance of CELM changes when the value of d changes. The accuracy of CELM fluctuates strongly on the voice, waveform, adult, letter, occupancy, hill and protein datasets and fluctuates less on the other datasets, as shown in Figure 3 and Table 5. This shows that the value of d affects the dimensionality reduction algorithm. In addition, the performance of CELM is affected if the value of d is too large or too small; therefore, the user needs to select an appropriate value for the manifold learning algorithm.

Conclusions
Data stream classification has been a hot research topic in recent years, and handling data streams with concept drift has high practical value. A new ensemble extreme learning machine with concept drift detection (CELM) is presented in this paper. CELM applies a manifold learning method to reduce the dimensions of data blocks and divides the changes of concept in the data stream into three types: stable, warning and concept drift. Through online sequential learning and the concept drift detection mechanism, the algorithm can handle both gradual and abrupt concept drift, which expands the application scope of ELM. The experimental results also indicate that the proposed algorithm is effective for data stream classification.
The algorithm still has some problems to be solved. The number of hidden nodes L and the manifold learning parameter d have a great impact on CELM. How to select appropriate values for these parameters is a research direction for future work.

Figure 1. The structure of ELM.

Algorithm 3 The steps of CELM.
Input: Data stream S; the size of data block $B_i$: winsize; k and d; ε; K classifiers;
Output: An ensemble classifier system.
while S ≠ NULL do
    Get a data block $B_i$ from the sliding window;
    Use Algorithm 2 to reduce the dimension of $B_i$;
    if $p_i + s_i < p_{\min} + 2s_{\min}$ && $p_i < \varepsilon$ then
        The data stream is stable; directly use the classifiers to finish the classification task;
    else if $p_i + s_i \geq p_{\min} + 2s_{\min}$ && $p_i < \varepsilon$ then
        Use the online learning mechanism to update the classifiers as in Equations (21)-(27);
    else if $p_i + s_i \geq p_{\min} + 2s_{\min}$ || $p_i \geq \varepsilon$ then
        Concept drift has happened; delete all classifiers and retrain each classifier as in Algorithm 1;
    end if
end while

Table 1. The information of the experimental datasets.

Table 2. The test accuracies of the algorithms on the experimental datasets.

Table 3. The time consumption of the algorithms on the experimental datasets.

Table 4. The dimension reduction results of CELM for the tests in Table 2.

Table 5. The accuracy standard deviation of CELM for the tests in Figure 3.