1. Introduction
The rapid growth of technology expands application areas day by day. In recent years, applications such as social networks [1], electronic business [2], cloud computing [3,4], computer network measurement [5,6,7], and the Internet of Things [8] have been generating large volumes of data [9]. Such large-volume data are known as data streams, and they have distinctive characteristics. A data stream is an infinite sequence whose probability distribution may change dynamically over time, and it must be processed in real time without interruption. Moreover, instances arrive continuously, so the entire data set is never available from the start, and the arrival order cannot be controlled. Each instance can be large in scale and can be analyzed only once. Collecting the true class labels of all instances in a stream is infeasible in real-time scenarios. These characteristics make processing data streams highly challenging [10].
Feature extraction is one of the main processing steps in data mining and machine learning applications. It aims to extract useful features by projecting data from a high-dimensional space into a lower-dimensional space. Feature extraction helps to achieve accurate results in large-scale data stream classification. However, traditional feature extraction techniques fail to satisfy the requirements of data streams. Therefore, incremental feature extraction techniques have been designed to facilitate feature extraction from data streams [11]. They aim to solve the problems that traditional methods face with data streams. Although incremental feature extraction techniques address various problems, some of them are unsuitable for high-dimensional data streams in terms of time complexity and computational cost.
The large majority of unsupervised approaches can easily be applied to data streams. However, most of them suffer from data dependency and high computational cost when determining the eigenspace for large-scale data. Moreover, they require a long time to process each incoming sample instantly. These problems make incremental feature extraction algorithms complicated for data streams, and this difficulty motivates the use of alternative unsupervised approaches such as the Discrete Cosine Transform (DCT) [12].
In this paper, a DCT-based feature extraction approach is presented for data streams. The study aims to show that the DCT-based algorithm is superior to well-known incremental feature extraction algorithms for data stream feature extraction. The well-known incremental Principal Component Analysis (IPCA)-based feature extraction approaches are employed for performance comparison. The main contributions of the paper are as follows.
A novel, efficient DCT-based incremental feature extraction approach is developed for data streams to overcome the computational cost and time complexity problems.
The proposed approach is based on DCT and PSO. To our knowledge, this is the first time DCT and PSO have been used together in a data stream feature extraction algorithm.
The remainder of this paper is organized as follows. Section 2 gives a quick review of the well-known IPCA algorithms and DCT-based data stream approaches. Section 3 presents the proposed DCT-based data stream feature extraction approach in detail. The experimental settings and performance evaluations are given in Section 4. Conclusions and future work are presented in Section 5.
Notation—The bold letters denote vectors; x is the spatial coordinate in the sample domain; f denotes a 1-D input vector with N data values f(x); u is the frequency coordinate in the transform domain; C denotes the 1-D DCT coefficient vector with N values C(u); α(u) is a constant whose value depends on u; x_i denotes the ith sample of the data stream.
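For reference, the standard 1-D DCT (type II) consistent with this notation is as follows (a textbook definition restated here, not copied from Section 3):

```latex
C(u) = \alpha(u) \sum_{x=0}^{N-1} f(x)\cos\!\left[\frac{(2x+1)u\pi}{2N}\right],
\quad u = 0,1,\dots,N-1,
\qquad
\alpha(u) =
\begin{cases}
\sqrt{1/N}, & u = 0,\\
\sqrt{2/N}, & u \neq 0.
\end{cases}
```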
2. Related Work
In this section, unsupervised incremental feature extraction algorithms and DCT-based data stream studies are reviewed.
The most popular incremental feature extraction algorithms are based on PCA. PCA [13] was proposed as a dimensionality reduction and feature extraction algorithm for traditional data, and many incremental versions of PCA (IPCA) have been proposed to perform PCA in an incremental learning manner. In the literature, IPCA algorithms are divided into two categories [14]. Algorithms in the first category are based on calculating eigenvectors and eigenvalues for each new incoming sample, and the different incremental representations of the covariance matrix are the main source of variation among them. Due to the nature of incremental learning, the covariance matrix must be updated with each new sample. However, as the scale and the number of features increase, the computational cost increases correspondingly, and updating the covariance matrix and calculating a new eigenspace becomes difficult for each new sample. Besides, these algorithms suffer from an unpredictable approximation error.
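To make the cost concrete, one common rank-one formulation of this per-sample update is sketched below (a generic form; the cited algorithms differ in detail). With mean μ_n and covariance C_n after n samples,

```latex
\boldsymbol{\mu}_n = \boldsymbol{\mu}_{n-1} + \frac{1}{n}\left(\mathbf{x}_n - \boldsymbol{\mu}_{n-1}\right),
\qquad
\mathbf{C}_n = \frac{1}{n}\left[(n-1)\,\mathbf{C}_{n-1}
             + (\mathbf{x}_n - \boldsymbol{\mu}_{n-1})(\mathbf{x}_n - \boldsymbol{\mu}_n)^{\top}\right].
```

Each such update costs O(N^2) for N attributes, and extracting features additionally requires an eigendecomposition of the N × N matrix, which is the step that dominates the running time as N grows.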
The first IPCA algorithm in the literature was proposed by Hall and Martin [15]. It updates the covariance matrix for each sample using a residue estimation method. The authors later improved their work by using a chunk structure instead of a single sample; that study is based on merging and splitting the eigenspace using the chunk structure [16]. Liu and Chen [17] proposed an approach based on incrementally updating the eigenspace to detect video shot boundaries. The algorithm computes a histogram representation as soon as a new frame arrives; the determined eigenspace then utilizes the features of new frames to detect the shot boundary, and finally the eigenspace is updated by a PCA-based incremental algorithm. Li [18] developed an incremental and robust subspace learning algorithm with two eigendecomposition steps for computing the eigenvalues and eigenvectors. First, the algorithm calculates the initial principal components from the first observations; afterward, the main eigenvectors are obtained from the previous eigenvectors and eigenvalues for new observations. Although the algorithm is easy to implement, it suffers from time complexity and computational cost, as do the other PCA-based algorithms. In another study, Ozawa [19] proposed an extended IPCA algorithm based on the accumulation ratio. In IPCA algorithms, the eigenspace is updated through rotation of the eigen-axes and dimensional augmentation; the dimension is augmented when the norm of the residue vector is larger than a threshold value. If the threshold value is too small, a redundant eigenspace is obtained, which decreases computational efficiency and performance. Therefore, determining the best threshold value is a challenge for the existing algorithms, and the extended IPCA uses the accumulation ratio to avoid this need. Later, Ozawa et al. [20] enhanced their extended IPCA by adding a chunk structure, calling the result chunk IPCA. Chunk IPCA uses a chunk model instead of a one-pass data model; the eigenspace is incrementally updated for a chunk of samples at a time. Zhao et al. [14] developed an incremental learning and feature extraction algorithm called SVDU-IPCA, which uses an SVD updating algorithm and does not require recomputing the eigenspace from scratch. Rosas-Arias [21] proposed an online learning methodology for counting vehicles in video sequences; the approach is based on IPCA and employs the SVD algorithm. Fujiwara [22] presented an incremental dimensionality reduction algorithm based on IPCA for visualizing streaming multidimensional data; it also uses SVD for computing the eigenspace. All IPCA algorithms and applications in the first category [23,24,25] suffer from high computational cost and time complexity because the eigenspace must be determined or updated for each incoming sample. Owing to their data dependency, high computational cost, and time complexity, these IPCA algorithms are not suitable for data streams, which require an instant response.
IPCA algorithms in the second category compute the eigenspace without using the covariance matrix. The eigenvectors are calculated one by one from the higher-order principal components; therefore, the number of eigenvectors to be calculated must be known in advance. In addition to these problems, traditional PCA and its incremental versions suffer from data dependency: when new data are added to the database, the covariance matrix and eigenspace must be recomputed. Candid Covariance-Free IPCA (CCIPCA) [26] is a well-known and fast incremental algorithm in this category. Unlike the SVD-based algorithms, CCIPCA does not need to reconstruct the covariance matrix for each new sample. It determines the eigenspace sequentially: the current principal component forms the base of the next one, so the most dominant principal component is computed first and the second is then obtained using the first. CCIPCA is a suitable algorithm for data streams, and it has attracted researchers' attention for developing data stream feature extraction algorithms [10,27]. Wei [27] proposed covariance-free incremental covariance decomposition of compositional data (C-CICD) for data streams, based on the idea of CCIPCA. However, the time complexity of the algorithm does not scale linearly with the number of samples and features. Moreover, because the principal components are computed incrementally, the error propagates, and CCIPCA does not estimate the last eigenvectors accurately [14].
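To illustrate the sequential, covariance-free estimation described above, the following is a minimal Python sketch of the CCIPCA update rule in the spirit of [26] (initialization and the amnesic parameter are simplified; this is not the authors' implementation):

```python
import numpy as np

def ccipca_update(V, n, x):
    """One CCIPCA step: refine k principal component estimates with sample x.

    V : (k, d) array of unnormalized eigenvector estimates; the norm of
        V[i] serves as the estimate of the i-th eigenvalue.
    n : 1-based index of the current (mean-centered) sample x of length d.
    """
    u = np.asarray(x, dtype=float).copy()   # residual passed down components
    for i in range(min(V.shape[0], n)):     # only components seen enough data
        if n == i + 1:
            V[i] = u                        # initialize component i
        else:
            v = V[i] / np.linalg.norm(V[i])
            # Blend the old estimate with the new sample's projection on it:
            # the dominant direction converges first, later ones follow.
            V[i] = (n - 1) / n * V[i] + (1 / n) * (u @ v) * u
        v = V[i] / np.linalg.norm(V[i])
        u = u - (u @ v) * v                 # deflate before next component
    return V
```

Each step touches only the k stored directions, so no N × N covariance matrix is ever formed; this is what makes the method attractive for streams, at the price of the error propagation noted above.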
PCA and IPCA algorithms are linear transformations and extract features linearly. However, a linear transformation cannot always satisfy the needs of the data; in such circumstances, a kernel structure can obtain more accurate results. In the literature, Kernel PCA (KPCA)-based incremental algorithms have been proposed for extracting features from data streams [28,29,30,31,32,33,34]. Incremental KPCA (IKPCA) algorithms suffer from the same problems as IPCA. Moreover, choosing the best kernel type for a data stream is an additional challenge in IKPCA.
The problems discussed above make IPCA algorithms difficult to use for data streams and lead researchers to alternative approaches for incremental feature extraction. The most popular alternative is DCT [12]. DCT has been successfully used for feature extraction in many different research areas [35,36]. Although DCT has been reported in the literature as the best transformation approach after PCA in terms of energy compaction [37], it has advantages over PCA in many respects [38]. The DCT is not a data-dependent algorithm: it does not require recomputation when new data are added to the database. Therefore, computational cost and time complexity are not problems for the DCT. Moreover, DCT can easily be implemented using fast algorithms. The advantages and structure of the DCT indicate that it can respond to streaming data more quickly than PCA. In the literature, there are few DCT-based data stream studies; the existing ones concern data stream clustering [39], analysis of the concept drift problem [40], and analysis of data streams [41]. Apart from these studies, Sharma [42] proposed a visual object tracking method based on sparse 2-D DCT coefficients as discriminative features and incremental learning; the discriminative DCT features are selected using feature probability and ratio classifier criteria. However, that study needs to perform IPCA for subspace learning, and the authors did not tackle the problem in a data stream manner. There is no DCT-based data stream feature extraction and dimensionality reduction study in the literature. Existing studies perform feature extraction in real time [43,44,45]; however, real-time applications need a collected training set to construct a model. This requirement implies batch learning, which conflicts with the nature of streaming data.
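The data independence claimed for the DCT can be made concrete with a short sketch (Python with SciPy's fast DCT; the truncation length k is an illustrative choice, not a value prescribed by this paper):

```python
import numpy as np
from scipy.fft import dct

def dct_features(sample, k):
    """Extract k DCT features from one streaming sample.

    There is no eigenspace to maintain or update: every sample is
    transformed independently in O(N log N) by a fast algorithm.
    """
    coeffs = dct(np.asarray(sample, dtype=float), type=2, norm='ortho')
    return coeffs[:k]          # keep the first k (low-frequency) coefficients

# Each arriving instance is processed exactly once, independently of all
# previously seen data -- unlike IPCA, nothing is recomputed.
x = np.random.rand(54)         # e.g., one 54-attribute stream instance
f = dct_features(x, k=20)
```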
4. Results and Discussion
In this section, the DCT-based data stream feature extraction approach is evaluated on real and synthetic data sets against PCA and IPCA algorithms. Linear and nonlinear feature extraction algorithms are employed for comparison. The linear algorithms are traditional PCA [13], the IPCA proposed by Li [18] (IPCA-Li), the IPCA proposed by Ozawa [19] (IPCA-Ozawa), and CCFIPCA [26]; the nonlinear algorithm is CIKPCA [30]. The proposed approach and the PCA and IPCA algorithms were implemented in MATLAB (R2016b) under Windows 10 (64-bit OS). The CPU of the computer is an Intel® Core™ i7-7500 (2.70 GHz) with 8 GB of random-access memory. All algorithms were implemented as reported in their original papers; the results of CIKPCA are used as reported in its original paper. Three main experiments are conducted in this study. The first investigates the influence of the proposed feature extraction approach on classification; the accuracy rate (Acc) [%], the number of data stream samples classified correctly (NDSCC), and the F-score are the evaluation metrics in this experiment. The second examines the influence of varying the DCT coefficients. The last investigates the effect of automatic feature selection in the proposed DCT-based approach; PSO and APSO algorithms were implemented in MATLAB (R2016b) to handle automatic feature selection.
4.1. Data Sets
The evaluation of the proposed approach is performed on real and synthetic numeric data sets. Forest Cover Type is a real data set available from the UCI Machine Learning Repository [56]. It contains 581,012 observations of seven forest cover types in 30 × 30 m cells, and each observation consists of 54 geological and geographical variables. The data set includes ten quantitative variables, forty binary soil type variables, and four binary wilderness area variables describing the environment. A randomly generated subset of 100,000 samples from Forest Cover Type is used in this paper.
Poker-Hand is a real data set available from the UCI Machine Learning Repository [56]. Each instance describes a poker hand of five cards drawn from a standard deck of 52. The data set contains one million instances, eleven attributes, and two classes; the last attribute describes the class information. As with Forest Cover Type, a randomly generated subset of 100,000 samples from Poker-Hand is used in this paper.
ElecNormNews is a real data set described by M. Harries and analyzed by Gama [56]. It is a normalized version of the Electricity data set, obtained from the Australian New South Wales Electricity Market. It consists of 45,312 instances and eight attributes; the last attribute of each instance describes the class information, and the data set consists of two classes.
Optic-digits is an optical character recognition data set available from the UCI Machine Learning Repository [56]. It contains 5620 instances, 64 attributes, and 10 classes.
DS1 and Waveform are synthetic data sets generated with Massive Online Analysis (MOA) [57]. The DS1 data set consists of 26,733 instances, 10 attributes, and two classes; the Waveform data set consists of 5000 instances, 21 attributes, and three classes. A summary of the data sets as used in this paper is given in Table 1.
4.2. The Classification Performance
In this section, the performance of the proposed method is evaluated against linear and nonlinear feature extraction algorithms. Three different methods are employed to evaluate classification performance. The first (M1) is a sliding window model, shown in Figure 3; it has the traditional structure used by incremental learning approaches for data streams in the literature. The second (M2) uses only a fixed number of initial data stream samples for the initial model, and each newly arriving sample is used only for classification. This method has a structure similar to batch learning; therefore, only traditional PCA is performed with M2, as in Table 2. The last method (M3) is based on adding each new sample to the initial model without eliminating old and outdated samples; thus, the number of samples in the initial model increases as time goes by. The usage of the model is shown in Figure 4, and a sketch of the three methods is given below.
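As a rough Python sketch of how the three evaluation methods maintain their reference sets (extract and classify are placeholder interfaces, not functions from this paper):

```python
from collections import deque

W = 1000                          # window size used in this paper

def run_m1(stream, extract, classify):
    """M1: sliding window. Each new sample is classified against the most
    recent W samples; appending then drops the oldest sample automatically."""
    window = deque(maxlen=W)
    for i, (x, y) in enumerate(stream):
        f = extract(x)
        if i >= W:
            yield classify(window, f)
        window.append((f, y))

# M2 freezes the reference set after the first W samples and only
# classifies afterwards; M3 replaces the deque with an unbounded list,
# so the reference set grows with every arriving sample.
```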
Table 2, Table 3 and Table 4 show the accuracy rates and NDSCC scores for traditional PCA, IPCA-Li, CCFIPCA, CIKPCA, and the proposed method. CIKPCA uses the Waveform and Optical-digit data sets in its original paper; therefore, Table 4 includes results only for Waveform and Optical-digit.
In this experiment, the sliding window size is set to 1000 according to experimental results, so the first 1000 samples of each data set are used for the initial model. After the initial phase, the incoming samples are used in the test stage and processed one by one.
Table 2 and Table 3 demonstrate that the proposed DCT-based approach obtains the best accuracy rates and NDSCC scores for almost all data sets. On the one hand, the traditional PCA algorithm reaches a higher result than the proposed approach only for the Poker and DS1 data sets; the fact that the M2 method has the same structure as PCA, while the proposed approach is designed as an incremental learning approach, causes PCA to be more successful there. On the other hand, the proposed approach achieves a better F-score. These results demonstrate how precise the proposed approach is, as well as how robust it is in comparison with traditional PCA. ForestCovType is a huge and sparse data set, and it is reported in [58] that processing it is difficult. Nevertheless, the proposed DCT-based approach achieves the best results compared with the other three methods. All approaches have almost the same Acc and NDSCC scores for data set DS1; DS1 is easier to process than the others because it is synthetic, and the Acc and NDSCC scores reach a high percentage for all four approaches. Even so, the proposed approach achieves the best results among the four. The reason for the proposed approach's success is that it extracts features in the frequency domain: the DCT extracts the most representative and distinctive features, and the frequency-domain representation of data streams provides better discrimination between classes. Consequently, the proposed approach obtains significant results for all data sets. Moreover, PCA, IPCA-Li, and CCFIPCA are linear transformation techniques, and the distribution and complexity of the data sets are not suitable for transforming the data stream linearly.
CIKPCA is a nonlinear data stream feature extraction algorithm. According to Table 4, CIKPCA obtains higher results in kernel space for Waveform and Optical-digit; the positive effect of the kernel space can be seen in Table 4. However, the proposed approach is more successful than CIKPCA, which demonstrates that processing in the frequency domain is more efficient than processing in kernel space. Moreover, the concept of a data stream can change over time, and the chosen kernel type can become ineffective due to the concept drift problem; deciding on the best kernel type is a challenge for data streams. In contrast, concept drift does not affect the proposed DCT-based data stream feature extraction approach.
4.3. The Analysis of the Variation of DCT Coefficients
The effects of dimension reduction and of varying the DCT coefficients are examined in this experiment. The experiment is carried out by taking the first or last N features from the coefficient vector, called an interval. The intervals are determined experimentally. The first interval corresponds to the whole DCT coefficient vector; afterward, the interval length is decreased one by one until half of the vector length remains. To examine the effect of the last parts (high-frequency components), each new interval is constructed by removing the last element of the previous interval. To observe the effect of the first part (low-frequency components), the first elements of the previous interval are removed instead. In both cases, the NDSCC score is used as the evaluation metric, and the scores are obtained by performing M1 on the five data sets. A sketch of the interval construction is given below.
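A compact sketch of this interval construction (coeffs stands in for one full DCT coefficient vector; the variable names are illustrative):

```python
import numpy as np

coeffs = np.arange(54.0)   # stand-in for a full DCT coefficient vector
N = len(coeffs)

# Shrink by dropping the LAST element each time, down to the first half:
# probes the contribution of the high-frequency tail (ends at M1-FH).
fh_cases = [coeffs[:n] for n in range(N, N // 2 - 1, -1)]

# Shrink by dropping the FIRST element each time, down to the last half:
# probes the contribution of the low-frequency head (ends at M1-LH).
lh_cases = [coeffs[s:] for s in range(0, N - N // 2 + 1)]
```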
Figure 5 shows the NDSCC scores for the five data sets obtained by performing M1. M1-LH and M1-FH refer to the results of the M1 method for the last half and first half, respectively. Based on Figure 5, some interesting observations can be made. When the length of the interval is reduced, the NDSCC scores tend to decrease. However, the decrease is not monotonic for all data sets: for example, case 6 achieves higher performance than case 7 for the ElecNormNews data set on M1-LH, and the same situation can be seen in the results of the Poker and DS1 data sets in Figure 5. The reason is that not all features contribute in the same way; some features affect the results negatively.
Furthermore, the influence of the coefficients tends to exhibit the same behavior for M1-LH and M1-FH on all data sets. When the interval length is reduced, the NDSCC scores tend to decrease as expected, and the performance of the two intervals diverges. With the reduction of the interval length, the last coefficients of the DCT vector tend to be more effective for the ForestCovType and ElecNormNews data sets, whereas the first elements of the DCT vector become effective for the Poker and DS1 data sets. The NDSCC scores vary across interval levels for all data sets. This demonstrates the necessity of automatic feature selection for best representing the characteristics of each data set.
4.4. The Analysis of the Automatic Feature Selection
In this section, automatic feature selection is evaluated. The purpose is to improve the results by selecting the most representative features and discarding ineffective features from the feature set. PSO and APSO algorithms are implemented to perform automatic feature selection. The objective function of PSO and APSO is described in Section 3.1.3.2; the k value in the objective function is set to 3 according to the experimental results. In this experiment, only the M1 method is used, and a sketch of the selection loop is given below.
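The following is a generic binary-PSO sketch of such a selection loop (the particle encoding, parameters, and the fitness wrapper are illustrative assumptions; the actual objective function is the one defined in Section 3.1.3.2, where the k = 3 above suggests a 3-NN evaluation term):

```python
import numpy as np

def binary_pso(fitness, d, n_particles=20, iters=50, w=0.7, c1=1.5, c2=1.5):
    """Generic binary PSO for feature selection: each particle is a 0/1 mask
    over the d DCT coefficients; fitness(mask) returns a score to MAXIMIZE
    (e.g., 3-NN accuracy on the initial-phase samples)."""
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, (n_particles, d)).astype(float)   # positions
    V = rng.uniform(-1, 1, (n_particles, d))                 # velocities
    pbest, pbest_f = X.copy(), np.array([fitness(x) for x in X])
    g = pbest[pbest_f.argmax()].copy()                       # global best
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, d))
        V = w * V + c1 * r1 * (pbest - X) + c2 * r2 * (g - X)
        # Sigmoid transfer function maps velocities to bit probabilities.
        X = (rng.random((n_particles, d)) < 1 / (1 + np.exp(-V))).astype(float)
        f = np.array([fitness(x) for x in X])
        improved = f > pbest_f
        pbest[improved], pbest_f[improved] = X[improved], f[improved]
        g = pbest[pbest_f.argmax()].copy()
    return g.astype(bool)      # selected-feature mask

# Usage (hypothetical fitness): mask = binary_pso(my_3nn_accuracy, d=54)
```

APSO differs mainly in how the velocity update is driven (e.g., relying on the global best); the selection loop has the same shape.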
Figure 6 shows the comparison of DCT, PSO-DCT, and APSO-DCT. It can be observed from Figure 6 that there is only a slight difference between PSO and APSO; both automatic feature selection methods can select the best features from the feature set. However, when the number of selected features is reduced, the performance of PSO-DCT increases on the ElecNormNews data set, while APSO-DCT performs better in all cases for the Poker data set. Although using only the global best value in APSO is beneficial for the Poker data set in all cases, it does not affect ElecNormNews positively in all cases. Furthermore, according to Figure 6, both automatic feature selection methods achieve higher results than the experimentally determined feature selection based on DCT. This is because the PSO- and APSO-based methods select features automatically by evaluating the structure of the data sets. However, these two methods have the disadvantage of a longer learning time than plain DCT.
The last experiment compares the accuracy and average learning time of the proposed method and IPCA-Ozawa. IPCA-Ozawa increases the size of the eigenvector space with every incoming sample; by the end of the process, the eigenvector space is considerably larger than for the first sample, so the algorithm runs in gradually increasing time. Therefore, the APSO-DCT method is compared with IPCA-Ozawa to show that it performs faster even when the DCT algorithm carries an additional workload.
Table 5 reports the Acc and learning time of the algorithms. It can be observed that the DCT-based method with the additional load (APSO-DCT) runs in a shorter average learning time than IPCA-Ozawa; moreover, the proposed approach obtains higher accuracy rates. It can be seen from Table 5 that the proposed approach is the better method in comparison with IPCA-Ozawa. The gradually increasing runtime of IPCA-Ozawa is not preferable in a data stream environment.
Finally, the time complexity of the proposed approach is lower than that of IPCA, as summarized in Table 5. In IPCA algorithms, the covariance matrix must first be recomputed, and the eigenvalues and eigenvectors are then calculated: supposing that the data stream has N attributes, an N × N covariance matrix is first produced, and then the eigenvalues and eigenvectors are computed. The proposed approach requires only a 1-D DCT for feature extraction. Furthermore, the automatic feature selection step based on PSO/APSO increases the computation only slightly, because PSO/APSO is performed only in the initial phase, on a small number of samples, to determine an efficient feature set, and is never repeated. It therefore does not add high computational complexity to the algorithm.
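In rough per-sample terms (standard complexity figures for the operations named above, not measured values from Table 5):

```latex
\underbrace{O(N^2)}_{\text{covariance update}}
\;+\;
\underbrace{O(N^3)}_{\text{eigendecomposition}}
\quad \text{(IPCA)}
\qquad \text{vs.} \qquad
\underbrace{O(N \log N)}_{\text{fast 1-D DCT}}
\quad \text{(proposed approach)}
```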
5. Conclusions
Incremental feature extraction approaches are used to facilitate feature extraction from large-scale streaming data, with the aim of addressing the needs of data streams. The most popular incremental feature extraction algorithms are the incremental versions of PCA. However, IPCA algorithms have problems that make them challenging to use for data streams. In this paper, a DCT- and swarm intelligence-based feature extraction approach is presented for data streams as an alternative incremental feature extraction algorithm. The proposed approach has a simple structure that is readily applicable to data streams. The objective of this study is to demonstrate the superiority of the DCT algorithm over PCA and IPCA algorithms for data stream feature extraction. The proposed approach is compared with traditional PCA, IPCA-Li, CCFIPCA, IPCA-Ozawa, and CIKPCA on six real and synthetic data sets. The experimental results confirm the success of the DCT-based feature extraction approach and its advantage over PCA and its incremental versions. Moreover, the DCT-based approach has lower computational cost and time complexity, so DCT requires less additional workload than the PCA and IPCA algorithms. Additionally, the performance of the proposed approach with automatic feature selection is examined; the results confirm the positive effect of automatic feature selection on data stream feature extraction. Feature selection that considers the structure of the data set therefore plays an essential role in obtaining higher classification accuracy. Furthermore, although the proposed approach uses an automatic selection mechanism that increases the learning time in the initial phase, its learning time is still shorter than that of Ozawa's IPCA method.
In this study, the number of data stream instances used in the learning phase was determined experimentally and kept constant for all data sets. As future work, the number of samples used in the learning phase will be determined dynamically according to the structure of the data set.