Feature Adaptive and Cyclic Dynamic Learning Based on Infinite Term Memory Extreme Learning Machine

Online learning is the capability of a machine-learning model to update its knowledge without retraining the system when new, labeled data become available. Good online learning performance depends on the ability to handle changing features and to preserve existing knowledge for future use. Such requirements arise in real-world applications such as Wi-Fi localization and intrusion detection. In this study, we developed a cyclic dynamic generator (CDG), which we used to convert an existing dataset into a time series dataset with cyclic and changing features. Furthermore, we developed the infinite-term memory online sequential extreme learning machine (ITM-OSELM) on the basis of the feature-adaptive online sequential extreme learning machine (FA-OSELM) transfer learning, and incorporated an external memory to preserve old knowledge. This model was compared to the FA-OSELM and the online sequential extreme learning machine (OSELM) on data generated by the CDG from three datasets: UJIIndoorLoc, TampereU, and KDD 99. A statistical t-test corroborates that the ITM-OSELM is superior to the FA-OSELM and OSELM. In addition, the accuracy of ITM-OSELM was 91.69%, while the accuracy of FA-OSELM and OSELM was 24.39% and 19.56%, respectively.


Introduction
Machine learning has a wide range of applications in the era of artificial intelligence. Massive data generation and storage facilitate the extraction and transfer of useful knowledge from data, enabling machines to become as smart as humans. Examples of machine learning-based extraordinary technologies include autonomous cars [1], biometric based human identification [2], time series forecasting in different domains [3], security [4], and computer vision [5].
A neural network is a mathematical structure for gaining and storing knowledge, which makes it relevant to machine learning. Neural networks can be used for both prediction and classification. The classical approach to training neural networks is to provide labeled data and to use training algorithms such as backpropagation [6] and extreme learning machines [7]. For some applications,

Literature Review
In the previous section, we highlighted the problem and emphasized transfer learning. The transfer learning process involves transferring knowledge from one learner to another, and is normally conducted between two domains: one in which data can be collected easily, and another in which data collection is difficult [12]. However, transfer learning is also valuable when performed within the same domain, where a new learner is required and knowledge is transferred to it from the previous learner.
In Ref. [13], the authors employed an online approach to quickly adapt a "black box" classifier to a new test data set, without retraining the classifier or evaluating the original optimization criterion. The class is determined by applying a threshold to the continuous output of the original classifier. In addition, points near the original boundary are reclassified by a Gaussian process regression scheme. This general procedure can be employed in the context of a classifier cascade, where it surpassed state-of-the-art face detection results on a standard data set.
The study presented in Ref. [9] focused on transfer learning for the extreme learning machine (ELM) and put forward the FA-OSELM algorithm, which transfers the original model to a new one by employing a small amount of data with the new features. This approach evaluates the new model's suitability for the new feature dimensions. The experiments showed that the FA-OSELM is highly accurate even with a small amount of new data, and it is considered an efficient approach that allows practical lifelong indoor localization. Transfer learning has been integrated in various fields, such as sensing, computer vision, and estimation. However, these works did not focus on enhancing the classifier's prediction in online mode by considering the classifier learned from the previous block of data. Cyclic dynamic data are a useful application for such a problem, because cycles occur in sequential data and previous knowledge can therefore be employed to forecast the coming classes.
The work conducted in Ref. [14] describes a novel type of extreme learning machine, the ITM-OSELM, which preserves older knowledge by using external memory and transfer learning. In that study, the authors applied the concept to Wi-Fi localization and showed a good performance improvement in the context of cyclic dynamics and feature adaptability of Wi-Fi navigation. However, the approach has not been generalized to other cyclic dynamic scenarios in the machine learning field, and its applicability has not been verified for various types of cyclic dynamics.
All the reviewed articles aimed at providing their models with some type of transfer learning. However, treating transfer learning as a requirement of incremental learning approaches, and enabling the restoration of knowledge gained from older chunks of data in cyclic dynamic scenarios, has not been tackled explicitly in the literature. Such models are needed in various real-life applications, for example Wi-Fi navigation, where people frequently revisit previously visited places, and intrusion detection systems, where newer attacks follow behavior similar to that of older attacks with the addition of new features.
The goal of this article is to develop a generalized ITM-OSELM model and to build a simulator for cyclic dynamic. We then used our model to both validate and evaluate ITM-OSELM performance in feature adaptive and cyclic dynamic situations.

Problem Formulation
Given the sequential data $x_t = (x_{it})$, $i \in \{1, 2, \dots, n\}$, $t = 1, 2, \dots, T$, and their corresponding labels $y_t \in \{1, 2, \dots, C\}$, the dimension of $x_t$ is fixed when $y_t$ is fixed, and $y_t$ repeats itself through time. The learning transfer (LT) model transfers knowledge from classifier $S_{t_1}$ to classifier $S_{t_2}$ when $d(x_{t_1}) \neq d(x_{t_2})$. LT is assumed to have been called at moments $t_2$ and $t_3$ due to dimensionality change, where $d(x_{t_1}) \neq d(x_{t_2})$ and $d(x_{t_1}) = d(x_{t_3})$. The following steps are performed to optimize the performance of $S_{t_3}$:

1. Use LT to transfer knowledge from $S_{t_2}$ to $S_{t_3}$.
2. Use external memory to add knowledge of $S_{t_1}$ to $S_{t_3}$.
The first step is responsible for maintaining the knowledge accumulated from training, while the second step is responsible for restoring knowledge lost through the disappearance of old features.
Figure 1 shows the classifier and depicts its evolution according to the active features in vector $x_t$. The active features at moment $t_1$ are 1, 2, and 3, which remain active for time $T_1$. At time $t_2 = t_1 + T_1$, the active features change to 1, 2, 4, and 5. This requires the classifier to change from $S_{t_1}$, whose input vector is (1, 2, and 3), to $S_{t_2}$, whose input vector is (1, 2, 4, and 5). Learning transfer was applied to maintain the knowledge gained from $S_{t_1}$, which ensured the transfer of knowledge related to inputs 1 and 2. Features 4 and 5 were new; hence, new data taught the classifier about them. Feature 3 did not undergo learning transfer because it was no longer active, and, therefore, external memory (EM) was used to preserve it. Features (1, 2, 4, and 5) were assumed active during time $T_2$. At $t_3 = t_2 + T_2$, features 4 and 5 were deactivated, whereas feature 3 was reactivated. Thus, a new classifier was built on the basis of the following steps:

1. Perform transfer learning (TL) to move knowledge related to inputs 1 and 2 from $S_{t_2}$ to $S_{t_3}$.
2. Use external memory to move knowledge related to input 3 from EM to $S_{t_3}$.
Classifier $S_t$ and external memory $EM_t$ are Markov models; they are represented by the state flow diagram depicted in Figure 2, where the feature change (FC) event moves the system from one state to another.

Methodology
This section presents the methodology followed in this article. Cyclic dynamic time series generation is provided in Section 4.1, while the transfer learning model is presented in Section 4.2. The external memory model for the ITM-OSELM is discussed in Section 4.3. The ITM-OSELM algorithm and its evaluation analysis are presented in Sections 4.4 and 4.5, respectively.

Generating Cyclic Dynamic Time Series Data
Each record in dataset $D$ is assumed to be $d_k = (x_{ik}, y_k)$, $k = 1, 2, \dots, N$, $i \in \{1, 2, \dots, n\}$, $y_k \in \{1, 2, \dots, C\}$. The goal was to build a mathematical model that converts dataset $D$ into a time series dataset $D_t = (x_{it}, y_t)$, $i \in \{1, 2, \dots, n\}$, $t = 1, 2, \dots, N$, of cyclic nature. Cyclic means that label $y_t$ is repeated every period $T$. The dimension of $x_{it}$ is assumed to change, constrained by the condition that the dimension of $x_t$ is fixed when $y_t$ is fixed. The time series of cyclic dynamics is generated by the model elaborated in the pseudocode of Algorithm 1, where: $D$ denotes the original dataset; $y_{max}$ denotes the maximum code of the classes in $D$; $y_t$ denotes the class that is extracted from $D$ at moment $t$; $y_{t,i}$ denotes the class $y_t$ at moment $t$, which is repeated $R$ times in the time series; and $L$ denotes the number of distinct samples in the time series.
Algorithm 1 exhibits the generation pseudocode. The nature of the cyclic dynamics can be changed through the value T, which represents the period, and the value R, which represents the number of records in each class. In the general form of generation, R changes randomly from one class to another. The pseudocode starts with class generation using the sin function, whose output is then quantized according to the number of classes in the dataset. The corresponding active features are obtained from the function Extract(). Finally, the record is appended to the time series data.
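The label-generation step described above can be sketched in a few lines. This is a minimal illustration and not the authors' Algorithm 1: the function name `generate_cyclic_labels`, the exact quantization rule, and the use of a fixed R are assumptions of this sketch, and the feature-extraction step performed by Extract() is omitted.

```python
import math


def generate_cyclic_labels(num_classes, period, repeats, length):
    """Return `length` cyclic class labels with values in 1..num_classes.

    A sine wave with `period` steps per cycle is rescaled to [0, 1] and
    quantized into `num_classes` levels; every quantized label is emitted
    `repeats` times (the value R of the text).
    """
    labels = []
    step = 0
    while len(labels) < length:
        # Reduce the step modulo the period so every cycle is bit-identical.
        phase = (step % period) / period
        unit = (math.sin(2 * math.pi * phase) + 1) / 2   # in [0, 1]
        label = min(int(unit * num_classes) + 1, num_classes)
        labels.extend([label] * repeats)
        step += 1
    return labels[:length]


# Two full cycles: 8 steps per cycle, 3 records per step.
series = generate_cyclic_labels(num_classes=5, period=8, repeats=3, length=48)
```

With these settings one cycle spans 24 records, so the second half of `series` repeats the first half exactly.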

Learning Model Transfer
The FA-OSELM is adopted in this study as the transfer learning model, without loss of generality. This model, put forward in Ref. [9], transfers weight values from the old classifier to the new one. Assuming that both classifiers have the same number of hidden nodes (L), the FA-OSELM provides an input-weight transfer matrix $P$ and an input-weight supplement vector $Q_i$, which convert the old weights $a_i$ to the new weights $a'_i$ when the feature dimension changes from $m_t$ to $m_{t+1}$:

$$a'_i = a_i \cdot P + Q_i, \quad i = 1, \dots, L,$$

where $P$ is an $m_t \times m_{t+1}$ matrix and $Q_i$ is a $1 \times m_{t+1}$ vector. Matrix $P$ must follow these rules:
• Each row contains one '1'; the remaining entries are all '0';
• Each column contains at most one '1'; the remaining entries are all '0';
• $P_{ij} = 1$ implies that the ith dimension of the original feature vector becomes the jth dimension of the new feature vector after the feature dimension has changed.
When the feature dimension increases, $Q_i$ acts as a supplement: it adds the corresponding input weights for the newly added features. $Q_i$ follows these rules:
• When the feature dimension decreases, $Q_i$ is an all-zero vector, because no new features need additional input weights.
• When the feature dimension increases, if the jth item of $a'_i$ represents a new feature, then the jth item of $Q_i$ is generated randomly according to the distribution of $a_i$.
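The rules for P and Q_i can be illustrated with a small sketch, assuming the conversion takes the form a'_i = a_i P + Q_i for each hidden node. The function names are hypothetical, features are matched by ID, and new weights are drawn from a Gaussian fitted to a_i; the Gaussian is an assumption of this sketch, since the text only says the draw follows the distribution of a_i. Rows of P for features that disappear are left all-zero here, which is also a choice of this sketch.

```python
import random
import statistics


def build_transfer_matrix(old_ids, new_ids):
    """P[i][j] = 1 when old feature i becomes new feature j, else 0."""
    column = {fid: j for j, fid in enumerate(new_ids)}
    P = [[0] * len(new_ids) for _ in old_ids]
    for i, fid in enumerate(old_ids):
        if fid in column:
            P[i][column[fid]] = 1
    return P


def transfer_weights(a_i, old_ids, new_ids, rng=random):
    """Return a'_i = a_i P + Q_i for the input weights of one hidden node."""
    P = build_transfer_matrix(old_ids, new_ids)
    # a_i P: carry over the weights of the features that stay active.
    carried = [sum(a_i[i] * P[i][j] for i in range(len(old_ids)))
               for j in range(len(new_ids))]
    # Q_i: zero for surviving features, a random draw for new features.
    mu = statistics.mean(a_i)
    sigma = statistics.pstdev(a_i)
    old = set(old_ids)
    Q = [0.0 if fid in old else rng.gauss(mu, sigma) for fid in new_ids]
    return [c + q for c, q in zip(carried, Q)]


# Features (1, 2, 3) change to (1, 2, 4, 5): weights of 1 and 2 are kept,
# feature 3 is dropped, and features 4 and 5 receive random weights.
new_weights = transfer_weights([0.1, 0.2, 0.3], [1, 2, 3], [1, 2, 4, 5],
                               random.Random(0))
```

When the feature set only shrinks, every entry of Q is zero and the result is simply the surviving subset of the old weights.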

External Memory Model for ITM-OSELM
The external memory (EM) in ITM-OSELM stores the weights associated with features that become disabled at a certain time t. If these features are reactivated, EM supplies the classifier with their stored weights, so that the classifier starts with initial knowledge and is qualified for prediction. This knowledge complements what the classifier has already gained from TL: TL draws on the previous classifier, which holds no knowledge about the newly active features. The EM structure is shown in Figure 3. EM is a matrix with N rows, where N is the total number of features present in the system, and L columns, where L is the number of hidden neurons of the classifier. The memory is updated only when the set of active features changes, by storing the weight values of the features that become inactive. It is read on the same occasions, to initialize the classifier's weights for the features that become active.

Figure 3. ITM-OSELM network and its two sources of updates: external memory EM and transfer learning TL.
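A minimal sketch of such a memory is given below, using the feature scenario from the problem formulation (features 1, 2, 3, then 1, 2, 4, 5, then 1, 2, 3 again). The class name `ExternalMemory` and its methods `save` and `restore` are hypothetical names chosen for illustration, and the `stored` set that tracks which rows hold saved weights is an assumption of this sketch.

```python
class ExternalMemory:
    """N x L store of input weights, one row per feature ID (0..N-1)."""

    def __init__(self, num_features, num_hidden):
        self.rows = [[0.0] * num_hidden for _ in range(num_features)]
        self.stored = set()   # feature IDs whose weights have been saved

    def save(self, weights, feature_ids, deactivated):
        """Store the weight rows of features that have just become inactive.

        weights[k] is the length-L weight row of feature_ids[k].
        """
        for k, fid in enumerate(feature_ids):
            if fid in deactivated:
                self.rows[fid] = list(weights[k])
                self.stored.add(fid)

    def restore(self, weights, feature_ids, activated):
        """Overwrite the rows of reactivated features that EM knows about."""
        for k, fid in enumerate(feature_ids):
            if fid in activated and fid in self.stored:
                weights[k] = list(self.rows[fid])
        return weights


em = ExternalMemory(num_features=6, num_hidden=2)
# At t2, feature 3 is deactivated: keep its weights.
em.save([[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]], [1, 2, 3], deactivated={3})
# At t3, feature 3 returns: its row is initialized from memory.
weights_t3 = em.restore([[0.1, 0.1], [0.2, 0.2], [0.0, 0.0]], [1, 2, 3],
                        activated={3})
```

Features that were never stored keep whatever initialization the classifier already has, which matches the idea that EM complements rather than replaces transfer learning.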

ITM-OSELM Algorithm
This section presents the ITM-OSELM algorithm. Data are presented to the classifier sequentially in chunks, which are unlabeled when they arrive. The labels become available after prediction, which permits the sequential training of the classifier. This is normal OSELM training, except that the weight initialization is not fully random. Two sources of information are used: the TL, which is responsible for transferring the weights from previous classifiers; and the EM, which is responsible for providing the weights from the external memory. The old weights must be stored in the EM once a new classifier is created to replace an old classifier. For computational time analysis, the time needed to process each chunk of data is calculated. Algorithm 2 exhibits the ITM-OSELM algorithm. Function checkActive() takes two vectors of feature IDs, the first from the previous moment and the second from the current moment, and compares them to determine which features have become active or inactive. The role of EMUpdateEM() is to take the current EM and the old active features, and to save their corresponding weights in the EM. The role of updateNewActive() is to take the new memory and the new active features, and to restore their weights from the memory.
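The comparison performed by checkActive() amounts to two set differences. The sketch below is an illustration under that reading; the Python name `check_active` and its return order are assumptions, not the authors' implementation.

```python
def check_active(previous_ids, current_ids):
    """Compare two vectors of feature IDs.

    Returns (newly_active, newly_inactive): the features present only in
    the current moment, and the features present only in the previous one.
    """
    previous, current = set(previous_ids), set(current_ids)
    newly_active = sorted(current - previous)
    newly_inactive = sorted(previous - current)
    return newly_active, newly_inactive


# Features (1, 2, 3) change to (1, 2, 4, 5).
print(check_active([1, 2, 3], [1, 2, 4, 5]))   # ([4, 5], [3])
```

The newly inactive features are the ones whose weights would be saved to the EM, and the newly active ones are candidates for restoration from it.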

Evaluation Analysis of ITM-OSELM
This section analyzes the ITM-OSELM and examines the relationship between the characteristics of the time series, such as the number of features and its change rate, the number of classes, and the period of the signal, on one side, and the classifier accuracy on the other side.
Typical evaluation measures for classification were used: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). Each class in turn was selected as the positive class, while the other classes were regarded as negative. TP, TN, FP, and FN were then calculated as the averages of the per-class values over all classes. Furthermore, accuracy, precision, and recall were calculated on the basis of these measures.
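The one-versus-rest averaging can be sketched as follows. The function names are hypothetical, and the accuracy, precision, and recall formulas are the standard definitions, which this sketch assumes are the ones intended.

```python
def one_vs_rest_counts(y_true, y_pred, positive):
    """Count TP, FP, TN, FN when `positive` is treated as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def averaged_measures(y_true, y_pred, classes):
    """Average the four counts over all classes, then derive the measures."""
    sums = [0, 0, 0, 0]
    for c in classes:
        for k, count in enumerate(one_vs_rest_counts(y_true, y_pred, c)):
            sums[k] += count
    tp, fp, tn, fn = (s / len(classes) for s in sums)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


measures = averaged_measures([1, 2, 3, 1], [1, 2, 1, 1], classes=[1, 2, 3])
```

In this toy example one sample of class 3 is misclassified as class 1, which appears once as a false positive for class 1 and once as a false negative for class 3.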

Experimental Work and Results
Cyclic dynamic generator (CDG) results were generated for three datasets: two of them, TampereU and UJIIndoorLoc, belong to Wi-Fi localization, and the third, KDD99, belongs to another machine learning area, cloud security.
This experiment aimed to generalize the cyclic dynamic concept to other machine-learning fields, where the classifier is required to remember old knowledge that was already gained but might not be preserved due to its dynamic features. The data were repeated for three cycles, and the number of active features in each cycle was not changed, in order to allow investigation of performance changes in the cyclic dynamic situation. Furthermore, accuracy and time results were generated. Section 5.1 provides the datasets description. Model characterization with respect to the number of neurons and regularization factor, as well as the accuracy results are presented in Sections 5.2 and 5.3, respectively.

Datasets Description
The KDD99 dataset [15] was offered by the Knowledge Discovery and Data Mining (KDD) competition in 1999 and was provided by Lee and Stolfo [16]. Pfahringer [17] won first place in the competition by employing a mixture of boosting and bagging, and this work has since been considered a benchmark by researchers. The data concern the area of security, specifically intrusion detection, and are therefore considered crucial in machine learning. Output classes are divided into five categories: user to root (U2R), denial of service (DOS), remote to local (R2L), probe, and normal. The dataset includes 24 attack types in training and 14 additional attack types in testing, for a total of 38 attacks. In theory, the 14 new attacks test the capability of an intrusion detection system (IDS) to generalize to unknown attacks, and machine learning-based IDSs barely identify them [18].
From Wi-Fi-based localization, two additional datasets, TampereU and UJIIndoorLoc, were employed. The UJIIndoorLoc database covers three buildings of Universitat Jaume I, each with at least four levels, over an area of almost 110,000 m² [19]. The database can be used for classification, such as identifying the building and the actual floor, and for regression, such as determining the actual latitude and longitude. UJIIndoorLoc was built in 2013 using almost 25 Android devices and 20 distinct users. The database includes 19,937 training or reference records and 1111 validation or test records. Its 529 attributes contain the Wi-Fi fingerprints and the coordinates at which they were collected.
The TampereU dataset is an indoor localization database employed for evaluating indoor positioning systems (IPSs) that rely on WLAN/Wi-Fi fingerprints. Lohan and Talvitie developed the database to test indoor localization techniques [20]. It covers two buildings of the Tampere University of Technology, with three and four levels. The database includes 1478 reference or training records about the first building, 489 test attributes, and 312 attributes pertaining to the second building. It also stores the Wi-Fi fingerprints (309 wireless access points, WAPs) and coordinates (longitude, latitude, and height).

Characterization
A characterization model is required for each of the classifiers and datasets. It aims to identify the best model settings in terms of the number of hidden neurons (L) and the value of the regularization factor (C). To this end, a mesh surface was generated in which every point represents the testing accuracy for a certain pair of values of L and C. Figure 4 exhibits the mesh generated from each of the three datasets. The aim was to select the point with the best accuracy. To extract the values of C and L, the mesh was mapped to the C versus accuracy curve and the L versus accuracy curve, respectively: C was selected on the basis of the relationship between accuracy and C, and L on the basis of the relationship between accuracy and L. The curves for the three datasets are shown in Figure 5, while the values of L and C that achieved the best accuracy for each of the three models are given in Table 1.

Table 1. Selected values for regularization factor (C) and number of hidden neurons (L), and their corresponding accuracy for the three datasets.
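The selection procedure can be sketched as a search over a dictionary that maps (L, C) pairs to testing accuracy. The function names are hypothetical, the accuracy values are made up for illustration and are not results from the paper, and projecting the mesh onto one axis by taking the maximum over the other axis is an assumption of this sketch.

```python
def best_setting(mesh):
    """Return (L, C, accuracy) for the highest-accuracy point of the mesh.

    `mesh` maps (L, C) pairs to testing accuracy.
    """
    (num_hidden, reg), acc = max(mesh.items(), key=lambda item: item[1])
    return num_hidden, reg, acc


def accuracy_curve(mesh, axis):
    """Project the mesh onto one axis: 0 for L, 1 for C.

    For each value on that axis, keep the best accuracy found over the
    other axis.
    """
    curve = {}
    for key, acc in mesh.items():
        value = key[axis]
        curve[value] = max(curve.get(value, acc), acc)
    return dict(sorted(curve.items()))


# Illustrative numbers only, not results from the paper.
mesh = {
    (100, 0.1): 0.61, (100, 1.0): 0.64,
    (200, 0.1): 0.70, (200, 1.0): 0.68,
}
print(best_setting(mesh))        # (200, 0.1, 0.7)
print(accuracy_curve(mesh, 0))   # {100: 0.64, 200: 0.7}
```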

Accuracy Results
Accuracy was generated for the model developed in this study, ITM-OSELM, and for the two benchmarks, FA-OSELM and OSELM. Figures 6-9 present the detailed accuracy with respect to chunks and the overall accuracy in each cycle for the three datasets. Each point in a curve indicates the accuracy of one chunk in the sequential data. The chunks arrive sequentially because the data represent a time series.
The curves show that, in the initial cycle, the ITM-OSELM, FA-OSELM, and OSELM performed similarly because the models had no previous knowledge to remember. In the second and third cycles, the ITM-OSELM was superior to the others, which is attributed to differences in their knowledge preservation. FA-OSELM has transfer learning capability; however, its transfer learning is of Markov type, which means that it remembers only the previous state and brings its values to the current one. ITM-OSELM, in contrast, can restore older knowledge whenever necessary. FA-OSELM and OSELM, on the other hand, performed similarly regardless of how often the cycle was repeated.
Figure 6. Accuracy change with respect to data chunks for three cycles using the TampereU dataset.
For further elaboration, Table 2 provides the numerical values of the accuracies in each cycle for each of the three models. ITM-OSELM achieved the highest accuracies for the three datasets in the second and third cycles. Its best accuracy was achieved on UJIIndoorLoc, where the overall accuracy in the third cycle was 91.69%, while the accuracy of FA-OSELM and OSELM in the third cycle was 24.39% and 19.56%, respectively. This emphasizes the superiority of ITM-OSELM over FA-OSELM and OSELM. We also observe the increase in the learning performance of ITM-OSELM, whose accuracy rose from 16.48% in the first cycle to 88.36% in the second and 91.69% in the third cycle.
Table 3 quantifies the learning capability of ITM-OSELM and compares it to that of FA-OSELM and OSELM from one cycle to another. This comparison includes the learning improvement percentage across all cycles. The highest learning percentage from one cycle to another was achieved by ITM-OSELM in all three datasets; thus, ITM-OSELM was the best in terms of gaining knowledge from one cycle to another. The other two models showed negative learning rates; hence, they were not capable of carrying knowledge from one cycle to another. Moreover, the ITM-OSELM achieved a larger learning improvement from Cycle 1 to Cycle 2 (182.52%) than from Cycle 2 to Cycle 3 (0.54%); this was not due to reduced capability but to performance saturation, as accuracy reached ~100% in Cycle 3.
In order to validate our hypothesis of the superiority of ITM-OSELM over OSELM and FA-OSELM, we adopted a t-test with a significance level of 0.05 for rejecting the null hypothesis H0 of no statistical difference in performance. In Table 4, the values of the t-test in all cells of cycle 2 and cycle 3 are lower than 0.05, which means that ITM-OSELM outperforms the two baseline approaches, OSELM and FA-OSELM.
This indicates that the superiority of the model stems from the transfer learning and the external memory, which enabled ITM-OSELM to restore old knowledge in both cycle 2 and cycle 3, whereas OSELM and FA-OSELM could not do so. In addition to accuracy, the models were evaluated according to standard machine learning evaluation measures. This evaluation was performed by selecting a class and treating it as positive, and then checking the classifier's output for each sample, which was either positive or negative. Figures 10-12 display the TP, TN, FP, and FN results. For ITM-OSELM, the true measures show an increasing trend from one cycle to another, whereas the false measures show a decreasing trend. This pattern does not apply to FA-OSELM and OSELM, which is expected considering the accuracy results. An interesting observation is that the first cycle yields nearly similar values of TP, TN, FP, and FN for all three models, while the deviation between ITM-OSELM and the other two models occurs in both the second and third cycles, which supports its capability of building knowledge and achieving a higher rate of correct predictions.

Figure 10. Classification measures with respect to cycles for the three models with the TampereU dataset.

Figure 11. Classification measures with respect to cycles for the three models with the UJIIndoorLoc dataset.

Conclusions and Future Work
Cyclic dynamics is a common type of dynamics that occurs in time series data. Typical machine learning models are not designed to exploit learning within cyclic dynamic scenarios. In this article, we made two contributions. First, we developed a simulator that converts datasets into time series data with changeable feature numbers, or adaptive features, and with repeating cycles of output classes. Second, we developed a novel variant of the OSELM, called the ITM-OSELM, to deal with cyclic dynamic scenarios and time series. The ITM-OSELM combines two parts: a transfer learning part, which is responsible for carrying information from one neural network to another when the number of features changes; and an external memory part, which is responsible for restoring previous knowledge from old neural networks when that knowledge is needed in the current one. The models were evaluated on three datasets. The results showed that the ITM-OSELM improved accuracy over the benchmarks: the accuracy of ITM-OSELM was 91.69%, while the accuracy of FA-OSELM and OSELM was 24.39% and 19.56%, respectively.
Future work will investigate the applicability of ITM-OSELM in various machine learning fields, such as video-based classification and network intrusion detection. Furthermore, we will investigate the effect of the percentage of feature change in consecutive cycles on the performance of ITM-OSELM.

Conflicts of Interest:
The authors declare no conflicts of interest regarding this paper.