3.2. Data Preprocessing
Generally, intrusion detection is performed through data analysis. The data fall into two categories, continuous and discrete, and each category requires a different preprocessing operation.
For continuous data, the data are usually normalized before dimensionality reduction and classification to reduce the impact of differing feature scales on the experimental results. Normalization is the process of transforming a dimensional expression into a dimensionless one. When principal component analysis (PCA) or the LDA algorithm is used for dimensionality reduction, a covariance calculation is often involved. The Z-score normalization method eliminates the effects of dimensional variance and covariance, and it performs better than other normalization methods, so we use it here. Its formula is
$$z=\frac{x-\mu }{\sigma },$$
where
x is the original data input;
z is the normalized data output; and
$\mu $ and
$\sigma $ represent the mean and standard deviation of each dimension of the original dataset, respectively.
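As a concrete illustration, Z-score normalization can be applied column-wise to a feature matrix. The following is a minimal NumPy sketch; the function name and the sample values are ours, not from the paper:

```python
import numpy as np

def zscore(X):
    """Normalize each feature column to zero mean and unit variance."""
    mu = X.mean(axis=0)       # per-dimension mean
    sigma = X.std(axis=0)     # per-dimension standard deviation
    sigma[sigma == 0] = 1.0   # guard against constant features
    return (X - mu) / sigma

# Illustrative data: two features on very different scales.
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
Z = zscore(X)
```

After this transformation every column has mean 0 and standard deviation 1, so no single feature dominates the covariance computation in the reduction step.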
For discrete data, we use one-hot encoding, which expands the data features. It not only improves the nonlinear capability of the algorithm model, but also requires no normalization of the resulting parameters.
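For example, a discrete feature with three possible values expands into three binary features. A minimal sketch of the encoding (the category names are illustrative, not from the paper's dataset):

```python
def one_hot(values):
    """Encode a list of discrete values as one-hot vectors."""
    categories = sorted(set(values))                 # fixed category order
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for v in values:
        vec = [0] * len(categories)                  # all-zero vector
        vec[index[v]] = 1                            # set the matching slot
        vectors.append(vec)
    return categories, vectors

cats, encoded = one_hot(["tcp", "udp", "tcp", "icmp"])
```

Each discrete value becomes a binary indicator vector, so the single feature expands into one feature per category with no scaling required.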
3.3. The Proposed Algorithm
The basic idea of ILECA is to use a similarity measure function of the high-dimensional data space as a weight to improve the between-class scatter matrix. ILECA combines this with LDA to maximize the between-class distance and minimize the within-class distance, so as to obtain the optimal transformation matrix and reduce the dimensionality of the original data. Finally, ILECA uses the ELM classification algorithm to classify the data and determine the security status of the IoT devices.
Given a set D containing N training samples, $D=\left\{{x}_{k},{t}_{k}\right\},k=1,2,\cdots ,N$. Suppose ${x}_{ij}\in \left\{{x}_{k}\right\},{t}_{ij}\in \left\{{t}_{k}\right\},i=1,2,\cdots ,c,j=1,2,\cdots ,{n}_{i}$, where ${x}_{ij}$ is the jth sample feature vector of the ith class and ${t}_{ij}$ is the sample label corresponding to ${x}_{ij}$. The sample features have dimension d, so the total sample feature matrix can be expressed as ${X}^{N\times d}$; the samples have a total of c classes; and ${n}_{i}$ represents the number of samples of class i, i.e., $N={\sum}_{i=1}^{c}{n}_{i}$.
The total sample mean vector u and the class mean vector ${u}_{i}$ of the ith class are, respectively,
$$u=\frac{1}{N}\sum_{i=1}^{c}\sum_{j=1}^{{n}_{i}}{x}_{ij},\qquad {u}_{i}=\frac{1}{{n}_{i}}\sum_{j=1}^{{n}_{i}}{x}_{ij}.$$
Moreover, the within-class scatter matrix, the between-class scatter matrix, and the objective function of the transformation matrix are defined as follows.
Definition 1. The within-class scatter matrix ${S}_{w}$ is expressed as
$${S}_{w}=\frac{1}{N}\sum_{i=1}^{c}\sum_{j=1}^{{n}_{i}}\left({x}_{ij}-{u}_{i}\right){\left({x}_{ij}-{u}_{i}\right)}^{T}.$$
The within-class scatter matrix is the mean square of the distances between the samples of each class and their class center, and it represents the degree of dispersion within each class.
Definition 2. The between-class scatter matrix ${S}_{b}$ is expressed in terms of a high-dimensional data spatial similarity measurement function ${f}_{ij}$, which represents the spatial similarity of the class centers ${\mu}_{i}$ and ${\mu}_{j}$; ${\mu}_{i,k}$ and ${\mu}_{j,k}$ represent the mean values of classes i and j in dimension k, respectively; d is the feature dimension of the data; and ${n}_{i}$ and ${n}_{j}$ represent the number of samples of classes i and j, respectively. The between-class scatter matrix ${S}_{b}$ reflects the average of the distances between the centers of the classes, weighted by their spatial similarities, and the center of the total sample; it represents the dispersion between classes. The range of the high-dimensional data spatial similarity measurement function is (0, 1].
Definition 3. The objective function of the optimal transformation matrix ${A}^{*}$ is expressed as
$${A}^{*}=\underset{A}{\arg\max }\,\frac{{A}^{T}\left({S}_{b}-{S}_{w}\right)A}{{A}^{T}IA},$$
where A is the projection matrix and I is the identity matrix. According to the extreme value of the generalized Rayleigh quotient, we calculate the eigenvectors ${a}_{1},{a}_{2},\cdots ,{a}_{m}$ corresponding to the first m eigenvalues ${\lambda}_{1}>{\lambda}_{2}>\cdots >{\lambda}_{m}$ of ${I}^{-1}({S}_{b}-{S}_{w})$ and combine them into a matrix to obtain the optimal transformation matrix ${A}^{*}$, with $m=c-1$. Finally, the dimensionality-reduced sample feature vector is obtained through the matrix calculation
$${y}_{k}={{A}^{*}}^{T}{x}_{k},$$
where ${y}_{k}$ is the feature vector corresponding to ${x}_{k}$ after dimensionality reduction, and the dimensionality-reduced sample feature matrix is expressed as ${Y}^{N\times m}$. After the dimensionality reduction, the N samples are transformed into a sample set with new features ${D}^{\prime}=\{{y}_{k},{t}_{k}\},k=1,2,\dots ,N$, where ${y}_{k}={[{y}_{k1},{y}_{k2},\cdots ,{y}_{km}]}^{T}$ is the m-dimensional feature vector of the dimensionality-reduced data, ${t}_{k}={[{t}_{k1},{t}_{k2},\cdots ,{t}_{kc}]}^{T}$ is the sample label, and the samples have c classes.
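As a sketch of this reduction step, the following NumPy code uses the standard (unweighted) LDA scatter matrices; ILECA additionally weights ${S}_{b}$ by the similarity measure ${f}_{ij}$, which is omitted here, so this is an illustrative stand-in rather than the paper's exact method:

```python
import numpy as np

def lda_reduce(X, labels, m):
    """Project X (N x d) onto the m leading eigenvectors of S_b - S_w.

    Plain LDA scatters are used; ILECA additionally weights S_b by a
    pairwise spatial-similarity measure f_ij (not reproduced here).
    """
    N, d = X.shape
    u = X.mean(axis=0)                         # total sample mean
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        uc = Xc.mean(axis=0)                   # class mean
        Sw += (Xc - uc).T @ (Xc - uc)          # within-class scatter
        diff = (uc - u).reshape(-1, 1)
        Sb += len(Xc) * diff @ diff.T          # between-class scatter
    # Under the constraint A^T A = I, the objective reduces to the
    # eigenproblem of the symmetric matrix S_b - S_w.
    eigvals, eigvecs = np.linalg.eigh(Sb - Sw)
    order = np.argsort(eigvals)[::-1]          # sort eigenvalues descending
    A = eigvecs[:, order[:m]]                  # transformation matrix A*
    return X @ A, A
```

The returned matrix `A` plays the role of ${A}^{*}$, and `X @ A` gives the reduced feature matrix ${Y}^{N\times m}$.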
As shown in
Figure 3, the new sample set obtained after dimensionality reduction is input into a single-hidden-layer neural network. The network with L hidden-layer nodes can be expressed as
$$\sum_{i=1}^{L}{\beta}_{i}\,g\left({w}_{i}^{T}\cdot {y}_{k}+{b}_{i}\right)={o}_{k},\quad k=1,2,\cdots ,N,$$
where
${w}_{i}={[{w}_{i1},{w}_{i2},\cdots ,{w}_{im}]}^{T}$ is the input weight between the
ith hidden layer node and the input layer node,
${b}_{i}$ is the offset of the
ith hidden layer node,
${\beta}_{i}$ is the output weight between the
ith hidden layer node and the output layer node,
$g\left(x\right)$ is the activation function, and
${w}_{i}^{T}\cdot {y}_{k}$ is the inner product of
${w}_{i}^{T}$ and
${y}_{k}$. The input weight
${w}_{i}$ and offset
${b}_{i}$ in the function are random numbers between (−1, 1) or (0, 1).
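The hidden-layer mapping can be sketched as follows, assuming a sigmoid activation for $g(x)$ and random weights drawn from (−1, 1); all sizes are illustrative:

```python
import numpy as np

def hidden_output(Y, W, b):
    """Compute the ELM hidden-layer output matrix H (N x L).

    Y: N x m reduced features; W: L x m input weights; b: L offsets.
    Entry H[k, i] = g(w_i . y_k + b_i) with a sigmoid activation g.
    """
    g = lambda x: 1.0 / (1.0 + np.exp(-x))    # sigmoid activation
    return g(Y @ W.T + b)

rng = np.random.default_rng(0)
Y = rng.standard_normal((5, 3))               # 5 samples, m = 3 features
L = 4                                         # number of hidden nodes
W = rng.uniform(-1, 1, (L, 3))                # random input weights w_i
b = rng.uniform(-1, 1, L)                     # random offsets b_i
H = hidden_output(Y, W, b)
```

Because $W$ and $b$ are drawn once at random and never trained, computing $H$ is a single matrix multiplication followed by the element-wise activation.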
To minimize the error between the network output and the label of the corresponding sample data, an objective function is established as
$$\underset{\beta}{\min}\sum_{k=1}^{N}\left\Vert {o}_{k}-{t}_{k}\right\Vert,$$
that is, there exist ${\beta}_{i}$, ${w}_{i}$, and ${b}_{i}$ such that
$$\sum_{i=1}^{L}{\beta}_{i}\,g\left({w}_{i}^{T}\cdot {y}_{k}+{b}_{i}\right)={t}_{k},\quad k=1,2,\cdots ,N.$$
The above N equations can be expressed in matrix form as
$$H\beta =T,$$
where
H is the $N\times L$ output matrix of the hidden-layer nodes, whose entry in row k and column i is $g({w}_{i}^{T}\cdot {y}_{k}+{b}_{i})$,
$\beta $ is the output weight matrix, and
T is the expected output.
According to Equation (13), as long as the input weight ${w}_{i}$ and the offset ${b}_{i}$ are randomly determined, the output matrix H is uniquely determined. The Moore–Penrose generalized inverse ${H}^{\dagger}$ of H is then used to determine the minimum-norm least-squares solution $\beta$ [25, 26]:
$$\beta ={H}^{\dagger}T.$$
As can be seen from Equation (14), to obtain better generalization, a positive value $I/C$ is added to the diagonal of $H{H}^{T}$ or ${H}^{T}H$; this repairs the matrix and ensures that it has full rank. The classifier training process is given as follows.
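A minimal sketch of this regularized solution, using the ${H}^{T}H$ form (appropriate when $N>L$); the regularization constant C and the test matrices are illustrative assumptions:

```python
import numpy as np

def solve_beta(H, T, C=1.0):
    """Minimum-norm least-squares output weights with ridge term I/C.

    Solves (H^T H + I/C) beta = H^T T, the regularized normal equations.
    """
    L = H.shape[1]
    return np.linalg.solve(H.T @ H + np.eye(L) / C, H.T @ T)

rng = np.random.default_rng(1)
H = rng.standard_normal((10, 4))   # hidden-layer outputs: 10 samples, L = 4
T = rng.standard_normal((10, 2))   # expected outputs for 2 classes
beta = solve_beta(H, T, C=100.0)
```

Adding $I/C$ to the diagonal makes ${H}^{T}H+I/C$ strictly positive definite, so the linear solve always succeeds even when ${H}^{T}H$ is rank-deficient.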
As shown in
Figure 4, we reduce the training data to generate the transformation matrix ${A}^{*}$ and input the dimensionality-reduced training data into the ELM classifier to calculate the final weight $\beta$. The dimensionality-reduced test data are then input into the ELM classifier for classification, and the prediction results for the test data are output.
Figure 5 shows the flow chart of ILECA. The specific process of ILECA is described as follows, and the ILECA pseudocode is shown in Algorithm 1.
Algorithm 1: ILECA
Input: training set $D=\left\{{x}_{k},{t}_{k}\right\},k=1,2,\dots ,N$; test set $DT=\left\{T{x}_{k},T{t}_{k}\right\},k=1,2,\dots ,{n}_{t}$
Output: expected classification matrix T
1: formulate the feature matrix X for D
2: $X=Zscore\left(X\right)$
3: calculate ${S}_{b}$, ${S}_{w}$, and ${S}_{b}-{S}_{w}$
4: obtain ${A}^{*}$ by solving the eigenproblem of ${I}^{-1}({S}_{b}-{S}_{w})$
5: calculate $Y=X{A}^{*}$ and obtain the new training data ${D}^{\prime}=\left\{{y}_{k},{t}_{k}\right\}$
6: generate ${w}_{i}$ and ${b}_{i}$ randomly, and set the number of hidden neurons L
7: calculate the output of the hidden neurons H according to Equation (13)
8: calculate the output weight of the classifier $\beta$ according to Equation (14)
9: formulate the feature matrix ${X}_{t}$ for $DT$
10: ${X}_{t}=Zscore\left({X}_{t}\right)$
11: ${Y}_{t}={X}_{t}{A}^{*}$
12: calculate the output of the hidden neurons ${H}_{t}$ for the test data according to Equation (13)
13: $T={H}_{t}\beta$ according to Equation (12)
14: return T
Step 1: Perform Z-score normalization on the training samples according to Equation (1).
Step 2: Calculate the within-class scatter matrix ${S}_{w}$ according to Equation (4), and calculate the between-class scatter matrix ${S}_{b}$ according to Equation (5).
Step 3: Establish the objective function according to Equation (7), calculate ${I}^{-1}({S}_{b}-{S}_{w})$, and solve the eigenproblem to obtain the eigenvalues and eigenvectors. Take the eigenvectors corresponding to the m largest eigenvalues as the transformation matrix ${A}^{*}$, with $m=c-1$.
Step 4: Calculate $Y=X{A}^{*}$ according to Equation (8), and obtain the new training data ${D}^{\prime}=\left\{{y}_{k},{t}_{k}\right\}$.
Step 5: Generate ${w}_{i}$ and ${b}_{i}$ randomly, and set the number of hidden neurons L.
Step 6: Calculate the output of the hidden neurons H according to Equation (13).
Step 7: Calculate the output weight of the classifier $\beta$ according to Equation (14).
Step 8: Calculate ${Y}_{t}={X}_{t}{A}^{*}$.
Step 9: Calculate the output of the hidden neurons ${H}_{t}$ for the test data according to Equation (13).
Step 10: Calculate the output for the test data by Equation (12) with ${H}_{t}$ and $\beta$.
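The ten steps above can be wired together into one compact end-to-end sketch. As before, the standard LDA scatters stand in for ILECA's similarity-weighted ${S}_{b}$, a sigmoid is assumed for $g$, and all parameter values are illustrative:

```python
import numpy as np

def fit_predict(X, t, Xt, L=20, C=100.0, seed=0):
    """Illustrative ILECA-style pipeline: Z-score -> reduce -> ELM classify."""
    rng = np.random.default_rng(seed)
    # Step 1 (and 8's precondition): Z-score using training statistics.
    mu, sigma = X.mean(0), X.std(0)
    sigma[sigma == 0] = 1.0
    X, Xt = (X - mu) / sigma, (Xt - mu) / sigma
    # Steps 2-4: scatter matrices, eigenproblem, transformation matrix A*.
    d = X.shape[1]
    u = X.mean(0)
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    classes = np.unique(t)
    for c in classes:
        Xc = X[t == c]
        uc = Xc.mean(0)
        Sw += (Xc - uc).T @ (Xc - uc)
        dv = (uc - u).reshape(-1, 1)
        Sb += len(Xc) * dv @ dv.T
    eigvals, V = np.linalg.eigh(Sb - Sw)
    A = V[:, np.argsort(eigvals)[::-1][: len(classes) - 1]]  # m = c - 1
    Y, Yt = X @ A, Xt @ A                                     # Steps 4 and 8
    # Steps 5-7: random hidden layer, then solve the output weights beta.
    m = Y.shape[1]
    W = rng.uniform(-1, 1, (L, m))
    b = rng.uniform(-1, 1, L)
    g = lambda z: 1.0 / (1.0 + np.exp(-z))
    H = g(Y @ W.T + b)
    T = np.eye(len(classes))[np.searchsorted(classes, t)]     # one-hot labels
    beta = np.linalg.solve(H.T @ H + np.eye(L) / C, H.T @ T)
    # Steps 9-10: hidden output for test data, then classify.
    Ht = g(Yt @ W.T + b)
    return classes[np.argmax(Ht @ beta, axis=1)]
```

Only ${A}^{*}$, $W$, $b$, and $\beta$ are carried from training to testing, matching the flow in Figure 4.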