3.1. Data Preprocessing
IoT terminal device traffic is collected so that device behavior can be identified from communication data, which supports system security and performance management and helps ensure the normal operation and data security of IoT terminal devices. To this end, traffic from IoT terminal devices in the IoT environment was collected using tools such as Wireshark; Figure 2 depicts the traffic collection process. In this study, the IoT terminal device traffic datasets employed in [12,26] were primarily used for IoT terminal device identification. The raw IoT terminal device traffic in the dataset was stored in the pcap file format and had the following features:
The raw IoT terminal device traffic file contained multiple traffic packets from these devices. Therefore, the file size was large, which increased the processing time required for data preprocessing.
Given the large amount of redundant information in the IoT terminal device traffic packets, directly using the raw traffic data as the raw features would reduce the recognition efficiency of the IoT terminal device identification model.
The data in the IoT terminal device traffic file was stored in the hexadecimal format, which cannot be directly used as input for neural network models. Converting the data to images is also advantageous for augmentation: synthetic image generation is more mature and stable than generating long, discrete 1D sequences, which yields higher-quality samples for balancing the dataset.
By leveraging CNN architectures (such as ResNet18) on image data, the model gains a degree of translation invariance. This means it can recognize specific device fingerprints even if their position shifts slightly within the payload.
Therefore, a data preprocessing algorithm based on the features of IoT terminal device traffic files was proposed; its workflow is shown in Figure 3. The data preprocessing method consisted of three main parts: traffic segmentation, feature selection, and grayscale image generation. The pseudo-code of data preprocessing is shown in Algorithm 1, where the original IoT terminal device traffic file, in pcap format, is denoted by T. The sample set of IoT terminal device traffic files was obtained after data preprocessing. In the Payload-to-Image conversion stage, raw packets are normalized and mapped into grayscale pixel values. This process functions as a form of lossy abstraction, making it computationally difficult to reconstruct the original sensitive plaintext from the generated images.
Algorithm 1: DPP Algorithm
Input: T // Original IoT terminal device network traffic file
Output: sample set of IoT terminal device traffic files (grayscale images)
t = SplitCap(T)
for each flow in t do
    lists = rdp(flow)
    for each list in lists do
        if list.haslayer(TCP) then
            Payload = list[TCP].payload.original
        end
        if list.haslayer(UDP) then
            Payload = list[UDP].payload.original
        end
        temp = HexToDec(Payload)
        Payload_list.append(temp)
    end
    for each plist in Payload_list do
        if len(plist) ≥ threshold then
            File_list.append(plist[0:threshold])
        end
        if len(plist) < threshold then
            File_list.append(pad(plist, threshold))
        end
    end
    for each file in File_list do
        convert file to a grayscale image and add it to the sample set
    end
end
return sample set
3.1.1. Flow Segmentation
To extract the traffic files of different devices from the mixed IoT terminal device traffic files, the traffic should be segmented based on the quintuple information (source IP address, source port, destination IP address, destination port, and transport protocol) in the IoT terminal device traffic data packets. The SplitCap tool was utilized in this study to achieve this goal. Traffic segmentation enables IoT terminal device identification models to quickly and accurately process the traffic data from different IoT terminal devices.
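The grouping idea behind quintuple-based segmentation can be sketched as follows. This is not SplitCap's implementation; the packet records, field names, and the direction-sensitive key are hypothetical choices made only for illustration.

```python
from collections import defaultdict

def split_by_five_tuple(packets):
    """Group packet records by (src_ip, src_port, dst_ip, dst_port, proto)."""
    flows = defaultdict(list)
    for pkt in packets:
        key = (pkt["src_ip"], pkt["src_port"],
               pkt["dst_ip"], pkt["dst_port"], pkt["proto"])
        flows[key].append(pkt)
    return dict(flows)

# Hypothetical records; in practice they would be parsed from the pcap file.
packets = [
    {"src_ip": "192.168.1.10", "src_port": 50000, "dst_ip": "10.0.0.5",
     "dst_port": 443, "proto": "TCP", "payload": b"\x16\x03\x01"},
    {"src_ip": "192.168.1.20", "src_port": 5353, "dst_ip": "224.0.0.251",
     "dst_port": 5353, "proto": "UDP", "payload": b"\x00\x00"},
    {"src_ip": "192.168.1.10", "src_port": 50000, "dst_ip": "10.0.0.5",
     "dst_port": 443, "proto": "TCP", "payload": b"\x17\x03\x03"},
]
flows = split_by_five_tuple(packets)
```

Each key in `flows` then corresponds to one segmented traffic file that the later preprocessing steps handle independently.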
3.1.2. Feature Selection
An in-depth analysis of the raw IoT terminal device traffic was conducted to improve the accuracy and efficiency of the IoT terminal device identification model. The payload information in the traffic packets from IoT terminal devices was fully utilized because raw traffic data features play an important role in improving the recognition performance of the method. The reasons for this are as follows:
IoT terminal devices usually use specific protocols and application layer data formats when communicating with each other. In this process, the payload information contains identifiable patterns and data structures related to different IoT terminal devices. Analyzing this information in depth helps in efficiently distinguishing the communication features between different IoT terminal devices.
Different IoT terminal devices may apply communication protocols in unique ways, resulting in differences in the payload information in the IoT terminal device traffic. Adequate consideration of these differences allows for more accurate identification of IoT terminal device types and IoT terminal device manufacturer information, which can be used to improve the accuracy of the overall IoT terminal device identification model.
The payload information from the IoT terminal device traffic data packets can be used to detect potential security threats to the IoT terminal device or any abnormal behavior of the device. By monitoring this information, the potential security risks can be identified in a timely manner, thus enhancing the security of the IoT system.
Overall, the payload information from the raw IoT terminal device traffic was selected as the raw traffic feature for device identification, which also shortens the identification time.
3.1.3. Grayscale Image Conversion
The payload data extracted from the feature selection of the IoT terminal device traffic are not directly available as input to the deep learning models. This is because payload data are stored in the hexadecimal format in the IoT terminal device traffic and need to be converted to decimal format first.
Furthermore, as depicted in Figure 3, payload data lengths vary across the network traffic packets. To ensure the input consistency of the IoT terminal device identification model, the following steps were implemented. First, the payload data were converted into the decimal format. Next, a payload length threshold was set for each dataset according to the payload data lengths in the different IoT terminal device traffic datasets displayed in Figure 4 and Figure 5. The payload lengths of the different data packets in the IoT terminal device traffic were then processed as follows (Figure 6):
When the payload data length of the packet in the IoT terminal device traffic was greater than the threshold, the payload information was clipped to the threshold length.
When the payload data length of the packet in the IoT terminal device traffic was less than the threshold, a zero-fill operation was performed on the payload information.
Finally, the processed payload data with unified lengths were converted into grayscale images, with each decimal value mapped to one pixel value. This process ensured a consistent input length for the IoT terminal device identification model, providing a uniform input for accurate identification at the level of the specific device model.
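The clip, zero-fill, and pixel-mapping steps can be illustrated with a minimal sketch. The threshold of 16 bytes and the image width of 4 are hypothetical values chosen to keep the example small; they are not the thresholds used in this study.

```python
def payload_to_image(payload: bytes, length: int, width: int):
    """Clip or zero-pad a payload to `length` bytes, then reshape it into pixel rows."""
    if length % width != 0:
        raise ValueError("length must be a multiple of width")
    values = list(payload[:length])           # each byte becomes a decimal value in 0..255
    values += [0] * (length - len(values))    # zero-fill payloads shorter than the threshold
    return [values[i:i + width] for i in range(0, length, width)]

# Hypothetical threshold (16) and width (4).
short = payload_to_image(bytes.fromhex("160301"), length=16, width=4)
long_ = payload_to_image(bytes(range(32)), length=16, width=4)
```

Here `short` is padded with zeros after its three payload bytes, while `long_` keeps only its first 16 bytes; both become 4 × 4 grids of grayscale values.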
3.2. Data Augmentation
With the completion of data preprocessing, IoT terminal device traffic in the dataset was converted from the pcap file format to traffic data samples in a uniform grayscale image format. However, different IoT terminal devices come from different manufacturers and have unique functions and hardware and software compositions, which leads to large differences in the volume of network traffic they generate. For example, network traffic generated by surveillance devices (e.g., cameras) contained a large number of data packets, whereas traffic generated by other sensing devices contained few packets. Similarly, in the smart home domain, the network traffic generated by devices such as smart light bulbs and smart sockets contained relatively few data packets. As a result, the number of traffic data samples produced by data preprocessing differed greatly across IoT terminal devices.
When traffic data samples from IoT terminal devices are fed into a neural network model for training, the model implicitly treats them as unbiased samples from the true data distribution. Therefore, a significant difference in the number of IoT terminal device traffic samples in the training set can affect the results of the IoT terminal device identification model. Specifically, IoT terminal devices that feature a large number of traffic samples are more likely to be correctly identified by the model, whereas those with relatively few traffic samples are more prone to being misidentified as devices with a higher volume of traffic samples. Therefore, to address the problem of unbalanced traffic samples for IoT terminal devices, the dataset generated via data preprocessing was rebalanced to improve the identification accuracy.
A GAN was used to generate additional traffic samples for IoT terminal devices with a relatively small sample size in the dataset. The GAN is a deep learning framework proposed by Ian Goodfellow et al. in 2014. It is distinctive in that it is built on two mutually adversarial neural network models: the generator and the discriminator. The framework draws on the concept of “adversarial games” in game theory; during the training process, the generator and the discriminator compete against each other like two players. In this process, the goal of the generator is to produce synthetic samples that are similar to real traffic samples, whereas the task of the discriminator is to distinguish between real and synthetic samples as accurately as possible. This competitive dynamic enables the generator to gradually learn to generate more realistic traffic samples to confuse the discriminator, whereas the discriminator continuously improves its ability to distinguish between genuine and fake samples.
We employed the NTGAN-based augmentation rather than simpler methods like random over-sampling (ROS) for several reasons:
Preventing Overfitting: Simple over-sampling replicates existing minority samples, which often leads to overfitting as the model merely memorizes specific data points. In contrast, NTGAN generates new, synthetic samples that follow the underlying distribution of the minority class, enhancing the model’s generalization.
Preserving Feature Diversity: Unlike SMOTE, whose interpolation between samples may introduce noise or blur in the image domain, or ROS, which adds no new patterns, NTGAN captures the high-dimensional latent patterns of IoT traffic images. It helps ensure that the generated 'synthetic traffic' maintains the structural integrity and statistical characteristics required for precise fingerprinting.
Improved Stability: By incorporating a loss-threshold filtering mechanism, NTGAN filters out low-quality or non-representative samples, ensuring that the augmented dataset is both balanced and high-fidelity.
GAN was used in this study to generate additional traffic samples for balancing the distribution of traffic samples of IoT terminal devices in the dataset, thereby improving the performance of identifying devices with a low number of samples. The input to the generator was arbitrary noise, $z$, from which a batch of spurious data, $G(z)$, was generated. The loss function during generator training was $L_G$. The inputs to the discriminator were spurious data samples generated by the generator and real data samples, and the task of the discriminator was to distinguish whether the data was from the real sample dataset or the spurious sample set from the generator. Assume that the discriminator input is x, which is drawn from either the real samples or the generated samples; the output of the discriminator, $D(x)$, is the probability that x is a real data sample; and the loss function during the training of the discriminator is $L_D$. The goals of the generator and discriminator are pursued through continuous adversarial training, i.e., the generator generates spurious data samples that are as realistic as possible to deceive the discriminator, whereas the discriminator learns to more accurately recognize the deceptive data in x. The objective function, V, of the GAN is expressed in Equation (1):
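For reference, the standard GAN minimax objective is given below, on the assumption that Equation (1) follows the original formulation. Here $z$ is the generator's noise input, $G(z)$ the generated sample, and $D(x)$ the discriminator output; the distribution names $p_{data}$ (real data) and $p_{z}$ (noise) are assumed symbol names.

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right] \tag{1}
```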
When NTGAN is trained, the loss values, $L_G$ and $L_D$, of the generator and discriminator, respectively, are calculated by applying Equations (2) and (3):
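One common minibatch form of these losses, consistent with the standard objective $V(D, G) = \mathbb{E}[\log D(x)] + \mathbb{E}[\log(1 - D(G(z)))]$, is sketched below. The symbol names $L_G$ and $L_D$, the sign convention, and the use of the saturating generator loss are assumptions; implementations often use an equivalent binary cross-entropy form instead.

```latex
\begin{align}
L_G &= \frac{1}{m} \sum_{i=1}^{m} \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right) \tag{2} \\
L_D &= -\frac{1}{m} \sum_{i=1}^{m} \left[ \log D\left(x^{(i)}\right)
       + \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right) \right] \tag{3}
\end{align}
```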
where m is the number of samples in each batch during the training process, $z^{(i)}$ is the noise vector obtained by sampling the random noise distribution of the generator, $D(G(z^{(i)}))$ is the output of the discriminator for the data produced by the generator, $x^{(i)}$ are the real data samples, and $D(x^{(i)})$ is the output of the discriminator for the real data samples.
In summary, the NTGAN algorithm, i.e., the data enhancement module shown in Figure 7, was proposed to address the issue of insufficient sample numbers for certain devices in the IoT terminal device traffic sample dataset.
The standard GAN has difficulty generating IoT terminal device traffic sample data for devices with few traffic samples. Because the GAN cannot learn a sufficient number of traffic features from these IoT terminal devices, the generated samples are not realistic enough to mimic the actual ones. To solve this issue, the standard GAN method was modified in this work.
First, for IoT terminal devices with low sample numbers, loss value thresholds were set separately for the generator and the discriminator. During NTGAN training, when the loss values of both the generator and the discriminator were less than their thresholds, traffic samples generated by the generator were merged with the real traffic samples so that the NTGAN could learn more information about the traffic features of these IoT terminal devices. This threshold mechanism helps ensure that only high-quality, stable synthetic samples are added to the training dataset, preventing performance degradation due to noisy or unrepresentative data. Finally, after training on a certain number of batches, the generator produced traffic samples for IoT terminal devices to balance the number of samples in the dataset. By introducing thresholds, this data augmentation approach increased the quality of the generated samples, thus enhancing the ability of the model to learn the traffic features of IoT terminal devices with a small number of samples. Moreover, this strategy reduces the risk of overfitting to the majority class by offering more diverse and representative training data for minority-class devices.
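The threshold rule can be sketched in isolation. The example below replays a list of recorded (generator loss, discriminator loss) pairs instead of training real networks; the function name, the stand-in generator, and all threshold values are hypothetical.

```python
def augment_with_threshold(loss_history, real_samples, generate,
                           g_threshold, d_threshold, n_new):
    """Merge synthetic samples into the pool only when both losses are below their thresholds."""
    pool = list(real_samples)
    for g_loss, d_loss in loss_history:
        if g_loss < g_threshold and d_loss < d_threshold:
            pool.extend(generate(n_new))
    return pool

def fake_generator(n):
    # Hypothetical stand-in for sampling n images from the trained generator.
    return ["synthetic"] * n

# Hypothetical per-batch loss values.
history = [(2.1, 0.9), (0.8, 0.6), (0.7, 1.4), (0.5, 0.4)]
pool = augment_with_threshold(history, ["real"] * 3, fake_generator,
                              g_threshold=1.0, d_threshold=0.7, n_new=2)
```

Only the second and fourth batches satisfy both thresholds, so synthetic samples are merged twice. In an actual training loop the losses would be computed per batch and the networks updated before the check.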
The NTGAN training process is outlined in detail in Algorithm 2. Initially, the raw traffic data from IoT terminal devices, which is in pcap format, undergoes preprocessing to convert it into a set of traffic sample files in png format. These png files serve as input data for the NTGAN module. The primary role of the NTGAN module is to address class imbalance in the dataset by generating additional synthetic samples specifically for the IoT terminal devices that are under-represented. By creating these new samples, NTGAN effectively augments the dataset, ensuring a more balanced distribution of classes. In doing so, the module improves the representational fairness across all device categories, thereby contributing to a more robust and generalizable identification model.
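How many synthetic samples each under-represented device needs can be estimated with a short sketch. Balancing every class up to the size of the largest class is an assumption made here for illustration; the device names and counts are hypothetical.

```python
from collections import Counter

def samples_needed(labels, target=None):
    """Return, per device, how many synthetic samples are needed to reach `target`."""
    counts = Counter(labels)
    if target is None:
        target = max(counts.values())  # assumption: balance up to the largest class
    return {device: max(0, target - n) for device, n in counts.items()}

# Hypothetical label list for three devices.
labels = ["camera"] * 5 + ["smart_bulb"] * 2 + ["smart_socket"] * 1
deficits = samples_needed(labels)
```

The resulting dictionary gives the generation budget per device, with zero for devices that already have enough samples.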
At the conclusion of the process, the output is a comprehensive IoT terminal device traffic dataset, stored as an image file named File_png, which encapsulates the balanced traffic data for subsequent analysis or model training. Let the loss value of the generator be $L_G$ and the loss value of the discriminator be $L_D$.
Algorithm 2: NTGAN Algorithm
Input: , , , , ,
Output: File_png
for each device in the preprocessed sample set do
    if the device is under-represented then
        Set generator_model and discriminator_model parameters, Epoch, Batch_size, etc.
        for each epoch in Epoch do
            for each batch in the device's samples do
                Generate 100 dimensions of random noise
                Input noise into the generator to generate false samples
                Input true samples and false samples to the discriminator
                Calculate the generator loss value with Equation (2)
                Calculate the discriminator loss value with Equation (3)
                Update generator and discriminator parameters based on their respective loss values
                if the generator loss < its threshold and the discriminator loss < its threshold then
                    Use the generator to generate 1000 samples and append to the real samples
                end
            end
        end
        Use the generator to generate samples and append to File_png
    end
end
return File_png
3.3. ResNet18-BiLSTM Model for IoT Terminal Device Identification
Feeding the raw IoT terminal device traffic features directly into the Softmax classifier (a common multi-class classification function) does not identify IoT terminal devices efficiently or accurately. To address this problem, a neural network model must be employed to mine the deep features in the raw IoT terminal device traffic. In this study, a ResNet18-BiLSTM IoT terminal device identification model is proposed.
Figure 8 shows the neural network structure of the ResNet18-BiLSTM model. First, the ResNet18 structure in the ResNet18-BiLSTM model is used to mine the spatial features in IoT terminal device traffic. Subsequently, the BiLSTM structure is used to extract the temporal features in the IoT terminal device traffic. Finally, the Softmax classifier is used to identify the IoT terminal devices.
To bridge the spatial feature representation of ResNet18 and the sequential modeling of BiLSTM, the output feature map of the ResNet18 backbone is first processed by an AdaptiveAvgPool2d layer. This pooling operation downsamples the spatial dimensions to a fixed size to optimize computational efficiency and control the sequence length. Subsequently, the pooled tensor is reshaped by flattening the spatial dimensions into a sequence, which is then fed into the BiLSTM layer. After the recurrent processing of the sequence, a flatten operation is applied to convert the sequential hidden states into a single-dimensional feature vector, which is finally mapped to the class probabilities via the densely connected layer.
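The shape bookkeeping of this bridge can be traced with a small sketch. The layout (spatial positions as time steps, channels as features) and every numeric value are assumptions for illustration, except that 512 is the channel count of the last stage of the standard ResNet18.

```python
def resnet_to_bilstm_shapes(batch, channels, pooled_h, pooled_w,
                            hidden_size, num_classes):
    """Trace tensor shapes from the pooled ResNet18 feature map to the class scores."""
    seq_len = pooled_h * pooled_w
    return {
        "pooled": (batch, channels, pooled_h, pooled_w),   # after AdaptiveAvgPool2d
        "sequence": (batch, seq_len, channels),            # spatial positions become time steps
        "bilstm_out": (batch, seq_len, 2 * hidden_size),   # forward and backward states concatenated
        "flattened": (batch, seq_len * 2 * hidden_size),   # input to the dense layer
        "logits": (batch, num_classes),                    # one score per device class
    }

# Hypothetical sizes: batch 32, 2 x 2 pooled map, 128 hidden units, 10 device classes.
shapes = resnet_to_bilstm_shapes(batch=32, channels=512, pooled_h=2, pooled_w=2,
                                 hidden_size=128, num_classes=10)
```

A smaller pooled size shortens the sequence seen by the BiLSTM, which is how the pooling layer controls the recurrent cost.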
3.3.1. ResNet18
After the raw network traffic data generated by the IoT terminal devices were preprocessed and augmented, the obtained data were stored as grayscale images. In this study, the ResNet18 model was used to fully exploit the spatial features in the traffic of the IoT terminal devices. ResNet18 is a deep convolutional neural network model. Unlike the traditional convolutional neural network structure, ResNet18 mainly consists of an input layer, a convolutional layer, a pooling layer, residual blocks, a global average pooling layer, and a fully connected layer. In this study, the input layer of ResNet18 received the IoT terminal device traffic data. The convolutional and pooling layers were used to extract low-level features from the IoT terminal device traffic. The global average pooling layer was located at the top of ResNet18 and was used to reduce the dimensions of the feature maps by converting them into vectors.
In addition, ResNet18 introduces a residual block structure to mitigate training problems such as vanishing and exploding gradients. Residual blocks are mainly of two types: basic blocks and bottleneck blocks. Basic blocks consist of convolutional layers, batch normalization layers, rectified linear unit (ReLU) activation functions, and skip connections. The convolutional kernel size of the convolutional layer is usually $3 \times 3$, and the layer is often used with a batch normalization layer and ReLU activation function. The skip connection is implemented by adding the inputs directly to the outputs and is used to retain more low-level features, helping to avoid the vanishing gradient problem. The basic block is commonly used in shallower ResNet models such as ResNet18. The bottleneck block usually consists of $1 \times 1$, $3 \times 3$, and $1 \times 1$ convolutional layers, batch normalization layers, ReLU activation functions, and a skip connection. This structure reduces the number of parameters while increasing the depth of the neural network and is commonly used in deeper ResNet models. In this study, the basic block structure was used to mine the spatial features of the IoT terminal device traffic at a deeper level.
Figure 9 shows the residual block network structure used in this work. Compared with the standard convolutional neural network, the ResNet18 can learn spatial features at a deeper level in the IoT terminal device traffic and improve the accuracy of the IoT terminal device identification model.
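The role of the skip connection can be shown with a toy sketch. The real basic block applies two convolutions with batch normalization; here an arbitrary function stands in for that transformation, which is a simplification made only to expose the addition of the input to the output.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def basic_block(x, transform):
    """Compute y = ReLU(F(x) + x), where the skip connection adds the input back."""
    fx = transform(x)
    return relu([a + b for a, b in zip(fx, x)])

x = [1.0, -2.0, 3.0]
# If the learned transformation F outputs zeros, the block reduces to ReLU(x).
passthrough = basic_block(x, lambda v: [0.0 for _ in v])
# A hypothetical non-trivial transformation, F(x) = 0.5 * x.
scaled = basic_block(x, lambda v: [0.5 * a for a in v])
```

Because the input is always added back, the block only has to learn a correction to the identity, which is what eases gradient flow in deep networks.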
3.3.2. BiLSTM Neural Network
LSTM is a type of recurrent neural network. The difference between LSTM and a standard recurrent neural network is that LSTM mitigates the problems of gradient explosion and gradient vanishing by maintaining a cell state and using a gate structure. It performs well in extracting temporal features from sequential data. BiLSTM is a bidirectional recurrent neural network. Compared with LSTM, BiLSTM can extract the bidirectional time series features of IoT terminal device traffic. The BiLSTM structure consists of input, forget, and output gates, memory cells, and hidden states. At time t, the forget gate ($f_t$), input gate ($i_t$), and output gate ($o_t$) are calculated as shown in Equations (4)–(6), respectively:
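In the standard LSTM formulation, which Equations (4)–(6) are assumed to follow, the three gates share the same form and differ only in their parameters:

```latex
\begin{align}
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{4} \\
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{5} \\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \tag{6}
\end{align}
```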
where $W_f$, $W_i$, and $W_o$ and $b_f$, $b_i$, and $b_o$ are the weight matrices and neural network bias values of the forget, input, and output gates, respectively; $h_{t-1}$ denotes the hidden state at time $t-1$; $\sigma$ is the activation function; and $x_t$ denotes the input traffic feature information at time t.
At time t, multiplying the forget gate value, $f_t$, by the previous cell state, $C_{t-1}$, and adding the candidate cell state, $\tilde{C}_t$, multiplied by the input gate value, $i_t$, yields the current cell state, $C_t$, as shown in Equations (7) and (8).
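In the standard LSTM formulation, which Equations (7) and (8) are assumed to follow, the candidate cell state and the updated cell state are computed as shown below. The parameter names $W_C$ and $b_C$ are assumed symbols, and $\odot$ denotes element-wise multiplication.

```latex
\begin{align}
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \tag{7} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{8}
\end{align}
```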
The unimportant IoT terminal device traffic information transmitted by the unit at time $t-1$ is filtered out by the forget gate. Subsequently, the important IoT terminal device traffic information at time t is extracted by the input gate. Finally, the hidden state, $h_t$, at time t is calculated by the output gate, as shown in Equation (9).
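In the standard LSTM formulation, which Equation (9) is assumed to follow, the hidden state is the output gate value applied element-wise to the squashed cell state:

```latex
h_t = o_t \odot \tanh\left(C_t\right) \tag{9}
```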
BiLSTM combines the forward LSTM hidden states, $\overrightarrow{h_t}$, and reverse LSTM hidden states, $\overleftarrow{h_t}$, to generate a new hidden state, $h_t$, which is computed as shown in Equation (10).
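The gate, cell state, and hidden state updates, together with the bidirectional combination, can be sketched with a scalar LSTM. The sketch assumes the standard LSTM equations and that Equation (10) is the usual concatenation $h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}]$; weights are scalars instead of matrices, and all parameter values are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM step; p maps a gate name to (w_h, w_x, b)."""
    def affine(name):
        w_h, w_x, b = p[name]
        return w_h * h_prev + w_x * x_t + b
    f_t = sigmoid(affine("f"))            # forget gate
    i_t = sigmoid(affine("i"))            # input gate
    o_t = sigmoid(affine("o"))            # output gate
    c_tilde = math.tanh(affine("c"))      # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde    # updated cell state
    h_t = o_t * math.tanh(c_t)            # hidden state
    return h_t, c_t

def run_lstm(xs, p):
    h, c, hidden_states = 0.0, 0.0, []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        hidden_states.append(h)
    return hidden_states

def bilstm(xs, p_fwd, p_bwd):
    """Pair each forward hidden state with the backward state for the same time step."""
    forward = run_lstm(xs, p_fwd)
    backward = run_lstm(xs[::-1], p_bwd)[::-1]
    return list(zip(forward, backward))

# Hypothetical parameters shared by both directions.
params = {name: (0.5, 0.5, 0.0) for name in "fioc"}
states = bilstm([0.1, 0.4, -0.2], params, params)
```

Each entry of `states` holds the forward and backward hidden states for one time step, so every position in the sequence sees context from both directions.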
3.3.3. ResNet18-BiLSTM
This study integrated ResNet18 and BiLSTM neural networks to capture temporal and spatial features of IoT terminal device traffic, proposing the ResNet18-BiLSTM identification model. The pseudo-code for this model is shown in Algorithm 3.
Initially, raw IoT traffic samples were processed with the NTGAN module to generate additional samples for under-represented devices, mitigating data imbalance and yielding the standardized dataset, File_png, in grayscale image format. This dataset was randomly split into training, validation, and test sets in a 6:2:2 ratio. Model training then proceeded by setting the hyperparameters and iteratively processing batches of the training and validation sets, with the former used for training and the latter for validation. Ultimately, the trained IoT device identification model M was obtained.
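A random 6:2:2 split can be sketched as follows. The function name and the fixed seed are hypothetical; integer arithmetic is used for the split points so the sizes are exact.

```python
import random

def split_622(samples, seed=0):
    """Shuffle the samples and split them into training, validation, and test sets (6:2:2)."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 6 // 10
    n_val = len(items) * 2 // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

# Hypothetical dataset of 100 sample identifiers.
train, val, test = split_622(range(100))
```

Any remainder from the integer division falls into the test set, so no sample is dropped.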
Algorithm 3: ResNet18-BiLSTM Model
Input: , , , , ,
Output: M
Set model parameters;
for each epoch do
    X = training set;
    for each batch_size in X do
        for each data point in batch_size do
            Compute convolution with 7 × 7 filters;
            Use Batch Normalization;
            Run through Max Pooling layer;
            Run through make_layer 1–4;
            Run through AdaptiveAvgPool2d layer;
            Reshape spatial dimensions into a sequence length;
            Use Equation (4) to calculate the forget gate at time t;
            Use Equation (5) to calculate the input gate at time t;
            Use Equation (6) to calculate the output gate at time t;
            Use Equations (7) and (8) to calculate the cell state at time t;
            Use Equations (9) and (10) to calculate the hidden state at time t;
            Run through flatten layer;
            Run through a densely connected layer;
        end
    end
    ResNet18-BiLSTM(validation set) // Evaluate validation set using the ResNet18-BiLSTM model;
end
Save M // Save the trained IoT device recognition model;
return M