5.1. Experimental Setup and Evaluation Metrics
To evaluate the effectiveness of our proposed SEED approach, we utilized the widely adopted Edge-IIoTset dataset [34], which contains traffic data from IoT and IIoT systems, including 14 distinct attack types along with normal traffic, yielding 15 classes in total.
We designed three evaluation scenarios:
Binary classification scenario: Distinguishes between benign and anomalous traffic.
Six-category scenario: Groups traffic into six intermediate categories (five attack categories plus normal traffic).
Fifteen-class scenario: Performs fine-grained classification across all 14 attack types plus normal traffic.
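The hierarchical label mapping behind the three scenarios can be sketched as follows. The category names and groupings below are assumptions based on the usual Edge-IIoTset taxonomy, not an exact reproduction of the dataset's label strings:

```python
# Illustrative hierarchical label mapping (assumed groupings following the
# common Edge-IIoTset taxonomy; the dataset's exact label strings may differ).
FIFTEEN_TO_SIX = {
    "Normal": "Normal",
    "DDoS_UDP": "DDoS", "DDoS_ICMP": "DDoS", "DDoS_TCP": "DDoS", "DDoS_HTTP": "DDoS",
    "SQL_injection": "Injection", "XSS": "Injection",
    "Uploading": "Injection", "Password": "Injection",
    "Backdoor": "Malware", "Ransomware": "Malware",
    "Port_Scanning": "Information_Gathering",
    "Vulnerability_scanner": "Information_Gathering",
    "Fingerprinting": "Information_Gathering",
    "MITM": "MITM",
}

def to_six(label):
    """Map a fine-grained 15-class label to one of the six categories."""
    return FIFTEEN_TO_SIX[label]

def to_binary(label):
    """Map any label to the binary benign/anomalous scenario."""
    return "Benign" if label == "Normal" else "Anomalous"
```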
The first two scenarios (binary and six-category) are intended for execution at the IoT level, leveraging the compact neural network, while the fifteen-class scenario is handled by the large edge-based model due to its higher computational requirements.
Figure 4 illustrates the hierarchical mapping of attack techniques included in the dataset to broader threat categories and finally to anomaly classifications. The visualization demonstrates how individual attacks are grouped into intermediate categories, which are then aggregated into the final classification outputs.
The six-category scenario is particularly relevant at the IoT level because it captures the most critical IoT threats while remaining computationally efficient for edge devices. This approach balances fine-grained detection with the limited memory and processing capabilities of IoT hardware, providing broad coverage of major IoT threat vectors without overwhelming resource-constrained devices.
Table 3 summarizes the configurations of our SEED models. The EdgeBERT model consists of 4 encoder layers with 256 hidden units, 512 intermediate units, 4 attention heads, and 2 fully connected layers, resulting in over 10 million parameters [23]. The IoT-level classifiers are compact, consisting of two fully connected layers with batch normalization and dropout; the binary and six-category models contain only 33k and 34k parameters, respectively, making them suitable for deployment on resource-constrained IoT devices.
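Assuming 256-dimensional input embeddings and a hidden width of 128 (the hidden width is our assumption, not stated here), a two-layer head with batch normalization lands very close to the reported 33k/34k parameter counts:

```python
def classifier_params(d_in, d_hidden, n_classes):
    """Parameter count for a compact two-layer MLP head:
    Linear(d_in, d_hidden) + BatchNorm1d(d_hidden) + Linear(d_hidden, n_classes).
    Dropout contributes no parameters.
    """
    fc1 = d_in * d_hidden + d_hidden        # weights + bias
    bn = 2 * d_hidden                       # scale (gamma) + shift (beta)
    fc2 = d_hidden * n_classes + n_classes  # weights + bias
    return fc1 + bn + fc2

# With 256-dim EdgeBERT embeddings and a hypothetical hidden width of 128:
binary_params = classifier_params(256, 128, 2)  # ~33k parameters
six_params = classifier_params(256, 128, 6)     # ~34k parameters
```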
For the edge-level model training, we utilized a high-performance server equipped with an NVIDIA A100 GPU with 80 GB of memory, 12 CPU cores, and a total computing power of 312 TFLOPS. To accelerate the IoT-level model training experiments, we employed an NVIDIA T4 GPU with 16 GB of memory, 4 CPU cores, and a peak performance of 65 TFLOPS. It is worth noting that the T4 was used solely to speed up training; the IoT models are lightweight enough to be trained on a standard CPU, albeit with longer training times.
To evaluate the performance of our SEED models, we used standard classification metrics including accuracy, precision, recall, and F1-score. These metrics are defined as follows:
Accuracy measures the proportion of correctly classified instances among all samples:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision quantifies the proportion of correctly predicted positive instances among all instances predicted as positive:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall measures the proportion of correctly predicted positive instances among all actual positive instances:

$$\text{Recall} = \frac{TP}{TP + FN}$$

F1-score is the harmonic mean of precision and recall, providing a balanced metric for classification performance:

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Here, $TP$, $TN$, $FP$, and $FN$ represent the number of true positives, true negatives, false positives, and false negatives, respectively.
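As a minimal sketch, these metrics can be computed directly from the confusion counts (per class, one-vs-rest):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts,
    guarding against division by zero for empty denominators."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1
```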
5.2. Results and Discussion
The evaluation begins with assessing the performance of the EdgeBERT model. It was trained on the entire training set of the Edge-IIoTset dataset for approximately 15 min on an NVIDIA A100 GPU, corresponding to one epoch. For comparison, training the full BERT-base model on the same dataset would have required an estimated 3.5 h per epoch. After training, EdgeBERT achieved nearly 100% accuracy.
Figure 5 shows the variation of loss and accuracy every 1000 mini-batches during training. Owing to the large volume of training data, the model converged rapidly, effectively reaching convergence around the 4000th mini-batch. This fast convergence eliminated the need for additional epochs. Considering both training time and performance, EdgeBERT represents an effective trade-off, as the full BERT model with over 100 million parameters is not strictly necessary for intrusion detection in this context.
Figure 6 illustrates the confusion matrix for the 15-class classification scenario. The model is able to almost perfectly distinguish between all classes, although a small amount of confusion is observed between the ‘Fingerprinting’ and ‘DDoS-HTTP’ classes.
Table 4 presents the classification report for the Edge-level model. These results highlight the effectiveness of the transformer-based architecture in distinguishing and classifying a broad range of attack types. However, the Fingerprinting class exhibits the lowest performance, with an F1-score of 0.9673. This behavior is partly explained by its relatively small number of instances, which biases the model toward predicting the most semantically similar classes.
Interestingly, other minority classes such as MITM achieve strong performance despite their limited support. This indicates that class imbalance is not the only contributing factor; the semantic separability between classes also plays a crucial role. When two attack types share similar structural or behavioral patterns, the embedding space becomes less discriminative, making fine-grained separation more challenging. In the case of Fingerprinting, its proximity to certain DDoS-related patterns likely contributes to the observed confusion.
At the IoT level, the compact 137 KB neural network achieves an accuracy of 99.99% by following the process outlined in Algorithms 2 and 3. During inference, the IoT model queries the edge-based model to obtain embedding vectors, which are then used as input features for classification. These embeddings are rich and high-dimensional, capturing the semantic characteristics of network traffic, which enables the IoT model to effectively discriminate between different traffic categories.
This level of abstraction is particularly suitable for resource-constrained IoT devices. By relying on precomputed embeddings from the edge, the IoT model eliminates the need for local feature extraction and other computationally intensive operations required for fine-grained classification, while still maintaining near-perfect accuracy.
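The two-tier inference flow described above can be sketched as follows. Both components are mocked here for illustration: in the real pipeline, `edge_embed` is the EdgeBERT service reached over the network (per Algorithms 2 and 3), and `iot_classify` is the compact ~33k-parameter MLP, not a threshold rule:

```python
import random

def edge_embed(records):
    """Mock of the edge-side EdgeBERT service: returns one 256-dim
    embedding per traffic record. In deployment this runs on the edge
    server and involves a network round-trip."""
    return [[random.random() for _ in range(256)] for _ in records]

def iot_classify(embedding):
    """Mock of the IoT-level classifier: consumes a precomputed embedding
    and returns a coarse label. A stand-in for the compact MLP head;
    here we simply threshold one coordinate for illustration."""
    return "Anomalous" if embedding[0] > 0.5 else "Benign"

def iot_inference(records):
    """IoT-side loop: request embeddings from the edge in one batch, then
    classify each locally, avoiding any heavy local feature extraction."""
    embeddings = edge_embed(records)  # network round-trip in practice
    return [iot_classify(e) for e in embeddings]
```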
Figure 7 illustrates the model’s performance across the six-category scenario. Training the IoT model required approximately 20 min per epoch on a CPU, and convergence was achieved after only two epochs. This demonstrates the efficiency and feasibility of deploying EdgeBERT-derived embeddings on lightweight IoT models, achieving both high accuracy and low computational overhead.
An important consideration in this setup is the network data rate, which directly impacts inference latency. On average, the edge-based model generates embedding batches in 9 ms, while the IoT-level model performs classification in 70 ms on a 2.5 GHz CPU with 4 cores.
Table 5 summarizes these metrics. These latency magnitudes align well with real-world IoT requirements, ensuring rapid decision-making and responsiveness. However, a potential bottleneck in this pipeline is the network data rate, which is crucial for transferring the generated embeddings from the edge to the IoT devices. Insufficient throughput or unstable connectivity may increase transmission delays, limiting how quickly the IoT-level model can perform inference.
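A rough end-to-end latency budget combines the measured 9 ms edge embedding time and 70 ms IoT classification time with the transmission delay of the embedding batch. The link rate below is an assumption for illustration; each embedding is 256 FP32 values, i.e., 1 KB:

```python
def transmission_delay_ms(payload_bytes, rate_mbps):
    """Time to push payload_bytes over a link of rate_mbps (megabits/s)."""
    return payload_bytes * 8 / (rate_mbps * 1e6) * 1e3

def end_to_end_ms(n_embeddings, rate_mbps, edge_ms=9.0, iot_ms=70.0):
    """Embedding generation + transfer + IoT classification for one batch.
    edge_ms and iot_ms are the reported per-batch timings; the link rate
    is an illustrative assumption."""
    payload = n_embeddings * 256 * 4  # 256 FP32 values = 1 KB per embedding
    return edge_ms + transmission_delay_ms(payload, rate_mbps) + iot_ms

# E.g., a batch of 32 embeddings (32 KB) over a constrained 1 Mbps link:
# the ~262 ms transfer dominates the 79 ms of compute, illustrating why
# the network data rate can become the bottleneck.
```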
As shown in Figure 8, the IoT-level binary classifier further demonstrates the robustness of our scheme by accurately flagging anomalous traffic instances. This confirms that our proposed SEED framework is both computationally efficient and effective for real-time intrusion detection at the network edge.
While promising, the proposed method also presents several limitations that must be addressed for real-world deployment. A primary concern lies in the transmission of embeddings from the edge to IoT devices. Each embedding vector consists of 256 FP32 values (approximately 1 KB), which is lightweight in storage but may impose non-negligible communication demands, especially under constrained or unstable network conditions. Since real-world IoT networks are prone to latency, interference, and bandwidth fluctuations, maintaining continuous embedding transfers may become a performance bottleneck. Future work should therefore investigate mitigation strategies such as local caching, temporary buffering of embeddings, or adaptive communication schedules.
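One of the mitigation strategies mentioned above, local caching of embeddings, could look like the following sketch: a fixed-size LRU cache on the IoT device so that embeddings for recurring flows can be reused instead of re-requested from the edge. The flow-identifier keying scheme is an assumption for illustration:

```python
from collections import OrderedDict

class EmbeddingCache:
    """Fixed-size LRU cache letting an IoT device reuse embeddings for
    recurring flows instead of re-fetching them from the edge."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, flow_id):
        """Return the cached embedding, or None on a miss."""
        if flow_id not in self._store:
            return None
        self._store.move_to_end(flow_id)  # mark as recently used
        return self._store[flow_id]

    def put(self, flow_id, embedding):
        """Insert/refresh an entry, evicting the least recently used one
        when the cache exceeds its capacity."""
        self._store[flow_id] = embedding
        self._store.move_to_end(flow_id)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)
```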
Another important consideration relates to the communication strategy between IoT devices and the edge, particularly during Step 2 of the SEED pipeline illustrated in Figure 2. In our experimental setup, embeddings are requested in batches during inference to optimize throughput. However, determining the optimal batch size, and more generally when communication should be initiated, remains a non-trivial design choice. Excessively frequent communication may overload the network, whereas overly large batches may introduce latency. Exploring dynamic batching or event-triggered communication policies represents a promising direction for improving system efficiency.
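One possible instance of the dynamic batching idea above is a hybrid size-or-timeout flushing policy: a batch is dispatched either when it is full or when its oldest pending record has waited too long. The threshold values below are illustrative assumptions:

```python
import time

class EmbeddingBatcher:
    """Accumulate embedding requests and flush when the batch is full OR
    when the oldest pending record exceeds a deadline, trading network
    overhead against inference latency."""

    def __init__(self, max_batch=32, max_wait_s=0.05):
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self._pending = []
        self._oldest = None

    def add(self, record):
        """Queue a record; return a batch to dispatch if a flush is due,
        otherwise None."""
        if self._oldest is None:
            self._oldest = time.monotonic()
        self._pending.append(record)
        if (len(self._pending) >= self.max_batch
                or time.monotonic() - self._oldest >= self.max_wait_s):
            batch, self._pending, self._oldest = self._pending, [], None
            return batch
        return None
```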
Finally, the embeddings transmitted over the network may be susceptible to interception or reverse engineering, potentially revealing sensitive device metadata such as IP addresses or traffic characteristics. Because many IoT communication protocols are lightweight and lack robust built-in security mechanisms, incorporating additional protection layers (e.g., encryption, secure channels, or embedding obfuscation) is essential to preserve confidentiality and prevent data leakage. Addressing these challenges will be key to ensuring a secure and reliable deployment of the SEED framework in practical IoT environments.
It is also crucial to account for network reliability. Designing fallback mechanisms or failover strategies can help ensure continuous service availability, mitigating inference delays or failures caused by intermittent connectivity. Such measures are fundamental to maintaining robust and dependable intrusion detection in practical IoT deployments.
The above results demonstrate the viability of our method, though there remains significant room for improvement, as discussed in the results analysis. With the integration of these enhancements, particularly advancements in the communication strategy within the two-tier architecture, our approach will be able to operate more efficiently, robustly, and securely in real-world IoT environments.
While LLMs offer exceptional performance and significant advantages, they are not fully reliable, and blind reliance on their outputs can introduce risks. It is therefore essential to evaluate the ethical considerations and privacy implications associated with the use of LLMs in general and within cybersecurity applications in particular.