1. Introduction
The Internet of Things (IoT) is one of today’s most popular technologies. It spans everything from the smartphones in our pockets to the sensors deployed in industrial factories [1]. By 2025, it is expected that there will be more than 75 billion IoT devices worldwide [2]. In general, IoT refers to large numbers of heterogeneous devices, typically connected to the network via gateways, that communicate with each other and take appropriate actions based on the data they collect, analyze, and share.
IoT in smart homes emphasizes communication between individual nodes that collect and transmit sensitive data and control critical infrastructure. It operates across the three layers presented in Figure 1. Using sensors and actuators, the perception layer gathers and interprets data from the environment. The network layer uses technologies such as WiFi, LTE, Bluetooth, 3G, and Zigbee to transport and deliver data to IoT hubs and devices via the Internet. The application layer protects the authenticity, integrity, and secrecy of data in order to achieve the goal of establishing a smart environment. Each layer contains security flaws.
In our work, we focus on the network layer and on how to protect the smart home network from intruders and eavesdroppers. Ensuring security and privacy requires authenticating the nodes that communicate with each other or with gateways [3]. However, the limited processing power and resources of these nodes, together with the heterogeneity of the connected devices, make securing the IoT challenging.
One of the primary goals in IoT security is to keep essential assets safe from unauthorized access and to ensure the integrity of data and systems by verifying the identity of entities trying to access resources or perform actions. Traditional node identifiers that rely on MAC addresses, IP addresses, Bluetooth IDs, and Zigbee IDs can be easily forged [4]. The use of cryptographic techniques for identification and authentication is an effective way to ensure confidentiality and authentication. However, IoT devices often rely on smaller cryptographic keys because of their limited capabilities, making them more susceptible to attack [5]. Identification of devices within the network plays an important role in preventing unauthorized access to or misuse of network resources. Since the early days of the Internet, network traffic classification has been a key problem, with methodologies ranging from port-based to statistical and behavioral methods. Internet Service Providers (ISPs) have prioritized network traffic classification in order to manage network performance and security [6]. Because of the significance of privacy and security for providers and end users, traffic classification has gained traction in the IoT through integration with machine learning, and it has been extensively employed in numerous studies of IoT systems [7].
Supervised [3,8], unsupervised [5], and deep learning [9,10] methods have all been used for classification and identification. Supervised machine learning has shown high accuracy when identifying devices. However, supervised approaches face several challenges [11], which can be summarized as follows: the model must be retrained whenever a new device type is added to the network or the behavior of a device changes legitimately; a large, labeled, and balanced dataset is required to avoid overfitting during training; and supervised methods cannot recognize unknown traffic, which is critical for detecting new types of devices and traffic.
Due to the limited availability of publicly accessible IoT network traffic datasets, researchers face significant challenges in training and evaluating device identification models. Deploying physical IoT devices to collect real traffic is possible but often costly and difficult to scale to networks with hundreds or thousands of heterogeneous devices, while privacy and confidentiality concerns further restrict the sharing of raw captures [12]. In addition, class imbalance is a well-recognized challenge in IoT device identification: traffic from popular or always-on devices (such as cameras or hubs) heavily dominates that of less frequently active or niche devices (such as specific sensors or appliances). This skewed distribution biases classifiers toward majority device types and leads to poor generalization and low identification accuracy for minority devices, even though these rare devices may be operationally important [10].
Traditional oversampling methods such as SMOTE attempt to mitigate class imbalance by interpolating synthetic minority samples in the feature space [13]. However, recent work has shown that SMOTE and its variants can fail to capture complex, non-linear, and multimodal distributions; may ignore local density structure; and can place synthetic samples in low-density or overlapping regions, which degrades classifier performance in high-dimensional and noisy settings [13]. As an alternative, generative models based on Generative Adversarial Networks (GANs) have gained traction for data augmentation in cybersecurity and network traffic modelling, including applications to class-imbalanced traffic datasets [14]. While these approaches can better approximate high-dimensional distributions, many GAN-based traffic generators still treat flows as flat feature vectors and do not explicitly encode protocol rules, device-specific behavior, or higher-order dependencies between features, which may result in synthetic traffic that matches marginal statistics but lacks semantic consistency and protocol fidelity [15].
Within the family of generative models, Wasserstein Generative Adversarial Networks (WGANs) have been shown to improve training stability and sample quality by optimising the Wasserstein distance between real and generated distributions [16]. Nevertheless, conventional WGAN variants still lack mechanisms to integrate domain-specific knowledge and complex structural dependencies between network traffic features, which can lead to synthetic samples that do not fully respect protocol constraints or realistic IoT device behavior. Recent IIoT traffic fingerprinting studies also report a clear distribution gap between synthetic traffic and real Internet-facing deployments, especially in terms of timing and flow dynamics [17]. Our generative model partly addresses this gap by constraining generation with statistics and semantic rules extracted from real IoT traffic.
To address these challenges, this research presents an integrated framework that combines graph-conditioned WGANs with dynamic guidance derived from LLMs to generate synthetic IoT network traffic that is both realistic and semantically contextual. The proposed framework relies on a feature relationship graph, constructed using a combination of Pearson and Spearman correlation coefficients along with mutual information, to capture meaningful structural dependencies and guide the generation process. In addition, LLMs are used to derive class-specific semantic constraints, including numeric ranges, attribute associations, and protocol rules, thereby improving the plausibility and interpretability of the generated samples. An LLM-based validation mechanism is further employed to assess the reasonableness of the synthetic traffic and its relevance to practical IoT scenarios.
Our investigation offers contributions to both IoT device identification and techniques for handling imbalanced datasets. The main contributions of this work are summarized as follows:
It presents a comprehensive analysis of machine learning-based approaches for identifying IoT devices using network traffic features.
It proposes a novel data balancing framework based on Wasserstein Generative Adversarial Networks (WGANs) to generate realistic synthetic samples for underrepresented device classes.
It demonstrates the effectiveness of the WGAN-based approach in improving classification performance through extensive experimentation and comparison with traditional balancing techniques such as SMOTE.
The remainder of this paper is structured as follows: Section 2 presents a review of the related literature. Section 3 describes the materials and methods employed in this study. The experimental results are detailed in Section 4, followed by an in-depth discussion in Section 5. Finally, Section 6 concludes the paper.
4. Experiments and Results
The experiments were conducted to evaluate the effectiveness of the proposed constraint-guided WGAN framework, which integrates network graph-based feature relationships and LLM-derived semantic constraints for synthetic network traffic generation. The experimental workflow follows a multi-stage pipeline that begins with the extraction of raw features from captured PCAP files and proceeds through preprocessing, graph construction, constraint extraction, and WGAN training. The process starts with raw network traffic data, where Scapy is used to extract flow-level and packet-level statistical features. The data preprocessing stage detects numerical and categorical attributes, scales values to the range [−1, 1], and encodes categorical features for model compatibility. Next, the network graph construction module computes Pearson, Spearman, and mutual information correlations among features to build a feature relationship graph, which helps identify relevant dependencies. A constraint extraction step is then performed using GPT-4, producing a structured JSON file that defines valid feature ranges, allowed categorical values, and correlation patterns for each device class. These constraints are subsequently integrated into the constraint-guided WGAN training, where the generator is regularized through penalty terms that enforce semantic and statistical consistency. Finally, the trained generator produces a synthetic dataset that is inverse-transformed back to the original feature scale and balanced across device classes for machine learning evaluation.
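As an illustration of the preprocessing stage, the following minimal Python sketch scales numeric attributes to [−1, 1] and one-hot encodes categorical ones. The column names (`pkt_len_mean`, `flow_duration`, `protocol`) are hypothetical placeholders, not the exact features used in our pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def preprocess(df: pd.DataFrame):
    """Scale numeric features to [-1, 1] and one-hot encode categoricals."""
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    cat_cols = [c for c in df.columns if c not in num_cols]

    scaler = MinMaxScaler(feature_range=(-1, 1))
    X_num = scaler.fit_transform(df[num_cols])

    # One-hot encode the detected categorical attributes
    X_cat = pd.get_dummies(df[cat_cols]).to_numpy(dtype=float)
    return np.hstack([X_num, X_cat]), scaler

# Toy flow table with hypothetical feature names
flows = pd.DataFrame({
    "pkt_len_mean": [120.0, 640.0, 88.0],
    "flow_duration": [0.5, 3.2, 0.1],
    "protocol": ["TCP", "UDP", "TCP"],
})
X, scaler = preprocess(flows)
print(X.shape)  # (3, 4): 2 scaled numeric columns + 2 one-hot protocol columns
```

Keeping a reference to the fitted scaler is what makes the final inverse transform back to the original feature scale possible after generation.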
The following subsection provides a more detailed explanation of the synthetic data generation process, including the integration of the network graph, LLM-based constraint extraction, and WGAN training procedure.
4.1. Network Graph Construction Results
As mentioned earlier, the network graph visualizes the pairwise relationships among the extracted features.
Figure 6 shows the 20 most central features extracted from our IoT traffic dataset, where node size and color intensity encode the combined centrality score of each feature, and edge thickness reflects the strength of pairwise correlations. The topology not only reveals which features are most influential and tightly coupled but also guides our feature selection: we prioritize nodes with both high centrality and strong inter-feature correlations to construct more robust classification models.
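The graph construction can be sketched as follows. The edge threshold and the equal-weight combination of the three association measures are illustrative assumptions, not the exact configuration of our module:

```python
import numpy as np
import pandas as pd
import networkx as nx
from sklearn.feature_selection import mutual_info_regression

def build_feature_graph(df: pd.DataFrame, threshold: float = 0.5) -> nx.Graph:
    """Combine |Pearson|, |Spearman|, and normalized mutual information
    into a single edge weight; keep edges above a threshold."""
    pearson = df.corr(method="pearson").abs()
    spearman = df.corr(method="spearman").abs()

    cols = list(df.columns)
    G = nx.Graph()
    G.add_nodes_from(cols)
    for i, a in enumerate(cols):
        mi = mutual_info_regression(df[cols], df[a], random_state=0)
        mi = mi / (mi.max() + 1e-9)  # normalize MI to [0, 1]
        for j, b in enumerate(cols):
            if j <= i:
                continue
            w = (pearson.loc[a, b] + spearman.loc[a, b] + mi[j]) / 3.0
            if w >= threshold:
                G.add_edge(a, b, weight=w)
    return G

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({"f1": x,
                   "f2": 2 * x + rng.normal(scale=0.1, size=200),  # tied to f1
                   "f3": rng.normal(size=200)})                    # independent
G = build_feature_graph(df)
centrality = nx.degree_centrality(G)  # basis for node size/color in the plot
print(G.has_edge("f1", "f2"))  # strongly dependent pair is linked
```

In this sketch `f1`/`f2` are strongly dependent and become connected, while the independent `f3` stays isolated; centrality over the resulting graph is then what ranks features for selection.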
4.2. Synthetic Dataset Generation Model with Network Graph and LLM
We used a dataset of 97 features and 27 device classes. The adjacency matrix representing feature relationships had a size of (97, 97), ensuring the preservation of statistical and domain-specific dependencies during data generation. The constraints derived from the LLM were extracted for each device class, ensuring that the generated samples adhere to protocol-specific rules, feature correlations, and numerical constraints.
Figure 7 illustrates an example of the extraction of numerical, correlational, and protocol constraints for the D-LinkSwitch class. After sending the prompt, our parser automatically identified and validated three sections (numerical_constraints, feature_correlations, and protocol_rules), which were fed into the downstream traffic-generation module.
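A minimal sketch of the parsing and validation step follows. The constraint file content is a hypothetical example built around the three section names just mentioned, not the actual GPT-4 output for this class:

```python
import json

# Hypothetical constraint file for one device class; section names follow
# the structure our parser expects (values here are illustrative only).
constraints_json = """
{
  "D-LinkSwitch": {
    "numerical_constraints": {"pkt_len_mean": {"min": 60, "max": 1514}},
    "feature_correlations": [["pkt_len_mean", "flow_bytes", "positive"]],
    "protocol_rules": {"protocol": ["TCP", "UDP"]}
  }
}
"""

REQUIRED_SECTIONS = ("numerical_constraints", "feature_correlations", "protocol_rules")

def parse_constraints(raw: str) -> dict:
    """Validate that every device class defines all three constraint sections."""
    spec = json.loads(raw)
    for device, sections in spec.items():
        missing = [s for s in REQUIRED_SECTIONS if s not in sections]
        if missing:
            raise ValueError(f"{device}: missing sections {missing}")
    return spec

spec = parse_constraints(constraints_json)
print(sorted(spec["D-LinkSwitch"]))  # the three validated section names
```

Rejecting malformed LLM output at this stage keeps invalid rules from silently weakening the generator's penalty terms later in the pipeline.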
4.2.1. WGAN Model (Constraint-Guided)
The WGAN model was trained over 100 epochs, and the critic network was updated five times per generator update (CRITIC_STEPS = 5) to maintain the balance between critic and generator learning. The critic and generator losses were monitored throughout training to evaluate convergence. The key training parameters of the WGAN model are presented in
Table 4.
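The training loop can be sketched as follows. The network sizes, penalty weights, and the simple range rule standing in for the LLM-derived constraints are toy assumptions; a WGAN-GP-style critic with a gradient penalty is assumed here:

```python
import torch
import torch.nn as nn

LATENT, FEATS, CRITIC_STEPS, LAMBDA_GP, LAMBDA_C = 16, 8, 5, 10.0, 1.0

gen = nn.Sequential(nn.Linear(LATENT, 32), nn.ReLU(), nn.Linear(32, FEATS), nn.Tanh())
critic = nn.Sequential(nn.Linear(FEATS, 32), nn.ReLU(), nn.Linear(32, 1))
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-4, betas=(0.5, 0.9))
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-4, betas=(0.5, 0.9))

def gradient_penalty(real, fake):
    """WGAN-GP penalty: push the critic's gradient norm toward 1."""
    eps = torch.rand(real.size(0), 1)
    mixed = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(critic(mixed).sum(), mixed, create_graph=True)[0]
    return ((grad.norm(2, dim=1) - 1) ** 2).mean()

def constraint_penalty(fake, lo=-0.9, hi=0.9):
    # Placeholder range rule standing in for the full constraint set:
    # penalize features falling outside the allowed interval.
    return (torch.relu(lo - fake) + torch.relu(fake - hi)).mean()

real_data = torch.tanh(torch.randn(256, FEATS))  # toy stand-in for scaled flows

for step in range(20):
    for _ in range(CRITIC_STEPS):  # critic updated 5x per generator update
        idx = torch.randint(0, real_data.size(0), (64,))
        real = real_data[idx]
        fake = gen(torch.randn(64, LATENT)).detach()
        loss_c = (critic(fake).mean() - critic(real).mean()
                  + LAMBDA_GP * gradient_penalty(real, fake))
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()

    fake = gen(torch.randn(64, LATENT))
    loss_g = -critic(fake).mean() + LAMBDA_C * constraint_penalty(fake)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

samples = gen(torch.randn(10, LATENT)).detach()
print(samples.shape)  # torch.Size([10, 8])
```

The constraint term enters only the generator's objective, so the critic keeps estimating the Wasserstein distance while the generator is steered toward semantically valid regions.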
In the 2D PCA diagram in Figure 8a, the real and generated samples show substantial overlap, appearing as two largely congruent point clouds with only minor deformations. The generated distribution is slightly elongated, effectively stretching to span low-density gaps and cover sparsely populated (rare) regions, but without introducing pronounced shifts or artifacts. Taken together, this visualization indicates that the WGAN has reproduced the global variance (covariance) structure of the data with high fidelity, capturing the principal components and overall manifold while exhibiting only small calibration differences.
For the t-SNE in
Figure 8b, the real and generated samples largely occupy the same regions, forming grouped clusters that align across both datasets. The synthetic clusters occasionally appear slightly denser or sparser than their real counterparts, reflecting intentional coverage of class imbalance, yet the overall map reads as a homogeneous mixture rather than separated swaths. This pattern indicates that, at the local-neighborhood scale emphasized by t-SNE, the generated data have effectively captured the same underlying manifold as the real data. In the correlation heatmaps in Figure 9, the real and generated matrices appear nearly indistinguishable: strong and weak correlations recur in the same blocks, and the fine-grained patterns are closely mirrored across features. This close visual agreement indicates that the WGAN has effectively captured the covariance and inter-feature dependency structure, reproducing both the dominant blocks and the subtle relationships with near-perfect fidelity. As shown in
Figure 10, the real (purple) distribution is highly skewed, whereas the WGAN-balanced set (pink) is approximately uniform across classes.
4.2.2. CTGAN (Constraint-Guided) Baseline
To test whether our findings depend on the generator family, we evaluated CTGAN under the same pipeline used for the WGAN. All constraints were extracted from the training data using an LLM: the LLM consumed feature statistics and network graph relationships and output a structured rule set (empirical value ranges, valid category sets, formatting/type checks, and cross-feature relations). These data-derived constraints were saved to a JSON specification and kept fixed for all models. CTGAN uses the same condition vector as the WGAN (domain/task conditions plus constraint tokens), and the same preprocessing, train/validation split, batch size, learning rates, and optimizer. We also applied the same enforcement: violations during training increase the generator’s objective, and a light rejection step filtered invalid samples at inference using the same tolerance.
Table 5 summarizes the primary CTGAN settings.
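The rejection step at inference can be sketched as follows. The per-feature ranges and the relative tolerance value are hypothetical examples, not the constraints actually extracted for our device classes:

```python
import numpy as np

# Hypothetical per-feature range constraints (as derived from the LLM spec)
RANGES = {"pkt_len_mean": (60.0, 1514.0), "flow_duration": (0.0, 3600.0)}
COLS = list(RANGES)

def reject_invalid(samples: np.ndarray, tolerance: float = 0.05) -> np.ndarray:
    """Keep only samples whose features lie within the allowed range,
    widened by a relative tolerance, mirroring the light rejection step."""
    mask = np.ones(len(samples), dtype=bool)
    for j, col in enumerate(COLS):
        lo, hi = RANGES[col]
        span = hi - lo
        mask &= (samples[:, j] >= lo - tolerance * span)
        mask &= (samples[:, j] <= hi + tolerance * span)
    return samples[mask]

batch = np.array([[100.0, 1.0],      # valid sample
                  [2000.0, 1.0],     # pkt_len_mean out of range -> rejected
                  [500.0, -400.0]])  # flow_duration out of range -> rejected
kept = reject_invalid(batch)
print(len(kept))  # 1
```

Because the same tolerance is used for both generators, the filter does not advantage either model in the comparison.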
For the CTGAN results, the two-dimensional PCA projection in
Figure 11a shows that real and synthetic samples occupy a broadly similar region of the feature space, with both clouds aligned along the same dominant manifold. However, the synthetic distribution appears slightly more compact and smoothed, with reduced spread in the extreme regions, indicating that CTGAN captures the main axes of variance but underestimates some of the tail behavior present in the real data. A complementary t-SNE visualization in
Figure 11b confirms this pattern at the local scale: synthetic points are interleaved with real samples within most clusters, but some clusters are more diffuse or partially merged, suggesting a tendency to regularize highly sparse or imbalanced regions. The correlation heatmaps in
Figure 12b further illustrate that CTGAN reproduces the overall block structure of inter-feature dependencies, with strong and weak correlations recurring in the same groups of features, while a subset of blocks exhibits attenuated correlation magnitudes. Taken together, these qualitative analyses indicate that CTGAN provides a reasonable but slightly over-smoothed approximation of the real traffic distribution, capturing the dominant covariance structure while losing some of the fine-grained variability that is better preserved by the WGAN model.
4.2.3. WGAN vs. CTGAN
To better understand the behavior of the two generative models, we qualitatively compared WGAN and CTGAN along three complementary views: global variance structure (PCA), local neighborhood structure (t-SNE), and inter-feature dependencies (correlation heatmaps). Both models are able to place synthetic samples in the same broad region of the feature space as the real traffic, indicating that they capture the dominant modes of variation. However, consistent differences emerge when inspecting the fine-grained alignment between the real and generated distributions.
For the WGAN, the PCA projection shows a strong overlap between the real and synthetic samples along the main manifold, with only minor elongations at the tails, while the t-SNE plot reveals interleaved clusters rather than separated real–synthetic islands. The corresponding correlation heatmaps for WGAN are also closely matched to the real matrix, reproducing the block structure and most of the strong/weak correlations. In contrast, CTGAN produces a slightly more compact and smoothed distribution in PCA space, with underestimation of some extreme regions, and t-SNE clusters that are sometimes more diffuse or partially merged. Its correlation heatmaps recover the main blocks of dependencies but tend to attenuate some of the correlation magnitudes, particularly in sparse or highly imbalanced regions.
Overall, these observations suggest that while CTGAN provides a reasonable approximation of the real traffic and is effective at regularizing very sparse regions, WGAN offers a closer match to both the global manifold and the local cluster structure, and more faithfully preserves inter-feature dependencies. For this reason, we adopt WGAN as the primary generator in the remainder of our experiments, and use CTGAN mainly as a comparative baseline.
4.3. Machine Learning Performance
The evaluation was conducted on the machine learning models mentioned in the previous section using the preprocessed dataset. The original real traffic dataset was first partitioned into three disjoint subsets using a stratified split on the device label, yielding 67,734 samples for training, 16,934 for validation, and 21,167 for testing, as reported in
Table 6. Stratification ensures that the real data for each device type are proportionally represented in all three subsets and reduces the risk of bias toward any particular class.
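The stratified three-way partition can be reproduced with scikit-learn as in the sketch below; the toy data, the random seed, and the exact split fractions (chosen to approximate the roughly 64/16/20 ratio of Table 6) are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 4, size=1000)  # toy device-type labels

# First carve out the test set, then split the remainder into train/validation,
# stratifying on the device label each time to preserve class proportions.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.20, stratify=y_tmp, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 640 160 200
```

Splitting before any augmentation is what guarantees the three subsets stay disjoint with respect to real traffic.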
After this initial split, we employed the WGAN-generated samples to balance the label distribution within each subset. For every device class in the training, validation, and test sets, synthetic flows were added up to a fixed per-class target so that all the classes are represented with (approximately) equal frequency. This procedure has three benefits: (i) each subset contains a mixture of genuine and generated traffic for every device type; (ii) the class distribution within each subset is effectively balanced; and (iii) the separation between training, validation, and test sets is preserved, preventing information leakage. The balanced training and validation subsets are then used to fit and tune the machine learning models, while the balanced test subset is used exclusively for final performance evaluation.
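The per-class top-up described above can be sketched as follows; `generate_fn` is a placeholder standing in for sampling from the trained, class-conditioned WGAN, and the class counts are toy values:

```python
import numpy as np
from collections import Counter

def balance_with_synthetic(X, y, generate_fn, target_per_class):
    """Top up each class with synthetic samples until it reaches the target.
    `generate_fn(cls, n)` stands in for drawing n flows from the WGAN."""
    X_parts, y_parts = [X], [y]
    for cls, count in Counter(y).items():
        deficit = target_per_class - count
        if deficit > 0:
            X_parts.append(generate_fn(cls, deficit))
            y_parts.append(np.full(deficit, cls))
    return np.vstack(X_parts), np.concatenate(y_parts)

rng = np.random.default_rng(0)
X = rng.normal(size=(130, 4))
y = np.array([0] * 100 + [1] * 20 + [2] * 10)      # skewed class counts
fake_gen = lambda cls, n: rng.normal(size=(n, 4))  # placeholder generator
Xb, yb = balance_with_synthetic(X, y, fake_gen, target_per_class=100)
print(sorted(Counter(yb).values()))  # [100, 100, 100]
```

Applying this routine independently to the train, validation, and test subsets is what keeps real flows from leaking across the split while still balancing every subset.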
The performance of each model was assessed using accuracy, precision, recall, and F1-score, along with computational efficiency in terms of training time and memory consumption.
Table 7 summarizes the overall test performance of the models.
On the WGAN-balanced dataset, Random Forest achieved the best overall performance (accuracy = 0.9409, precision = 0.9418, recall = 0.9409, F1 = 0.9402). It also trained the fastest (15.37 s) but consumed the most memory (2558.66 MB). The neural network ranked second (accuracy = 0.9083, precision = 0.9093, recall = 0.9083, F1 = 0.9074) with a training time of 139.41 s and memory use of 443.39 MB. SVM recorded the lowest scores (accuracy = 0.8899, precision = 0.8903, recall = 0.8899, F1 = 0.8884), required the longest training time (26,513.04 s), and used the least memory (10.72 MB).
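For reference, the reported metrics can be computed with scikit-learn as in the sketch below; weighted averaging is assumed here, and the label arrays are toy values rather than actual model outputs:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = np.array([0, 0, 1, 1, 2, 2, 2, 0])
y_pred = np.array([0, 0, 1, 2, 2, 2, 2, 1])  # toy predictions

acc = accuracy_score(y_true, y_pred)
# Weighted averaging accounts for per-class support in the multi-class setting
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
```

With weighted averaging, recall coincides with accuracy in the multi-class case, which matches the pattern visible in the tables.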
In addition, the SMOTE technique was applied to the dataset to address class imbalance, followed by classification using various machine learning models. The performance of each model was comprehensively evaluated using multiple metrics, including accuracy, precision, recall, F1-score, training time, and memory usage. The results are summarized and compared in the corresponding evaluation
Table 8, which provides a comparative analysis of three machine learning models applied to a SMOTE-balanced data set.
On the SMOTE-balanced dataset, Random Forest achieved the best results (accuracy = 0.9246, precision = 0.9394, recall = 0.9246, F1 = 0.9283). It also trained the fastest (88.60 s) but used the most memory (177.64 MB). The neural network lagged behind (accuracy = 0.3492, precision = 0.4852, recall = 0.3492, F1 = 0.2842) with a training time of 142.17 s and memory of 82.13 MB. SVM performed worst (accuracy = 0.0618, precision = 0.0038, recall = 0.0618, F1 = 0.072), took the longest to train (1297.76 s), and used 92.10 MB of memory.
The confusion matrices in
Figure 13 show a strong main diagonal (high per-class accuracy), with most errors concentrated among a small family of look-alike devices, chiefly D-LinkSensor, D-LinkWaterSensor, D-LinkSwitch, and the sibling plugs TP-LinkPlugHS110 and TP-LinkPlugHS100, which are frequently confused with one another. Random Forest yields the cleanest diagonal and the narrowest spill within this D-Link/TP-Link cluster; the neural network is close but exhibits slightly more bleed, while SVM shows the broadest cross-class confusion. Outside this cluster, several classes are classified almost perfectly, notably MAXGateway, HomeMaticPlug, Withings, iKettle2, and D-LinkDoorSensor. This pattern aligns with the aggregate metrics (Random Forest best on accuracy/F1) and suggests that the remaining errors stem from intra-brand similarities, motivating a hierarchical family-first classifier or richer discriminative features to separate the D-Link/TP-Link subgroup.
5. Discussion
The main objective of this work is to investigate whether a graph and LLM-guided WGAN can provide high-quality synthetic IoT traffic for addressing severe class imbalance in device identification. The results across both qualitative and quantitative evaluations indicate that the proposed framework does not merely increase the number of minority samples, but also preserves essential structural properties of the real traffic in a way that is beneficial for downstream classifiers.
5.1. Qualitative Behavior of WGAN vs. CTGAN
The qualitative comparison between the proposed WGAN configuration and a CTGAN baseline, using two-dimensional PCA projections, t-SNE embeddings, and feature–feature correlation heatmaps, shows that both generators are capable of producing plausible synthetic flows that occupy the same broad region of the feature space as the real data. However, the WGAN exhibits a visibly tighter alignment with the real traffic: in PCA space, real and WGAN-generated points largely overlap along the main manifold; in the t-SNE plots, clusters associated with different device types appear interleaved rather than separated into real versus synthetic islands; and in the correlation heatmaps, the WGAN closely reproduces the block structure and relative strength of inter-feature dependencies. By contrast, the CTGAN baseline tends to produce a smoother and more compact distribution, slightly under-representing extreme regions and weakening some correlations, especially for device types with very few real samples. These qualitative observations motivate the choice of the WGAN as the primary generator for constructing the balanced datasets used in the subsequent classification experiments, while CTGAN is retained as a secondary qualitative baseline.
5.2. Impact of WGAN-Based Balancing on Classification Performance
The classification results on balanced datasets highlight the effectiveness of the proposed WGAN-based balancing strategy compared with a conventional oversampling method such as SMOTE. When training on WGAN-balanced data, the evaluated machine learning models consistently achieve higher macro-F1 and balanced accuracy than when trained on SMOTE-balanced data, confirming that the quality and structure of synthetic flows are at least as important as their quantity. The WGAN-generated samples provide richer information about minority device types, enabling the classifiers to better separate classes that are severely under-represented in the original traffic.
Across all the regimes, Random Forest emerges as the most effective classifier, delivering the highest overall performance and the most stable behavior across metrics. The neural network remains competitive, particularly on the WGAN-balanced data, while SVM systematically lags behind on this high-dimensional, multi-class device-identification task. These trends are consistent with the intuition that ensemble and deep models are better able to exploit the nuanced, graph and constraint-preserving structure present in the WGAN-generated flows, whereas SVM is more sensitive to residual overlap and class complexity.
From a computational perspective, the WGAN-balanced regime is also attractive in practice. For example, when trained on WGAN-generated balanced data, the Random Forest model achieves superior predictive performance while requiring substantially less training time than on SMOTE-balanced data (15.37 s versus 88.60 s), which is an important consideration for large-scale or frequently retrained deployments. Although the WGAN-based setting incurs a higher memory footprint for Random Forest (2558.66 MB versus 177.64 MB), this overhead is often acceptable in modern computing environments given the corresponding gains in accuracy and robustness. Overall, these results indicate that the combination of a feature relationship graph, LLM-guided semantic constraints, and a WGAN backbone yields a synthetic data generation mechanism that not only corrects class imbalance, but also enhances the discriminative power and efficiency of standard machine learning models for IoT device identification.
5.3. Limitations and Future Work
Although the proposed framework already shows clear advantages over SMOTE and a CTGAN baseline, several aspects remain open for further strengthening. First, we evaluate the current system on a stationary traffic snapshot and therefore do not explicitly measure long-term data drift, even though the design naturally allows re-estimating the feature relationship graph and recomputing the LLM-derived constraints on new traffic windows, followed by retraining the WGAN, to adapt to evolving behavior in future deployments. Second, while the LLM-guided constraints are derived only from aggregated feature descriptions and never access raw payloads, integrating formal privacy-preserving and robust LLM techniques would provide additional assurance in security-critical settings. Third, it will be important to analyse how the generated traffic behaves under adversarial conditions, where an attacker might attempt to exploit synthetic samples to evade detection or poison downstream models. Exploring these directions will further enhance the reliability and impact of the proposed approach in AI-assisted cybersecurity and imbalanced network data analysis.