1. Introduction
The number of connected IoT devices has grown rapidly in recent years. Recent estimates indicate that approximately 18.8 billion devices were connected worldwide by the end of 2024, with projections suggesting a doubling to nearly 40 billion devices by 2030 [1]. This rapid growth has produced a highly heterogeneous ecosystem of constrained devices with limited computational capabilities, minimal firmware support, and reduced ability to implement advanced security protocols. These limitations shift the responsibility of securing such devices to the surrounding network infrastructure, which must provide robust mechanisms to classify and monitor them and to detect anomalies in their behavior. As IoT deployments scale across industrial, enterprise, and smart home environments, maintaining device visibility becomes essential for enforcing reliable security controls.
Device classification, the process of identifying the type or group of an IoT device [2], has emerged as a critical enabler for network-based security frameworks. Accurate classification enables the system to distinguish between legitimate and unauthorized (unknown) devices, thereby serving as a gatekeeping function at the network perimeter. Once a device is classified, this classification can be leveraged to enforce Zero Trust principles [3], such as granting only the minimal set of permissions required for its operation (Least Privilege) and automatically denying access to unrecognized or suspicious devices. Zero Trust is a security framework that challenges the traditional concept of trusting entities within a network perimeter. Instead, it is based on the idea of “never trust, always verify” [3,4], ensuring that every access request is authenticated and authorized, regardless of its origin. Device classification enables the application of Zero Trust in IoT environments [5], where many devices lack advanced authentication capabilities.
One standardized approach to leveraging device classification for access control is the Manufacturer Usage Description (MUD) framework, defined in IETF RFC 8520 [6]. MUD allows IoT device manufacturers to provide a machine-readable policy that specifies the device’s expected network behavior, such as permitted protocols, ports, and destinations. By using these predefined classes, network devices can automatically enforce least-privilege connectivity and block unexpected traffic without requiring device-specific manual configuration. MUD can significantly reduce the IoT attack surface by ensuring that each device communicates only as intended, while deviations may indicate misconfiguration or compromise. In cases where IoT devices do not emit a MUD URL via DHCP, LLDP, or certificate fields, NIST SP 1800-15 [7] suggests manual association of MUD profiles based on device identifiers (e.g., MAC address). This mechanism ensures the device can still be governed by least-privilege access policies derived from the MUD profile even when automatic URL signaling is unavailable. However, MAC addresses can be easily spoofed, which necessitates the use of more advanced and passive classification methods that classify devices based on their traffic patterns and behavioral characteristics rather than solely on declared identifiers.
Figure 1 illustrates a multi-layered security approach for IoT systems, emphasizing the integration of device profiling into network-based defenses. The lower layers focus on device identification, which can be achieved through operating system messages, installed agents, or authentication parameters. However, many IoT devices lack these capabilities, making network-level observation essential for system protection. Device classification can therefore rely on discovery protocols, active probing, or passive traffic analysis, enabling the identification of device types without requiring on-device instrumentation.
Once devices are classified, their profiles can feed into the upper layers of the defense system, where profiling information supports network traffic classification (NTC) and anomaly detection mechanisms. This integration allows the security framework to detect and reject unknown devices, enforce traffic policies on known devices, and strengthen IoT resilience against both external and internal threats as part of a comprehensive multi-layered defense strategy.
Standards such as IEEE 802.1X [8] and IEEE 802.1AR [9] provide foundational mechanisms for authenticating IoT devices at the network edge. IEEE 802.1X enables port-based access control, typically using the Extensible Authentication Protocol (EAP), to ensure that only authenticated devices can communicate on the network. IEEE 802.1AR, in turn, specifies the use of secure device identities (DevIDs) embedded in hardware, which can be used to cryptographically verify a device’s authenticity during onboarding.
While these protocols establish trust at the point of connection, they cannot guarantee ongoing compliance or detect behavioral deviations after initial authentication. Devices may be misconfigured, compromised, or exhibit malicious behavior even after successful IEEE 802.1X/802.1AR authentication. Therefore, network-side profiling and anomaly detection remain essential to monitor traffic patterns, identify rogue or misbehaving devices, and maintain Zero Trust principles in dynamic IoT environments.
Traditional rule-based device classification methods are computationally efficient and effective at labeling devices as known or unknown based on predefined patterns. However, these systems depend heavily on expert-defined rules and require manual updates to remain effective. Any modification, addition of new device attributes, or emergence of novel device types necessitates rewriting or extending the rule base. This inflexibility limits scalability and adaptability in dynamic IoT environments. Moreover, rule-based methods struggle to generalize across diverse device behaviors, reducing their robustness in complex or evolving networks [10].
In recent years, artificial intelligence (AI) and machine learning (ML) have shown promise in improving IoT device classification using passive traffic analysis [11]. Yet, most conventional ML models grow in complexity as device diversity increases and may require substantial computational overhead or frequent retraining, an impractical requirement for resource-limited or air-gapped environments. To address these limitations, lightweight and explainable models are needed to offer high accuracy without compromising scalability or security posture.
Beyond accurate device classification, unknown device detection is essential for enforcing Zero Trust principles in IoT networks [12]. Zero Trust assumes that no device should be inherently trusted, and continuous verification is required to prevent unauthorized access or lateral movement. Detecting unknown or unregistered devices in real time serves as a first line of defense, enabling the network to block suspicious activity and enforce least-privilege access policies. Given that IoT ecosystems are inherently resource-constrained, implementing unknown detection as an integrated component of the same classification model, rather than deploying it as an additional standalone layer, is critical.
This paper introduces an enhanced IoT device classification framework that modifies the logic of traditional supervised single-label algorithms, such as Naïve Bayes (NB), Decision Tree (DT), Random Forest (RF), and Multilayer Perceptron (MLP), to support multi-class and multi-label classification. This modification enables the grouping of devices from different vendors that, despite branding differences, share similar hardware architectures, firmware components, and network traffic patterns, particularly during the booting phase. Recognizing their behavioral similarity, these devices can be classified and treated uniformly across multiple layers of defense. This unified treatment enables scalable policy enforcement and enhances detection accuracy in anomaly detection systems by aligning baseline expectations to device types rather than individual models. The proposed approach reduces computational overhead and simplifies model deployment in real-world IoT networks. Experimental results demonstrate the feasibility and effectiveness of this method, achieving near 100% accuracy under the multi-label evaluation setting with calibrated thresholds and providing a strong foundation for enforcing Zero Trust Security and Least Privilege access in constrained IoT environments.
The remainder of this paper is structured as follows: Section 2 reviews related work on IoT device classification and unknown device detection. Section 3 presents the overall system architecture and its main components. Section 4 describes the network feature extraction process, including tokenization and vectorization steps. Section 5 outlines the experimental design, dataset composition, and model training and evaluation approach. Section 6 details the probability calculation process used in classification. Section 7 introduces the proposed multi-label classification methodology, while Section 8 explains the unknown device detection method. Section 9 reports and analyzes the experimental results. Section 10 discusses the limitations of the proposed approach and suggests directions for future work. Finally, Section 11 concludes the paper.
2. Related Work
One approach to device classification analyzes traffic packets with deep neural networks, using techniques such as CNNs for feature extraction. This approach requires substantial computational power and training time. Takasaki et al. proposed a two-stage method for device classification that relies solely on traffic behavior [13]. Using ML on packet header statistics, their approach first classifies devices into general categories (IoT, non-IoT, and routers). It then uses deep learning techniques such as LSTM and CNN to classify specific IoT device types by analyzing traffic waveforms. The method demonstrated superior accuracy compared to existing approaches, particularly in diverse environments with mixed device types, making it highly suitable for automated network management in 5G and 6G contexts.
Another approach analyzes traffic over long periods to identify patterns that can be used to classify IoT devices. Classification is slow with this approach because it relies on activity patterns observed during different phases of device operation. Sivanathan et al. introduced a robust framework for classifying IoT devices in smart environments based on network traffic characteristics [14]. Their study involved a testbed of 28 IoT devices over six months, analyzing attributes such as activity cycles, signaling patterns, and cipher suites. Using a multi-stage ML algorithm, they achieved over 99% accuracy in identifying IoT devices. This work highlights the potential of traffic-based classification for monitoring device functionality and security within diverse IoT ecosystems.
Keywords can also be used to classify devices. Although this method is relatively fast and requires less computational power, it may be less accurate, and selecting the right keywords requires a deep understanding of IoT devices, their OS, and unique traffic patterns. Khandait et al. introduced IoTHunter, a network traffic classification framework that identifies IoT devices using device-specific keywords [15]. IoTHunter automatically extracts keywords, such as domain names or device identifiers, from network flows and combines this information with MAC addresses to label traffic accurately. Evaluated on the UNSW dataset [14] of 28 IoT devices collected over six months, the approach demonstrated robust performance, effectively identifying encrypted and non-encrypted traffic while addressing challenges such as overlapping flow characteristics.
Keyword-based approaches are further extended by another classifier, IoTFinder [16], which leverages passive DNS traffic analysis for large-scale IoT device identification. IoTFinder automatically extracts statistical DNS fingerprints and applies TF-IDF vectorization with multi-label classification to handle cases where multiple IoT devices are co-hosted behind the same NAT device. Unlike manual rule-driven methods, IoTFinder is designed for efficient, ISP-scale operation, accurately classifying millions of devices while maintaining low false positive rates. This demonstrates the effectiveness of combining keyword-based extraction with ML and passive classification to enhance IoT device discovery and security monitoring. However, to generate robust fingerprints, several days of DNS traffic are used before deploying the model for classification, which is a limitation for real-time Zero Trust enforcement, as the model cannot classify a new device immediately upon first boot.
Although prior work has demonstrated strong performance in IoT device classification, including approaches that leverage DHCP and DNS metadata, several practical limitations hinder their direct deployment in real-world IoT environments. Many existing methods rely on extended traffic observation windows or aggregated behavioral patterns collected over time. While effective for offline analysis, such requirements delay device identification and are not suitable for scenarios where immediate decisions must be made at the point of network access.
In Zero Trust environments, device classification must occur during the earliest stages of communication, often at boot-time, before full traffic behavior is observable. This work specifically focuses on enabling classification using only the initial DHCP and DNS exchanges, minimizing latency and computational overhead. In addition, the proposed framework integrates multi-label classification and unknown device detection within a single lightweight model, addressing both behavioral overlap between devices and the need for real-time enforcement at IoT gateways.
Although some previous approaches have demonstrated high classification accuracy, several practical limitations hinder their direct deployment in IoT environments. Some methods require significant computational resources, making them unsuitable for edge devices or IoT gateways with constrained processing and memory. Other techniques rely on long observation windows to accumulate sufficient traffic for reliable classification, which conflicts with the Zero Trust principle of denying or restricting traffic until a device is properly verified. Furthermore, certain approaches still exhibit suboptimal accuracy, whereas near-perfect identification is often required to enforce strict network policies without inadvertently blocking legitimate IoT devices or allowing unknown ones.
Another critical aspect of IoT device classification is the detection of unknown devices, which is essential for enforcing Zero Trust principles. A robust classification model should not only classify known devices accurately but also identify devices that were never seen during the training phase as unknown. This may include unknown IoT devices, or even non-IoT devices, all of which pose a potential security risk.
For unknown device detection, one approach is to employ ML-based allow-list techniques that can flag any device not observed during the training phase as unknown. Meidan et al. developed a method that leverages Random Forest classifiers and extended traffic collection windows to learn device behavior, achieving approximately 96% accuracy in identifying previously unseen devices as unknown [17]. While effective, this approach highlights the trade-off between high detection accuracy and the need for long observation periods, which can conflict with the Zero Trust requirement of restricting traffic until a device is validated.
Table 1 presents a comparison of prior works, highlighting the models employed and the performance achieved. While many of these approaches report near-perfect accuracy, they often require extended observation windows to confidently classify devices. This delayed classification contradicts the core principles of Zero Trust, which mandates that traffic should not be permitted until a device is properly validated. Furthermore, device authentication alone does not guarantee appropriate authorization or least-privilege enforcement. An additional layer of network-based classification is essential to accurately classify the device and restrict its communication to only the traffic patterns necessary for its intended function.
Table 1.
Comparison of Related Work on IoT Device Classification and Unknown Detection.
| # | References | Model | Feature(s) Used | Unknown Detection | Time Required | Dataset | Performance |
|---|---|---|---|---|---|---|---|
| 1 | Miettinen et al., 2017 (IoT Sentinel) [18] | RF | 23 traffic features | Yes (assigned level: strict) | N packets, first 12 unique vectors | They created IoT Sentinel dataset | Accuracy per device: 0.5–0.95 |
| 2 | Meidan et al., 2017 [17] | RF | 274 traffic features | Yes | More than 50 days for some device types | Private lab-collected dataset | Accuracy for known: 94%, for unknown: 97% |
| 3 | Sivanathan et al., 2019 [14] | 2 stages. NB + RF | Statistical attributes | No | 1–16 days training | They created UNSW dataset | Accuracy: 99.76% after 16 days |
| 4 | Khandait et al., 2020 (IoTHunter) [15] | Rule-based (if statements) | Device-specific keywords | Includes Unclassified Label | 6 months of data | UNSW [14] | Recall per device type: 0.38–1 |
| 5 | Bao et al., 2020 [19] | RF + OPTICS + autoencoder | 297, 234, 179 flow-based features | Yes (with anomaly detection) | Same as [17] | Same as [17] | 91.2%, 92.9%, 81.8% based on the number of features |
| 6 | Perdisci et al., 2020 [16] | Multi-label | DNS queries and their statistical probabilities | Yes | A few days–2 weeks | IoTDNS, PDNS, LDNS, TrPoTDNS (not shared) | 50/52 with 0.1% FP, 46/52 with 0.01% FP |
| 7 | Ali et al., 2021 [20] | Multiple: RF, DT, SVM, NB, KNN, AD | 20 traffic features from PCAP via NFStream | No | Not provided | UNSW [14], YourThings, deNAT, Public PCAPs | DT on YourThings: Accuracy 99%, F1-score 98% |
| 8 | Dadkhah et al., 2022 [21] | Various (NB, DT, LDA, RF, XGBoost, etc.) | 48 traffic features | No | 3 months | CIC + US Lab dataset | Accuracy: 98.7% AdaBoost (AD) |
| 9 | Takasaki et al., 2023 [13] | 2 stages. Stage 1: Multiple models. Stage 2: MLP + LSTM + CNN | Packet header stats | No | 10m + 10m | UNSW [14] + Private | RF: Accuracy 97.6%/F1-score 92.8% |
| 10 | Zhang et al., 2024 (DevRF) [22] | CNN-BiLSTM, LSTM, GRU, KNN, DT, RF | Protocol info of each layer | No | 1 day | Public IoT, UNSW [14] | Accuracy: 91.19% |
3. System Overview
The proposed classification system is designed to operate at an IoT gateway, focusing on lightweight metadata collection and early enforcement of access policies under Zero Trust principles. To minimize computational overhead and accelerate classification, the system processes only a small subset of network traffic, specifically DHCP and DNS packets, captured during device boot-up. These packets can be collected out of band by applying techniques like DHCP relay and DNS forwarding.
If a device type cannot be confidently identified due to identical or highly similar features shared with other device classes, the system produces multiple labels, representing all possible classifications with comparable probability. This approach achieved near 100% accuracy under the multi-label evaluation setting with calibrated thresholds by ensuring that legitimate devices are not incorrectly blocked, while still adhering to the least-privilege principle. Instead of fully trusting the device, traffic is restricted to the combined policies for all candidate labels, thereby preventing unauthorized communications. Concepts like MUD can be applied to immediately enforce predefined access rules for each label, allowing only the traffic required by the potential device types and denying any other communication.
Figure 2 illustrates the overall workflow of the proposed system operating at the IoT gateway. Upon device connection, early-stage DHCP and DNS metadata are processed by a lightweight classifier (NB, DT, RF, or MLP) to produce a probability distribution over all known device classes. These probabilities are sorted in descending order, and the class with the highest probability is first selected. Additional candidate labels are then iteratively included if their probabilities fall within a predefined threshold relative to the maximum probability, enabling multi-label classification in cases of behavioral similarity among devices.
Once the set of candidate labels is determined, the system evaluates the confidence of the prediction based on the number of selected labels and their cumulative probability. If the prediction exhibits high uncertainty (e.g., many labels with low combined probability or uniform distribution), the device is flagged as unknown and denied access. Otherwise, the selected labels are forwarded to a central policy server, which enforces least-privilege access by applying the union of policies associated with the candidate device types. The resulting policy is then sent back to the IoT gateway for real-time enforcement, ensuring secure and adaptive access control aligned with Zero Trust principles.
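The final decision step described above can be illustrated with a minimal sketch. The policy table, the label names, and the two confidence parameters (`max_labels`, `min_cumulative`) are hypothetical placeholders chosen for illustration; the text only states that the number of selected labels and their cumulative probability are considered, so the concrete test below is an assumption rather than the exact rule used by the system.

```python
from typing import Dict, List, Set, Tuple

Rule = Tuple[str, str]  # (protocol/port, destination)

# Hypothetical per-label allow-lists, e.g. derived from MUD profiles.
POLICIES: Dict[str, Set[Rule]] = {
    "camera_a": {("udp/53", "dns.local"), ("tcp/443", "cloud.camera-a.example")},
    "camera_b": {("udp/53", "dns.local"), ("tcp/443", "cloud.camera-b.example")},
}


def decide(selected: List[Tuple[str, float]],
           max_labels: int = 3,
           min_cumulative: float = 0.6) -> Tuple[str, Set[Rule]]:
    """Deny uncertain predictions; otherwise allow the union of candidate policies."""
    cumulative = sum(prob for _, prob in selected)
    # Assumed uncertainty test: too many labels or too little combined probability.
    if not selected or len(selected) > max_labels or cumulative < min_cumulative:
        return "deny", set()
    allowed: Set[Rule] = set()
    for label, _ in selected:
        allowed |= POLICIES.get(label, set())
    return "allow", allowed
```

With two near-tied camera labels, the device is admitted but restricted to the rules required by either candidate type; a flat distribution spread over many labels is denied.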
3.1. Dataset
We utilized the CIC IoT Dataset 2022 [21], a comprehensive dataset collected by the Canadian Institute for Cybersecurity (CIC) at the University of New Brunswick. This dataset comprises packet captures from 31 distinct IoT devices in a controlled lab environment.
The CIC IoT Dataset 2022 is widely used in IoT research and is designed to reflect realistic device behavior in controlled yet representative environments. Detailed information about device diversity, including vendors, device categories, and deployment scenarios, is provided in the original dataset publication [21]. The dataset includes devices from multiple manufacturers and functional categories, capturing variations commonly observed in real-world IoT deployments.
3.2. Traffic Capture and Filtering
Raw network traffic is collected from .pcap files recorded during the power-up sequence of 40 IoT devices, each with three capture sessions. To streamline processing and eliminate noise from local Layer 2 discovery protocols (e.g., ARP, mDNS), the system filters traffic to retain only DHCP and DNS packets. These packets represent the earliest externally visible communication initiated by the device, occurring before any IP resolution or application-layer access. Traffic before DHCP can be blocked by using filtering and microsegmentation techniques.
3.3. Metadata Extraction
A custom pipeline was implemented using TShark (version 4.4.6) and PyShark (version 0.6) to extract protocol-specific metadata from packet capture (.pcap) files. TShark was used to efficiently filter traffic and retain only DHCP and DNS packets, while PyShark enabled structured parsing of protocol fields. For DHCP packets, the system extracts Option 55 (Parameter Request List) and Option 60 (Vendor Class Identifier), along with the source MAC address, which is mapped to the corresponding manufacturer using the IEEE OUI registry. For DNS packets, the first query name generated by the device during initialization is extracted and tokenized.
All extracted attributes are treated as categorical tokens. Each packet capture is then represented as a set of unique tokens, which are concatenated into a space-separated string to form a behavioral fingerprint for the device. This representation is subsequently used as input for vectorization and model training.
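As an illustration of the filtering and field-extraction step, the sketch below builds a TShark command line for one capture file. It is not the pipeline used in this work: the display-filter expression and the field names (which follow the `dhcp` dissector naming used by Wireshark 3.0 and later) are assumptions, and the helper name `tshark_command` is hypothetical.

```python
from typing import List

# Assumed Wireshark field names for the attributes described in the text.
FIELDS = [
    "eth.src",                        # source MAC address (for the OUI lookup)
    "dhcp.option.request_list_item",  # Option 55 entries
    "dhcp.option.vendor_class_id",    # Option 60
    "dns.qry.name",                   # DNS query name
]


def tshark_command(pcap_path: str) -> List[str]:
    """Build a tshark argv that keeps DHCP packets and DNS queries only."""
    cmd = [
        "tshark", "-r", pcap_path,
        "-Y", "dhcp || (dns && dns.flags.response == 0)",
        "-T", "fields",
    ]
    for field in FIELDS:
        cmd += ["-e", field]
    return cmd
```

The resulting list can be passed to `subprocess.run(..., capture_output=True, text=True)`, after which each output line holds the tab-separated field values of one packet.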
In the first experiment, only DHCP request packets were analyzed, where the system:
Extracts DHCP Option 55 (Parameter Request List).
Extracts DHCP Option 60 (Vendor Class Identifier).
Retrieves the MAC address from the DHCP header and maps its OUI to a manufacturer identity using the Python (version 3.11.9) library mac_vendor_lookup (version 0.1.12), which relies on the official IEEE OUI registry.
DHCP is specified in IETF RFC 2131 [23]; the individual options, including Options 55 and 60, are defined in the companion RFC 2132.
In the second experiment, the first DNS request packet is also processed. DNS query names are tokenized to represent service access intent. All metadata is encoded into bag-of-words representations, enabling vectorization for ML model input.
3.4. Enforcement in Real-Time
The design assumption is that devices do not have static configurations or cached service information; all access must proceed through DHCP and DNS. By classifying the device at the point of the DHCP request or DNS lookup, the gateway can:
This architecture allows real-time classification and policy enforcement while preserving privacy and minimizing latency, making it ideal for constrained IoT environments.
3.5. Experimental Environment
All experiments were conducted on a local machine with the following specifications:
Operating System: Windows 10 Pro, Version 10.0.19045 (Build 19045);
Processor: Intel Core i7-10700 CPU @ 2.90 GHz, 8 cores, 16 threads;
Memory: 32 GB RAM (31.8 GB usable).
This configuration ensured sufficient computational capacity to simulate edge-classification workloads without reliance on cloud resources.
4. Feature Extraction from Network Traffic
The system implements a hybrid pipeline using TShark for efficient packet capture and filtering, and PyShark for structured protocol-aware parsing of .pcap files. After per-packet extraction, the features from each .pcap file are aggregated into a single space-separated string. This new structured dataset becomes the input to the ML model as described in Figure 3.
4.1. Tokenization
All extracted attributes are stored in sets to eliminate duplicates and then concatenated into a single space-separated string of tokens representing the early behavior fingerprint of the device. Each DHCP option and each domain name is treated as one token. This string is the input feature used to train the ML models for both the device classification and unknown detection tasks.
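The tokenization step can be sketched as follows. The `(kind, value)` record layout, the `kind:value` token format, and the replacement of internal whitespace with underscores are illustrative assumptions made so that each attribute remains a single token; sorting is added only to make the output reproducible.

```python
from typing import Iterable, Tuple


def build_fingerprint(records: Iterable[Tuple[str, str]]) -> str:
    """Turn per-packet (kind, value) attributes into one space-separated token string."""
    tokens = set()  # a set removes attributes repeated across packets
    for kind, value in records:
        # Collapse whitespace inside a value so it cannot split into several tokens.
        tokens.add(f"{kind}:{'_'.join(str(value).split())}")
    return " ".join(sorted(tokens))


# Hypothetical attributes extracted from one boot capture.
example = [
    ("opt55", "1,3,6,15"),
    ("opt60", "udhcp 1.30.1"),
    ("oui", "ExampleVendor Inc."),
    ("dns", "time.example.net"),
    ("opt55", "1,3,6,15"),  # duplicate from a second DHCP request
]
print(build_fingerprint(example))
```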
4.2. Vectorization
Two vectorization strategies are applied:
Count Vectorization (CV): Captures token occurrence frequency but not importance across the dataset.
Term Frequency-Inverse Document Frequency (TF-IDF): TF-IDF is a numerical weight that reflects how important a term is in a document relative to a collection of documents.
The data preparation pipeline is described in Figure 4. Oversampling is essential to avoid class bias, because the probabilities produced by decision-tree-based models depend on the number of samples in each branch.
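The difference between the two vectorization strategies can be seen in a small pure-Python sketch. It uses the textbook weighting tf × ln(N/df) for illustration only; scikit-learn's TfidfVectorizer applies a smoothed IDF and L2 normalization by default, so its numerical values differ from those computed here. The function names and the toy fingerprints are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def count_vectors(docs: List[str]) -> List[Counter]:
    """Count Vectorization: raw token occurrences per fingerprint."""
    return [Counter(doc.split()) for doc in docs]


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Textbook TF-IDF: tokens shared by every fingerprint receive zero weight."""
    counts = count_vectors(docs)
    n_docs = len(docs)
    # Document frequency: number of fingerprints containing each token.
    df = Counter(tok for c in counts for tok in c)
    return [
        {tok: tf * math.log(n_docs / df[tok]) for tok, tf in c.items()}
        for c in counts
    ]


docs = ["opt55_a oui_x", "opt55_a oui_y"]
cv = count_vectors(docs)
tfidf = tfidf_vectors(docs)
```

In this toy case both fingerprints share `opt55_a`, so CV assigns it the same count as the vendor token, whereas TF-IDF down-weights it in favor of the tokens that actually separate the two devices.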
5. Experimental Design
The experiment was conducted to evaluate the feasibility and accuracy of classifying IoT devices and detecting unknown devices using the same model. The design focuses on realistic deployment constraints such as constrained resources, minimal data collection, passive observation, and early decision-making.
5.1. Data Collection Environment
The dataset was constructed by capturing .pcap files from 40 IoT devices spanning 31 unique device types during their initial power-up or reboot cycle. Traffic was captured using Wireshark and programmatically processed using TShark and PyShark. Each .pcap file represents a short traffic capture, with only DHCP and DNS packets filtered for analysis.
5.2. Experimental Scenarios
Two separate experiments were performed to assess the contribution of each feature group:
Experiment 1: Used only DHCP attributes (Option 55, Option 60) and MAC OUI as input features.
Experiment 2: Included DNS domain names in addition to DHCP and OUI features.
This allowed the system to isolate the effect of DNS resolution behavior on classification performance and its potential for early anomaly detection.
5.3. Dataset Composition
The final dataset included 120 labeled records, each representing a unique device boot sequence. For each record:
All features were treated as categorical tokens, allowing direct use with count-based and TF-IDF vectorizers.
Each record in the dataset corresponds to a full device power-up sequence captured independently. As defined in the CIC IoT Dataset 2022, devices are powered on individually and traffic is recorded in isolation during boot-up. Therefore, each sample represents an independent observation of early-stage device behavior.
It is important to note that multiple samples may originate from the same physical device, as the dataset includes repeated boot-time captures. Consequently, samples from the same device may appear in both training and testing sets under the current random split.
This design reflects generalization across independent boot sessions rather than across multiple physical devices. A stricter split at the device-instance level was not feasible, as several device types in the dataset are represented by only a single physical device.
5.4. Model Training and Evaluation
Eight model configurations (four classifiers, each paired with two vectorization techniques) were evaluated using a stratified 70/30 train–test split. To address class imbalance, the training set was oversampled prior to model fitting. The configurations, combining each classifier with CV and TF-IDF, were as follows:
Naive Bayes (NB-CV, NB-TFIDF)
Decision Tree (DT-CV, DT-TFIDF)
Random Forest (RF-CV, RF-TFIDF)
Multi-Layer Perceptron (MLP-CV, MLP-TFIDF)
All classifiers employed in this study, NB, DT, RF, and MLP, were implemented using Scikit-learn with default settings.
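The oversampling applied to the training split can be sketched as below. The resampling method is not named in this section, so random duplication of minority-class samples is assumed here; the function name and the sample layout are hypothetical. The function is meant to be applied to the training portion only, so that duplicated samples never reach the test set.

```python
import random
from collections import defaultdict
from typing import List, Tuple

Sample = Tuple[str, str]  # (fingerprint string, device label)


def random_oversample(samples: List[Sample], seed: int = 0) -> List[Sample]:
    """Duplicate minority-class samples until every class matches the largest one."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for fingerprint, label in samples:
        by_label[label].append((fingerprint, label))
    target = max(len(group) for group in by_label.values())
    balanced: List[Sample] = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing samples with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced
```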
Table 2 summarizes the main preprocessing, data-splitting, resampling, and model hyperparameters used in the experiments. Each model was tested for its ability to:
Performance metrics included accuracy, top-N prediction confidence, and probability distribution characteristics across classes.
5.5. Unknown Device Detection
In addition to device classification, the system was evaluated for its ability to detect and mark unknown IoT devices, that is, devices not seen during training. Unknown device detection was evaluated using an open-set recognition technique [24], in which a subset of known device types was deliberately excluded from the training set to simulate unseen devices during inference. The model’s response to these masked samples was then analyzed to determine its effectiveness in recognizing out-of-distribution behavior.
To capture the diversity of real-world scenarios, the evaluation considered three distinct unknown device cases:
Unique unknown device: The unknown device shares no features with any of the known training devices.
Partially overlapping device: The unknown device shares one or more features with devices in the training set.
Indistinguishable unknown device: The unknown device is identical or nearly identical to one or more known devices.
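The masking procedure and the three cases listed above can be illustrated with a short sketch. The helper names are hypothetical, and the case assignment is a simplification: it compares token sets exactly, so it captures identical fingerprints but not the "nearly identical" situation, which would require a similarity measure.

```python
from typing import List, Set, Tuple

Sample = Tuple[str, str]  # (fingerprint string, device label)


def mask_device_types(records: List[Sample],
                      unknown_types: Set[str]) -> Tuple[List[Sample], List[Sample]]:
    """Split records into a known training pool and held-out 'unknown' samples."""
    known = [r for r in records if r[1] not in unknown_types]
    unknown = [r for r in records if r[1] in unknown_types]
    return known, unknown


def overlap_case(unknown_fp: str, known_fps: List[str]) -> str:
    """Assign an unknown fingerprint to one of the three evaluation cases."""
    tokens = set(unknown_fp.split())
    known_sets = [set(fp.split()) for fp in known_fps]
    if any(tokens == ks for ks in known_sets):
        return "indistinguishable"
    if any(tokens & ks for ks in known_sets):
        return "partially overlapping"
    return "unique"
```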
6. Probability Calculation
For each input feature vector $x$, Scikit-learn returns a class probability distribution via the predict_proba() method such that:
$$\sum_{k=1}^{K} P(y = k \mid x) = 1,$$
where $K$ is the total number of classes.
This normalized posterior distribution forms the basis of the multi-label prediction logic in this work. The probability computation is model-specific and calculated as follows:
These class probabilities are subsequently used to implement a threshold-based multi-label selection strategy, wherein classes whose probability is within a specified margin (e.g., 5%) of the top predicted class are included in the output set.
7. Multi-Label Classification Methodology
The classification framework adopted in this study extends traditional single-label learning to a multi-label setting, enabling each device to be associated with multiple potential classes based on its behavioral fingerprint. This approach addresses a key challenge in IoT environments: multiple devices from different vendors may exhibit nearly identical network behaviors due to shared hardware components, operating systems, or firmware. As a result, feature overlap is not only common but expected, especially in early-stage traffic such as DHCP and DNS exchanges. The multi-label model reflects this ambiguity by accommodating prediction uncertainty and capturing all plausible class associations for a given input instance.
This design is particularly aligned with Zero Trust enforcement, where the objective is not strict device identification but safe policy assignment. Devices that share similar behavioral fingerprints, even across different vendors or product names, typically require comparable network access profiles. The multi-label approach enables grouping such devices and applying a shared least-privilege policy, ensuring that access is restricted to only the necessary services while avoiding over-permissive or overly restrictive decisions.
Instead of selecting a single most probable class label, the system uses probabilistic thresholding to identify a set of candidate labels. For each input feature vector $x$, the model returns a vector of probabilities:
$$\mathbf{p}(x) = [p_1(x), p_2(x), \ldots, p_K(x)]$$
where $K$ is the total number of classes and $p_k(x)$ is the estimated probability that input $x$ belongs to class $k$.
To convert this into a multi-label decision:
The class with the highest predicted probability is identified:
$$k^{*} = \arg\max_{k} \, p_k(x)$$
All classes $k$ such that:
$$\frac{p_{k^{*}}(x) - p_k(x)}{p_{k^{*}}(x)} \le \delta$$
are retained, where $\delta$ is a tunable probability drop threshold.
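The selection rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the drop is measured relative to the top probability, and the function name `select_labels` and the example probabilities are hypothetical.

```python
def select_labels(probs, delta):
    """Return indices of classes whose relative drop from the top
    probability does not exceed delta (assumed relative-drop rule)."""
    p_max = max(probs)
    if p_max <= 0:
        return []  # degenerate input: no class has any probability mass
    return [k for k, p in enumerate(probs) if (p_max - p) / p_max <= delta]


# Hypothetical probability vector for one sample over four classes.
probs = [0.48, 0.46, 0.04, 0.02]
print(select_labels(probs, delta=0.05))  # classes 0 and 1 are retained
```

With a larger delta more classes enter the output set, which is why the threshold has to be matched to how sharply each model's probabilities fall off.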
The probability drop threshold is calibrated based on the probability distribution characteristics of each model. The goal of this calibration is to ensure that the selected set of labels reflects meaningful alternatives while avoiding the inclusion of low-confidence classes.
For each model, the ranked class probabilities are analyzed to evaluate how rapidly the probability decreases from the top predicted class. This relative probability drop provides an indication of how concentrated or dispersed the model’s confidence is across classes. Based on this observation, $\delta$ is determined per model using the empirical distribution of relative probability drops observed across all device classes. This results in a model-specific range rather than a single fixed value. The selected value is then chosen within this range to balance label selectivity and robustness to feature overlap.
The detailed analysis of probability decay behavior and the resulting calibrated $\delta$ ranges for each model are presented in Section 9.
The output is a set of class labels per instance, which are then analyzed to:
8. Unknown Device Detection Methodology
The proposed framework integrates unknown device detection into the multi-label classification pipeline by analyzing the distribution of class probabilities produced by standard ML models. The detection mechanism assumes that legitimate devices will produce confident, concentrated predictions, while unknown devices will result in low-confidence or dispersed probability distributions.
8.1. Probability-Based Detection
Each test sample is processed through the trained classifier to obtain a probability vector over all known classes. The following methodology is applied:
Top-N Thresholding:
A probability thresholding strategy selects all classes whose probabilities fall within N% of the highest class probability. This accounts for prediction uncertainty and reflects behavioral similarity among device classes.
Cumulative Probability Scoring:
The probabilities of the selected classes are summed to compute a confidence score. Lower cumulative scores indicate higher uncertainty.
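The two steps above can be sketched as follows. This is an illustrative sketch only: it assumes "within N% of the highest class probability" means a relative cutoff, and the function name `confidence_score` and both example vectors are hypothetical.

```python
def confidence_score(probs, n_percent):
    """Select classes within n_percent of the top probability and
    return (selected class indices, summed probability of that set)."""
    p_max = max(probs)
    cutoff = p_max * (1.0 - n_percent / 100.0)
    selected = [k for k, p in enumerate(probs) if p >= cutoff]
    score = sum(probs[k] for k in selected)
    return selected, score


# Hypothetical known-like sample: mass concentrated on two identical classes.
print(confidence_score([0.5, 0.5, 0.0, 0.0], n_percent=5))

# Hypothetical unknown-like sample: mass spread thinly over many classes.
print(confidence_score([0.2, 0.2, 0.15, 0.15, 0.15, 0.15], n_percent=5))
```

In the first case the selected set carries all of the probability mass; in the second it carries well under half, which is the low-confidence pattern the detection step looks for.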
This design eliminates the need for a separate anomaly detection model, enabling real-time unknown device detection within the classification stage itself.
8.2. Model Behavior and Observations
NB and MLP consider all input features during probability estimation.
- NB assumes conditional independence between features and estimates posterior probabilities using Bayes’ theorem.
- MLP applies a softmax layer at the output, yielding normalized probability vectors.
DT classifiers, implemented using the Classification and Regression Trees (CART) algorithm with Gini impurity, adopt a greedy splitting strategy, selecting the most discriminative features at each node.
This local optimization makes DTs prone to overconfident predictions, especially when feature overlap exists between known and unknown devices.
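The Gini impurity at a node is defined as:
$$G(t) = 1 - \sum_{k=1}^{K} p(k \mid t)^2$$
where $t$ is the node; $K$ is the number of classes; $p(k \mid t)$ is the proportion of samples at node $t$ that belong to class $k$.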
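As a small worked illustration of the standard Gini impurity formula, the sketch below computes it from the class labels that reach a node. The function name and the example device labels are chosen here for illustration only.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0  # an empty node is treated as pure
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# A pure node has impurity 0; an even two-class split has impurity 0.5.
print(gini_impurity(["lgtv", "lgtv", "lgtv"]))
print(gini_impurity(["gosund", "globelamp"]))
```

A leaf that contains samples of only one class has zero impurity and therefore reports a probability of 1.0 for that class, which is one way to see why a single tree can look fully confident on any input routed to that leaf.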
RF, as an ensemble of decision trees, mitigates overfitting and individual tree bias by:
- Averaging predictions from multiple randomized trees;
- Introducing feature randomness at each split.
As a result, RF achieved the best balance between classification accuracy and unknown device detection precision, exhibiting high confidence for known devices and low probabilities for unknown devices.
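The effect of averaging can be illustrated with a toy example. The sketch below assumes each tree outputs a probability vector and that the forest reports their mean, which is how Scikit-learn's random forest forms its class probabilities; the per-tree outputs shown are hypothetical.

```python
def average_tree_probs(tree_outputs):
    """Average per-tree class probability vectors into one forest-level vector."""
    n_trees = len(tree_outputs)
    n_classes = len(tree_outputs[0])
    return [sum(t[k] for t in tree_outputs) / n_trees for k in range(n_classes)]


# Known-like sample: every tree agrees, so the forest stays confident.
agree = [[1.0, 0.0, 0.0, 0.0]] * 4
print(average_tree_probs(agree))

# Unknown-like sample: each tree is individually certain but they disagree,
# so the averaged distribution is flat and no class dominates.
disagree = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
print(average_tree_probs(disagree))
```

Disagreement between individually overconfident trees is what turns into a low top probability at the forest level.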
9. Results and Evaluations
It is important to distinguish between single-label and multi-label evaluation in the reported results. The baseline single-label performance presented in Section 9.1 reflects standard classification metrics, while the near 100% accuracy reported in subsequent sections corresponds to the multi-label evaluation setting, where multiple labels are selected using model-specific calibrated $\delta$ thresholds. This distinction is critical, as the multi-label framework allows multiple valid class assignments in cases of feature overlap.
9.1. Baseline Single-Label Performance Metrics
All eight models evaluated in this study were trained and tested using the same dataset, pre-processing pipeline, and multi-label evaluation criteria (except for the $\delta$ value, as discussed earlier). Consequently, they share the same confusion matrix, shown in Figure 5, and the same baseline classification metrics. It is important to note that these baseline metrics reflect the inherent ambiguity in the dataset: several IoT devices exhibit identical network behavior because they use the same hardware, software, and service initialization sequences despite being marketed under different vendor labels. This leads to overlapping feature representations and necessitates multi-label classification.
The averaged results across five folds are as follows:
Accuracy: 87.10%.
Precision: 70.54%.
Recall: 75.00%.
F1 Score: 71.25%.
Although DNS query data was included in the second experiment to improve device differentiation, it did not result in a measurable improvement in classification performance. This suggests that devices sharing the same MAC OUI and DHCP options often begin by contacting the same domains, following similar initialization patterns. As a result, the addition of DNS data did not contribute meaningful discriminatory power during classification. However, the main difference between the models was in processing time for training and testing, as shown in
Table 3, which was primarily influenced by both the model architecture and the feature vectorization method.
The observed classification performance and confusion patterns, as shown in
Figure 5, are strongly influenced by the degree of feature similarity among devices. While the classification accuracy remains consistent across implementations, minor variations in precision, recall, and F1-score may occur for the MLP model due to stochastic training effects and evaluation conditions. These differences do not impact the overall conclusions or comparative analysis presented in this work. Based on their network behavior class constructed from MAC OUI, DHCP options, and DNS domains, the devices can be broadly categorized into three groups:
Group 1—Four Identical Devices:
Four devices (teckin, yutron, heimvisionlamp, and atomicoffeemaker) exhibited identical feature sets.
Group 2—Two Identical Devices:
gosund and globelamp also displayed matching features.
Group 3—Unique Devices:
Most of the remaining unique devices exhibited partial overlap in features with one or more other devices. While they had distinguishing attributes, shared elements in DHCP requests, DNS requests, or vendor identifiers led to occasional confusion. Notably, amcrestcam and lgtv were the only devices with fully unique fingerprints, showing no overlapping features with any other device in the dataset. These two served as control cases for evaluating classification and unknown device detection under ideal conditions.
9.2. Multi-Label Framework
To evaluate the confidence levels of the classification models under a multi-label framework, the average sum of probabilities for the selected label sets was computed for each device class. The decision tree-based models (DT-CV, DT-TFIDF, RF-CV, and RF-TFIDF) and the multilayer perceptron variants (MLP-CV and MLP-TFIDF) consistently achieved average summed probabilities near 1.0 across all classes. This indicates that these models exhibit high confidence in their predictions, even when multiple labels are considered. The multi-label results presented in this section are based on model-specific $\delta$ values, which were calibrated using the probability distribution analysis described in Section 9.3.
As an illustrative example, the RF-CV model (Figure 6) maintains summed probabilities close to 1.0 across all device classes while keeping the number of selected labels (N) low, demonstrating well-calibrated and confident predictions.
In contrast, the Naïve Bayes classifiers (NB-CV and NB-TFIDF) showed greater variability in the summed probabilities per class, with some classes reaching only around 0.6, as shown in
Figure 7. This behavior is attributed to the probabilistic formulation of Naïve Bayes, which tends to distribute probability mass across multiple classes when feature overlap exists, resulting in less concentrated and lower-confidence predictions. These results highlight the robustness of ensemble-based and deep models in generating well-calibrated probabilistic outputs for multi-label decision making.
To further examine this behavior, the minimum average summed probability across all device classes was computed for each model. This metric captures the lowest confidence observed per model and highlights how each classifier behaves under the most ambiguous class conditions.
Table 4 summarizes the class associated with the minimum average probability for each model, along with the corresponding number of selected labels (N). Consistent with the previous observations, the decision tree-based models (DT-CV and DT-TFIDF) maintain a minimum summed probability of 1.0, indicating that they assign full confidence even to their least certain class predictions. Similarly, the Random Forest models also achieve minimum values of 1.0, reflecting consistently high confidence across all classes. The multilayer perceptron models (MLP-CV and MLP-TFIDF) exhibit minimum summed probabilities close to 1.0 (99.08% and 99.07%, respectively), further confirming their strong confidence behavior.
In contrast, the Naïve Bayes models show significantly lower minimum summed probabilities, with NB-CV reaching 66.44% and NB-TFIDF dropping to 35.78%. This behavior aligns with earlier observations, where Naïve Bayes distributes probability mass across multiple classes when feature overlap exists, resulting in reduced confidence for certain device classes. These results reinforce that DT, RF, and MLP models consistently produce high-confidence predictions, while NB reflects higher uncertainty under challenging classification conditions.
9.3. Delta ($\delta$) Calibration
To validate the proposed $\delta$ calibration strategy, we analyze the probability distribution behavior of each model across ranked class predictions. This analysis provides empirical evidence of how probability mass is distributed and supports the selection of model-specific $\delta$ values.
Figure 8 shows the relative drop from the maximum probability across ranked class predictions for all models. Tree-based models such as Decision Tree (DT) and Random Forest (RF) exhibit an abrupt drop, where the top-ranked class captures nearly all the probability mass. In contrast, Naïve Bayes (NB) demonstrates a more gradual decline, indicating that multiple classes retain comparable probability values. Multi-Layer Perceptron (MLP) models show intermediate behavior, with smoother decay compared to tree-based models but more concentrated than NB.
To quantify this behavior, the empirical range of $\delta$ values was computed for each model based on the minimum and maximum relative probability drops observed across all device classes.
Table 5 summarizes these calibrated ranges.
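One way to compute such a range is sketched below. The exact procedure behind Table 5 is not spelled out in the text, so this sketch makes an explicit assumption: for each device class it takes the relative drop from the top-ranked to the second-ranked probability and reports the minimum and maximum over classes. The function names and input vectors are hypothetical.

```python
def relative_drops(prob_vector):
    """Relative drop of each lower-ranked probability from the top one."""
    ranked = sorted(prob_vector, reverse=True)
    top = ranked[0]
    if top <= 0:
        return []
    return [(top - p) / top for p in ranked[1:]]


def delta_range(per_class_vectors):
    """(min, max) of the top-to-second relative drop across device classes.
    Assumption: the first drop is the quantity of interest."""
    first_drops = []
    for vec in per_class_vectors:
        drops = relative_drops(vec)
        if drops:
            first_drops.append(drops[0])
    return min(first_drops), max(first_drops)


# Hypothetical average probability vectors for two device classes:
# one shared by two identical devices, one clearly dominated by a single class.
vectors = [[0.5, 0.5, 0.0], [0.9, 0.1, 0.0]]
print(delta_range(vectors))
```

A lower bound of zero appears whenever two classes tie for the top probability, which is the situation produced by devices with identical fingerprints.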
These results confirm that $\delta$ should not be treated as a fixed parameter across models. Instead, it must be adapted to the underlying probability distribution characteristics. In practice, representative values were selected within these calibrated ranges to balance label selectivity and robustness to feature overlap. This calibration plays a key role in improving both multi-label classification stability and unknown device detection performance.
The $\delta$ values presented in Table 6 were used in all multi-label experiments. These values correspond to the minimum calibrated $\delta$ for each model, after applying a step adjustment of 0.05 to avoid $\delta = 0$ and exact boundary values, which can lead to unstable or overly sensitive label selection. This ensures stable and consistent behavior across all models. It is important to note that the near 100% accuracy reported in this study corresponds specifically to this $\delta$-based multi-label evaluation framework.
9.4. Confidence Margin for Unknown Device Detection (CMUD)
To evaluate unknown device detection, three representative scenarios were analyzed based on the level of feature similarity between unseen devices and known classes:
Unique unknown device: The device shares no features with any class observed during training.
Partially overlapping device: The device shares some features with known classes while retaining distinct attributes.
Indistinguishable device: The device exhibits behavior identical to one or more known classes.
These scenarios capture increasing levels of classification ambiguity and allow a structured evaluation of model behavior.
Table 7 presents the sum of probabilities and the number of selected labels (N) for three representative devices: lgtv (unique), roomba (partially overlapping), and teckin (indistinguishable). The number of labels (N) reflects how many classes were selected based on the probability threshold, while the summed probabilities indicate overall model confidence. A large N combined with low cumulative probability suggests that the model is uncertain and distributes probability mass across multiple classes, which is a strong indicator of an unknown device.
The results show that Random Forest models, particularly RF-CV, are the most reliable in identifying unknown devices. For the unique device lgtv, RF-CV produced a low summed probability of 20% with only two selected labels (N = 2). Similarly, for the partially overlapping device roomba, RF-CV yielded a summed probability of 38% with N = 1. These low-confidence values clearly indicate uncertainty and allow the model to correctly flag both devices as unknown.
In contrast, Decision Tree models exhibit overconfident behavior. Both DT-CV and DT-TFIDF assign a single class with 100% probability for all cases, including unseen devices. This behavior is a result of the greedy splitting strategy used in CART, where hard partitions of the feature space lead to confident predictions even when the input does not belong to any known class.
Naïve Bayes and MLP models show less consistent behavior. Naïve Bayes often produces diffuse probability distributions, such as assigning up to 12 labels for lgtv, indicating high uncertainty but reduced discriminative capability. MLP models, on the other hand, tend to produce high confidence predictions even for partially overlapping devices, which limits their effectiveness in detecting unknowns.
The indistinguishable device scenario highlights a fundamental limitation of behavior-based classification. The device teckin shares identical features with multiple known devices, resulting in high summed probabilities across all models. Consequently, all classifiers identify it as a known device, demonstrating that unknown detection is not possible when feature representations are identical.
Overall, these results show that effective unknown device detection depends on both probabilistic uncertainty and feature divergence. Models such as Random Forest, which balance confidence without forcing extreme predictions, provide more reliable detection for unknown and partially overlapping devices.
To further quantify the robustness of unknown device detection, we introduce the Confidence Margin for Unknown Detection (CMUD), a metric designed to capture the separation between known and unknown device confidence levels. While previous analysis relied on the number of selected labels (N) and the cumulative probability, CMUD provides a more explicit measure of how confidently a model distinguishes unknown devices from known classes.
The CMUD metric is defined as the difference between the cumulative probability assigned to the selected labels and a reference confidence level associated with known devices. A higher positive margin indicates strong confidence in classification, whereas low or negative values indicate uncertainty and potential unknown device behavior.
In this analysis, two representative unknown device scenarios were considered: a unique device (lgtv) and a partially overlapping device (roomba).
The indistinguishable device scenario (e.g., teckin) was not included in this analysis, as such devices are behaviorally identical to known classes and cannot be reliably differentiated by any behavior-based model.
Figure 9 presents the CMUD values across all evaluated models for both devices.
The results show clear differences in model behavior:
Naïve Bayes (NB) and MLP-TFIDF produce negative CMUD values for both devices, indicating that these models fail to separate unknown devices from known classes and are therefore unreliable for unknown detection.
MLP-CV demonstrates positive CMUD values, indicating improved capability to identify unknown devices. However, the margin decreases significantly for the partially overlapping device (roomba), suggesting reduced robustness when feature similarity exists.
Random Forest (RF) models achieve the highest and most consistent CMUD values across both scenarios. In particular, RF-TFIDF provides the strongest separation, confirming its robustness in handling both unique and partially overlapping unknown devices.
These observations reinforce earlier findings that ensemble-based models provide better-calibrated probability distributions for open-set conditions.
Based on the observed margins, a practical threshold for unknown device detection can be defined. Rather than using a fixed absolute threshold, the decision boundary can be derived relative to the minimum confidence observed for known devices. Specifically, a conservative margin of approximately 5–10% below the minimum known-device confidence can be used as an initial threshold. This approach provides a safety buffer to account for unseen unknown device variations while minimizing false positives.
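The thresholding idea in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions: the margin is interpreted as absolute percentage points on a 0 to 1 scale, the function names are hypothetical, and the known-device confidence values are made up for the example (only the 0.38 score echoes the roomba result reported for RF-CV).

```python
def unknown_threshold(known_confidences, margin=0.05):
    """Threshold set a fixed margin below the lowest known-device confidence."""
    return min(known_confidences) - margin


def flag_unknown(score, threshold):
    """A sample is flagged as unknown when its confidence falls below the threshold."""
    return score < threshold


# Hypothetical per-class summed probabilities observed for known devices.
known = [0.99, 1.0, 0.97]
threshold = unknown_threshold(known, margin=0.05)

print(flag_unknown(0.38, threshold))  # low-confidence sample
print(flag_unknown(0.98, threshold))  # confident sample
```

Deriving the threshold from the known-device minimum keeps it model-specific, so a model with naturally lower confidence is not penalized by a fixed absolute cutoff.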
Overall, CMUD offers a simple yet effective metric to quantify model confidence behavior in open-set scenarios and provides a systematic way to define detection thresholds for real-world deployment.
Table 8 summarizes the Confidence Margin for Unknown Detection (CMUD) values across all models for two representative unknown device scenarios: a unique device (lgtv) and a partially overlapping device (roomba). Negative CMUD values, as observed in NB and MLP-TFIDF, indicate an inability to distinguish unknown devices from known classes. While MLP-CV achieves positive margins, its performance degrades significantly in the presence of feature overlap. In contrast, Random Forest models consistently produce higher and more stable margins across both scenarios, with RF-TFIDF yielding the strongest separation. These results confirm that ensemble-based models provide more reliable confidence behavior for unknown device detection. The CMUD values reported in this section are derived using the selected calibrated $\delta$ values as defined in Section 9.3. It should be noted that different $\delta$ values may lead to variations in the resulting margins and detection behavior. To provide a more robust and model-agnostic assessment of uncertainty, an entropy-based analysis is also presented in the following section.
9.5. Entropy-Based Discussion for Unknown Device Detection
To further analyze model behavior under uncertainty and complement the probability-based evaluation, an entropy-based analysis was conducted on the predicted class distributions. Entropy provides a quantitative measure of how concentrated or dispersed the probability distribution is across classes. Lower entropy indicates confident predictions, while higher entropy reflects increased uncertainty.
Entropy Experimental Setup
For each model and device instance, entropy was computed from the predicted class probabilities as:
$$H(x) = -\sum_{k=1}^{K} p_k(x) \log p_k(x)$$
To ensure comparability across models, entropy was normalized to the range $[0, 1]$:
$$H_{\text{norm}}(x) = \frac{H(x)}{\log K}$$
where $K$ is the total number of classes.
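Normalized entropy is straightforward to compute from a probability vector. The sketch below uses the standard Shannon entropy divided by log K; the function name and the example vectors are illustrative.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a probability vector, divided by log(K) so that
    the result lies in [0, 1]. Zero-probability terms contribute nothing."""
    k = len(probs)
    if k < 2:
        return 0.0  # entropy is undefined to normalize for a single class
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(k)


print(normalized_entropy([1.0, 0.0, 0.0, 0.0]))      # fully confident prediction
print(normalized_entropy([0.25, 0.25, 0.25, 0.25]))  # uniform prediction
```

A one-hot output gives zero entropy and a uniform output gives the maximum of 1, so the same scale applies regardless of how many classes a model was trained on.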
Two representative unknown-device cases were analyzed: lgtv and roomba. These two devices represent different open-set conditions. The lgtv device is a clear unknown case because its feature pattern is distinct from the known classes. In contrast, roomba is more challenging because its behavior overlaps with known classes and is close to a multi-label scenario. Therefore, entropy is expected to separate lgtv more clearly than roomba.
For each model, the normalized entropy of the unknown device was compared against the maximum normalized entropy observed among the known classes. The entropy margin was computed as:
$$\Delta H = H_{\text{unknown}} - H_{\text{known}}^{\max}$$
where $H_{\text{unknown}}$ is the normalized entropy of the unknown device and $H_{\text{known}}^{\max}$ is the maximum normalized entropy among all known classes for that model. A positive value of $\Delta H$ indicates that the unknown device produced more uncertainty than any known class, which means the model successfully identified it as unknown using entropy. A negative value indicates that the model failed to identify the unknown device based on entropy, since the unknown sample was not more uncertain than the known-device baseline.
Figure 10 shows the normalized entropy distributions for lgtv and roomba across all evaluated models.
The figure shows a clear difference between the two unknown scenarios. For lgtv, most models, except DT, produce a pronounced entropy peak, indicating that the device is recognized as highly uncertain relative to known classes. This confirms that entropy is effective when the unknown device is behaviorally distinct. In contrast, roomba produces much smaller and less consistent entropy increases. Its entropy often remains close to the entropy of known classes, reflecting its partial feature overlap and near-multi-label nature. This makes roomba substantially harder to identify as unknown using entropy alone.
Table 9 summarizes the exact normalized entropy values for both unknown devices, together with the maximum known-class entropy and the resulting entropy margin.
The table confirms that entropy separates the two cases very differently. For lgtv, all models except DT produce positive entropy margins. The strongest separation is obtained by MLP-CV, followed by MLP-TFIDF, NB-CV, RF-CV, and RF-TFIDF (see Table 9 for the exact margins). NB-TFIDF also succeeds, but with a smaller margin of 0.235. In contrast, both DT variants completely fail, producing zero entropy for lgtv and a negative margin. This indicates that DT remains overconfident even for a clearly unknown device.
For roomba, detection is more difficult and much less consistent. Only NB-TFIDF and RF-CV achieve positive margins, and both are small, indicating only weak separation from known classes. RF-TFIDF is nearly neutral: it is very close to the detection boundary but still fails under this criterion. NB-CV, MLP-CV, MLP-TFIDF, and both DT models fail to identify roomba as unknown based on entropy (Table 9 lists the individual margins). This supports the observation that roomba is a challenging case because its behavior overlaps with known classes and resembles a near multi-label condition rather than a fully distinct unknown.
9.6. Out-of-Dictionary Scenario as a Special Case of Unknown Devices
In the previous experiments (Sections 9.1 to 9.5), all evaluated samples were derived from the same feature space as the training data. That is, even when devices were treated as unknown, their features were still part of the learned vocabulary (dictionary). As a result, these experiments represent in-dictionary open-set conditions, where unknown devices exhibit either partial or full overlap with known feature representations.
To evaluate a more general and practically important scenario, we consider the out-of-dictionary case, where a device generates features that are completely unseen during training. In the proposed tokenization and vectorization pipeline (Section 4), such features are not included in the learned vocabulary. Consequently, after vectorization (both CV and TF-IDF), the resulting feature vector becomes an all-zero vector.
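The reason the vector collapses to zeros can be shown with a stripped-down count vectorizer. This is a simplified stand-in for the pipeline's vectorizer, not the authors' code: the function name, the vocabulary, and all token strings are hypothetical. It mirrors the usual behavior of a fitted vocabulary, where tokens not seen during training are ignored at transform time.

```python
def count_vectorize(tokens, vocabulary):
    """Map tokens to a count vector over a fixed, previously learned vocabulary.
    Tokens that are not in the vocabulary are silently dropped."""
    vec = [0] * len(vocabulary)
    for tok in tokens:
        idx = vocabulary.get(tok)
        if idx is not None:
            vec[idx] += 1
    return vec


# Hypothetical vocabulary learned from training devices.
vocab = {"oui:aa-bb-cc": 0, "dhcp:opt55=1,3,6": 1, "dns:example.com": 2}

# A device that reuses known tokens keeps a non-zero fingerprint.
print(count_vectorize(["oui:aa-bb-cc", "dhcp:opt55=1,3,6"], vocab))

# A device whose tokens were never seen produces an all-zero vector.
print(count_vectorize(["oui:11-22-33", "dns:unseen.example"], vocab))
```

Because a zero count vector stays zero after TF-IDF weighting, both representations hand the classifier the same empty input in this scenario.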
This scenario represents the most extreme form of unknown device behavior, where the observed device is entirely outside the learned feature space. Importantly, this case generalizes any situation where new tokens (e.g., unseen DHCP options, vendor identifiers, or DNS domains) are encountered, making it a strong baseline for evaluating open-set robustness.
Out-of-Dictionary Experimental Setup
To simulate this condition, an all-zero input vector was directly provided to each trained model. The resulting probability distributions were analyzed using the same metrics defined earlier:
In this experiment, the selected calibrated $\delta$ values as defined in Section 9.3 were used. The results are summarized in Table 10.
The results reveal fundamentally different behaviors across model families under out-of-dictionary conditions:
Naïve Bayes (NB) produces a uniform probability distribution across all classes, resulting in maximum entropy (1.0). This reflects complete uncertainty and is consistent with the absence of observed features.
Decision Tree (DT) models assign full confidence to a single class ($N = 1$, probability = 1.0), leading to zero entropy. This behavior is caused by the deterministic structure of the tree, which routes all-zero inputs to a fixed leaf node, resulting in overconfident and unreliable predictions.
Random Forest (RF) models significantly reduce confidence (sum probabilities ≈ 0.14–0.15) while maintaining high entropy (>0.8). This indicates uncertainty and better calibration compared to DT, as ensemble averaging prevents extreme predictions.
MLP models exhibit high entropy (>0.92) with low summed probabilities, reflecting a distributed uncertainty across multiple classes. This behavior is consistent with softmax-based probabilistic outputs under ambiguous inputs.
9.7. Decision Criteria and Limitations in Unknown Device Detection
To ensure a clear and reproducible decision pipeline, the final unknown detection decision should be based on a combined evaluation of cumulative probability and entropy. A sample is classified as unknown when it exhibits both lower confidence than known devices and higher entropy than known devices, i.e., when the cumulative probability falls below that observed for known device classes while the entropy exceeds the corresponding known-device range. This condition captures cases where the model lacks a dominant class assignment and provides a consistent and unified decision criterion.
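The combined criterion can be written as a single predicate. The sketch below is one possible reading of the rule, with hypothetical function and parameter names; the reference values for known devices and the sample values are made up for illustration.

```python
def is_unknown_device(cum_prob, entropy, min_known_prob, max_known_entropy):
    """Flag a sample as unknown only when BOTH conditions hold:
    its cumulative probability is below the known-device minimum and
    its normalized entropy is above the known-device maximum."""
    low_confidence = cum_prob < min_known_prob
    high_uncertainty = entropy > max_known_entropy
    return low_confidence and high_uncertainty


# Hypothetical known-device reference values for one model.
MIN_KNOWN_PROB = 0.95
MAX_KNOWN_ENTROPY = 0.40

# Low confidence and high entropy: flagged as unknown.
print(is_unknown_device(0.20, 0.85, MIN_KNOWN_PROB, MAX_KNOWN_ENTROPY))

# Low confidence but entropy inside the known range: not flagged.
print(is_unknown_device(0.38, 0.35, MIN_KNOWN_PROB, MAX_KNOWN_ENTROPY))
```

Requiring both conditions makes the rule conservative: a sample that is low-confidence but not more uncertain than known multi-label cases is left unflagged, which matches the difficulty with overlapping devices discussed next.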
Among the evaluated models, the RF-CV model demonstrated the best overall performance in terms of stability and confidence separation. The proposed criterion is effective for clearly distinguishable unknown devices. However, scenarios involving overlapping features with known devices remain inherently challenging. In such cases, the entropy values tend to be close to those observed in multi-label conditions, reflecting similar levels of uncertainty and ambiguity in the prediction space. As a result, these cases may not consistently satisfy the unknown detection thresholds, making them difficult to distinguish from valid multi-label outcomes. This limitation highlights the need for a layered security approach. In particular, indistinguishable or highly overlapping device behaviors cannot be reliably resolved using early-stage metadata alone. Addressing such cases may require extended device profiling over longer observation periods and the integration of multiple data modalities (e.g., additional traffic features, temporal behavior patterns, or higher-layer protocol characteristics). These complementary mechanisms can enhance separability and improve robustness in real-world unknown device detection scenarios.
10. Limitations and Future Work
While the proposed classification approach demonstrates high accuracy and strong alignment with Zero Trust principles, several limitations must be acknowledged.
The evaluation was conducted using a dataset collected in a controlled laboratory environment with a limited number of IoT devices. While this setup enables systematic analysis, the dataset size and device diversity remain constrained and may not fully capture the heterogeneity of real-world deployments. Accordingly, the presented results should be interpreted as a proof-of-feasibility of the proposed approach under controlled conditions rather than a fully validated solution for large-scale, heterogeneous IoT environments. Further validation on more diverse device populations and real-world network settings is required to assess generalization and robustness.
The DHCP metadata used for classification assumes that certain option fields are consistent across similar device models and operating systems.
While Deep Packet Inspection (DPI) offers granular visibility for behavioral classification, it introduces significant computational overhead, especially in resource-constrained IoT gateways. To mitigate this, DPI processing can be offloaded to dedicated containers or virtual machines deployed at the network edge.
The increasing adoption of encrypted DNS protocols (e.g., DoH, DoT) may reduce visibility into DNS queries over time. Future work may incorporate techniques for encrypted traffic analysis or explore alternative unencrypted metadata features to maintain classification accuracy.
The model may struggle to identify devices that exhibit behavior identical to known types, highlighting a limitation in distinguishing truly unknown devices.
Devices with static IP addresses, static host entries, or cached DNS records cannot be reliably classified under the proposed method and will be denied network access.
Future research can focus on integrating this classification mechanism with other security layers, including device authentication, continuous traffic visibility, policy enforcement, and behavioral anomaly detection. Together, these layers can form a holistic security framework aligned with Zero Trust principles.
The proposed classification system can also be extended to enhance MUD deployments. In particular, it may serve as a trigger to initiate communication with a MUD server in cases where a device does not broadcast a MUD URL via DHCP, thus enabling automated policy retrieval based on the inferred device type. Furthermore, the classification engine can integrate with private MUD controllers or centralized policy enforcement platforms that maintain behavior and access policies for each device category. When a new device is detected, the policy associated with its predicted class can be transmitted to the IoT gateway to enable immediate enforcement.
11. Conclusions
This study demonstrated a lightweight and effective approach for early-stage IoT device classification and unknown device detection using only DHCP and DNS metadata. By extracting protocol-specific features from the first observable packets during device boot-up and representing them as tokenized behavior classes, the proposed system enables real-time analysis with minimal computational overhead.
Experimental results show that high classification accuracy can be achieved even in the presence of feature overlap between devices from different vendors. In particular, the Random Forest (RF) model consistently outperformed other classifiers in both accuracy and unknown device detection performance, striking a favorable balance between robustness and efficiency. While decision tree-based models tend to be overconfident, and Naïve Bayes shows variable probabilistic behavior, the RF model captured non-linear feature interactions effectively.
Unknown device detection was made possible by evaluating the number of labels (N) required to reach a confidence threshold and the total probability mass assigned to those labels. Devices that were behaviorally distinct or only partially overlapping with known devices were correctly identified as unknown. However, devices that were identical in behavior to known types could not be flagged as unknown, highlighting a theoretical limitation for any behavior-based system.
Overall, the dual capability of a single ML model to handle both classification and unknown device detection is a significant advancement for constrained IoT gateways. It reduces architectural complexity, minimizes resource usage, and facilitates early enforcement of Zero Trust principles such as least-privilege access and network microsegmentation. Future work will explore integrating additional passive features and extending the methodology to encrypted traffic scenarios.