1. Introduction
The Internet of Things (IoT) now spans smart homes, wearable healthcare, industrial control, and critical infrastructure. Heterogeneous devices, decentralized architectures, and constrained resources expose IoT networks to diverse attacks. In particular, higher-layer threats such as botnet-driven DDoS, obfuscation-assisted malware evasion, and false-data injection in cyber–physical systems are well documented and have shaped L3–L7-centric defenses [1,2,3]. By contrast, link-layer integrity remains uneven in practice.
We therefore focus on ARP spoofing—an often overlooked Layer 2 vector that poisons IP–MAC bindings in the broadcast domain to enable man-in-the-middle interception, modification, or disruption. While enterprise networks may mitigate the risk with managed switches, VLANs, or Dynamic ARP Inspection, many IoT deployments lack uniform policies and device agents, creating a persistent blind spot. This motivates lightweight, time-windowed link-layer features and resource-aware classifiers tailored for constrained environments.
Static ARP tables, Dynamic ARP Inspection (DAI), and DHCP snooping can provide limited protection. However, these approaches depend on stringent configuration policies and uniform infrastructure controls, which are often infeasible given the limited resources and constantly changing nature of IoT settings. In many cases, IoT deployments operate within loosely managed networks, sometimes behind consumer-grade routers, where infrastructure-level configuration is untenable. Furthermore, static or signature-based defenses perform poorly because IoT devices frequently change their IP or MAC associations as a result of mobility, firmware updates, and power cycling. More adaptive, intelligent, and resource-aware mechanisms are therefore needed to detect spoofing behaviors that exploit ARP's trust-by-default design.
Link-layer attacks such as ARP spoofing often go undetected, whereas attacks at higher layers, such as those over HTTP or TCP, receive far more attention. Current intrusion detection systems (IDS) with deep packet inspection are typically designed and optimized for Layer 3 and above, so they are likely to miss Layer 2 patterns. Strategies for securing IoT must address this blind spot. Moreover, many legacy or reduced-footprint IoT operating systems do not support a host-based firewall agent, and limited CPU and memory make sophisticated anomaly detection impractical on many devices.
These challenges call for lightweight, real-time ARP spoofing detection that does not require heavy feature extraction and processing. Such a system must remain resilient as the network structure changes, as is common in IoT deployments, and must be able to detect previously unseen attack variations.
Although machine-learning approaches show promise for adaptive threat detection, they require a reasonable amount of labelled training data as well as features engineered to capture spoofing-specific behavior. Publicly available datasets and testbeds rarely focus on the ARP layer in IoT. This lack of robust empirical grounding hampers researchers' ability to develop and test their models.
1.1. Motivation and Problem Statement
Although machine learning has yielded good results in many intrusion detection tasks, its application to ARP spoofing detection faces several challenges, chief among them the lack of appropriate datasets. The IoT network datasets available today mainly cover upper-layer attacks such as DDoS, dictionary brute-forcing, or MQTT-based attacks, so ARP-level data are either poorly represented or absent. Where ARP traffic is logged at all, spoofing labels are missing or inadequate, rendering supervised learning unusable.
The ACI-IoT-2023 dataset [4] is a recent dataset that contains raw PCAP files from real or emulated IoT environments and, importantly, records the times when ARP spoofing is believed to take place. However, the features extracted from this dataset target higher-layer attacks and omit essential ARP indicators such as request/reply ratios, MAC–IP anomalies, and timing anomalies. Without these features and labels, researchers must re-extract traffic from the raw PCAPs, manually identify and label spoofing intervals, and engineer domain-specific features, a process that is slow and inconsistent unless conducted systematically.
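To make the labeling step concrete, the following minimal sketch illustrates interval-based labeling of ARP packets from a PCAP using Scapy and pandas. The spoofing intervals, column names, and function name are illustrative placeholders under our assumptions, not the dataset's actual schema or the exact pipeline used in this work.

```python
from scapy.all import ARP, rdpcap
import pandas as pd

# Hypothetical spoofing intervals taken from an attack timesheet,
# expressed as (start, end) epoch seconds; real ACI-IoT-2023 timestamps differ.
SPOOF_INTERVALS = [(1_700_000_000.0, 1_700_001_800.0)]

def label_arp_packets(pcap_path: str) -> pd.DataFrame:
    """Extract ARP packets and mark those inside a documented spoofing interval."""
    rows = []
    for pkt in rdpcap(pcap_path):
        if ARP not in pkt:
            continue
        ts = float(pkt.time)
        rows.append({
            "time": ts,
            "opcode": int(pkt[ARP].op),   # 1 = request, 2 = reply
            "src_mac": pkt[ARP].hwsrc,
            "src_ip": pkt[ARP].psrc,
            "label": int(any(s <= ts <= e for s, e in SPOOF_INTERVALS)),
        })
    return pd.DataFrame(rows)
```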
Thus, the task goes beyond mere categorization. Researchers may locate packets inside a spoofing window, but capturing the essence of ARP spoofing requires time-based analysis. ARP spoofing is rarely a single event: the attacker sends spoofed replies over time so that the victim caches incorrect MAC–IP bindings. To expose this tendency, ARP traffic must be examined through time-window analysis that preserves its temporal characteristics. Likewise, many IoT devices use energy-saving modes that produce momentary traffic spikes or intermittent disconnections, while critical hosts with static IP–MAC bindings provide stable reference points for monitoring the network.
Given these issues, the main objective is to enhance ARP spoofing detection by adding ARP-specific features and labels to an existing IoT dataset and to develop a machine learning pipeline that is resource-friendly and realistically deployable in IoT environments. Such a solution would serve both researchers benchmarking new detection techniques and practitioners who need tools to secure their deployments.
1.2. Motivation for a Feature-Rich ARP Dataset
ARP spoofing detection goes beyond mere labeling; it entails analyzing link-layer behavior. In distributions of packet frequency, opcode, and MAC–IP bindings, ARP spoofing tends to manifest as a subtle anomaly rather than an obvious spike, so engineering and analyzing time-window features is essential. By aggregating ARP packets over a given time interval, one can compute statistics such as the following:
Frequency Ratios: The ratio of ARP requests to ARP replies.
Unique Mappings: The number of unique MAC addresses per IP or unique IPs per MAC.
Timing Metrics: Average inter-arrival times, standard deviation of arrival times, and spikes in ARP replies.
Analyzing packets in isolation rarely reveals these anomalies, whereas the aggregated features make them detectable. An attacker may send only a handful of malicious replies, spaced out over time and hidden among ordinary packets. Time-window aggregation is useful precisely in these cases because it exposes deviations from the baseline variance of normal activity.
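As an illustration of this aggregation, the sketch below derives the frequency-ratio, uniqueness, and timing statistics listed above from a table of per-packet ARP records. Column names follow the earlier labeling sketch and are assumptions; the exact feature set used in the paper may differ.

```python
import pandas as pd

def window_features(arp: pd.DataFrame, w: float = 1800.0) -> pd.DataFrame:
    """Aggregate per-packet ARP records into fixed windows of length w seconds."""
    arp = arp.sort_values("time").copy()
    arp["window"] = ((arp["time"] - arp["time"].min()) // w).astype(int)
    feats = []
    for win, g in arp.groupby("window"):
        replies = int((g["opcode"] == 2).sum())
        requests = int((g["opcode"] == 1).sum())
        iat = g["time"].diff().dropna()  # inter-arrival times within the window
        feats.append({
            "window": win,
            "req_reply_ratio": requests / max(replies, 1),  # frequency ratio
            "reply_count": replies,
            "uniq_mac_per_ip": g.groupby("src_ip")["src_mac"].nunique().max(),  # unique mappings
            "iat_mean": float(iat.mean()) if len(iat) else 0.0,  # timing metrics
            "iat_std": float(iat.std()) if len(iat) > 1 else 0.0,
            "label": int(g["label"].max()),  # window is positive if any packet is spoofed
        })
    return pd.DataFrame(feats)
```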
Engineering these features is not easy, however, particularly for large PCAPs and complex IoT topologies. Device clocks must be synchronized, packet timestamps must be aligned with the documented attack times, and devices that periodically go down (switching off or entering a sleep state) must be handled correctly. This highlights the importance of a robust, systematic ARP extraction and labeling pipeline for creating effective training and testing datasets.
1.3. Research Contributions
This research contributes to the IoT security landscape in several key ways:
Deployment-Oriented Efficiency Profiling. Whereas many studies report only classification metrics (e.g., accuracy, precision, recall, and F1), we add deployment-critical measurements (inference latency, RAM usage, and model size) so that model selection is guided by both detection quality and resource feasibility on IoT/edge devices.
Time-Window ARP Modeling. We treat ARP spoofing as a temporal pattern over link-layer frames and choose the window length according to how long spoofing typically persists in practice. This turns window choice from trial-and-error into an explainable, repeatable setting that remains effective without heavy payload parsing, which is useful for constrained IoT devices.
ARP-Focused Dataset Enablement. We reconstruct ARP-layer labels and derive link-layer indicators from public PCAPs where such features are absent in extracted CSVs, enabling supervised learning and reproducible benchmarking for L2 spoofing.
Reproducible Pipeline with Explanatory Ablations. We document a transparent, scriptable pipeline—from labeling through windowed feature construction to training/evaluation under a fixed stratified split with a held-out test set—and include ablations that explain observed performance patterns rather than reporting aggregates only.
1.4. Organization of the Paper
The remainder of the paper is organized as follows:
Section 2 reviews related work on MITM detection for the IoT and surveys public IoT datasets with respect to their MITM coverage and gaps. Section 3 presents our approach to labeling the data in accordance with the attack timeline. Section 4 describes our experimental setup and evaluation protocol, including ablation and generalization analyses. Section 5 summarizes the detection results across window lengths and classifiers, together with the resulting trade-offs and a snapshot of computational efficiency. Section 6 discusses the implications, limitations, and deployment considerations. Section 7 concludes the paper, and Section 8 outlines possible future work, in particular adaptive time-windows and extensions to other link-layer protocols.
5. Results
This section discusses the performance of the ARP spoofing detection models, covering detection accuracy, computational cost, and behavior across different time-window lengths. The evaluation metrics are classification performance (accuracy, precision, recall, and F1-score) and computational resource consumption (training time, inference latency, model size, and RAM usage). All results were obtained in resource-constrained settings in which model inference was limited to one CPU core.
5.1. Classification Performance Analysis
To evaluate the impact of the time-window length w on model performance, we conducted experiments with w ∈ {60 s, 300 s, 600 s, …, 3000 s}. Larger time-windows provide more temporal information but may also introduce redundant data.
When w exceeds the longest spoofing duration documented in the ACI-IoT-2023 timesheet (∼1800 s), metrics tend to plateau or regress slightly due to temporal dilution, i.e., benign and attack segments being averaged within a single window, so we regard ∼1200–1800 s as a practical operating regime. The dip from 1500 s to the next step is a local transition effect: boundary aliasing briefly mixes benign and spoofed segments, the number of windows falls and window-level class proportions shift, and several discrete indicators (e.g., MAC–IP uniqueness) change non-monotonically. With a slightly larger w, aggregation realigns with spoofing persistence and the metrics recover.
Table 6 summarizes the classification performance of each model across different time-windows. The results indicate that an optimal w exists where the balance between detection accuracy and computational overhead is maintained.
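For reproducibility, the loop below illustrates how such a window-length sweep can be scripted. The classifier, split ratio, and random seed are illustrative stand-ins (scikit-learn's Random Forest instead of the full model set), not the exact protocol of Section 4.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def sweep_window_lengths(arp, lengths=(60, 300, 600, 1200, 1800, 3000)):
    """Re-window the ARP records for each w, retrain, and record the test F1-score."""
    scores = {}
    for w in lengths:
        feats = window_features(arp, w=float(w))  # from the earlier sketch
        X = feats.drop(columns=["window", "label"])
        y = feats["label"]
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=42)
        clf = RandomForestClassifier(n_estimators=100, random_state=42)
        clf.fit(X_tr, y_tr)
        scores[w] = f1_score(y_te, clf.predict(X_te))
    return scores
```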
5.2. Confusion Matrices
To further analyze model performance, Figure 2 presents confusion matrices for the XGBoost and Random Forest models, which demonstrated the best classification accuracy.
5.3. Computational Efficiency Analysis
In addition to classification accuracy, we evaluated the computational efficiency of the models under varying time-window lengths w. Larger w values incorporate more statistical context but also increase RAM consumption. Per-sample inference time, by contrast, is essentially independent of w because prediction operates on a fixed-size feature vector; what grows with larger windows is the pre-aggregation waiting time required to accumulate one window before a prediction can be made, i.e., the external time to first decision.
Table 7 presents the computational cost associated with each model, considering training time, inference latency, model size, and RAM usage across different w values. These results highlight the trade-off between accuracy and resource consumption, which is crucial for real-world IoT deployments.
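The measurement procedure can be approximated with a short harness like the one below. Pinning the process to one core via os.sched_setaffinity is Linux-specific, and the serialized-size and RSS figures are rough proxies for the model size and RAM usage reported in Table 7, not the exact instrumentation used in our experiments.

```python
import os, pickle, time
import psutil

def profile_model(clf, X_test, n_runs: int = 1000):
    """Rough single-core estimates of per-sample latency, model size, and RAM usage."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {0})  # restrict this process to CPU core 0 (Linux only)
    sample = X_test.iloc[[0]] if hasattr(X_test, "iloc") else X_test[:1]
    start = time.perf_counter()
    for _ in range(n_runs):
        clf.predict(sample)  # fixed-size feature vector, independent of w
    latency_ms = (time.perf_counter() - start) / n_runs * 1000.0
    return {
        "latency_ms": latency_ms,
        "model_size_kb": len(pickle.dumps(clf)) / 1024.0,                 # serialized footprint
        "rss_mb": psutil.Process(os.getpid()).memory_info().rss / 2**20,  # resident memory
    }
```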
5.4. Performance Visualization
Figure 3, Figure 4, Figure 5 and Figure 6 visualize the trends of accuracy, precision, recall, F1-score, training time, inference latency, model size, and RAM usage across different time-windows.
6. Discussion
This section analyzes the experimental results, focusing on classification performance, the impact of time-window selection, computational efficiency, and a comparison with existing research on ARP spoofing detection.
6.1. Evaluation of Classification Results
As shown in Table 6, Figure 3 and Figure 4, XGBoost and CatBoost consistently achieve the highest accuracy and F1-score in most time-windows. This confirms that gradient-boosting models are well suited to detecting ARP spoofing attacks in network traffic.
Key observations include the following:
XGBoost and CatBoost: These models delivered the best and most stable performance across most time-windows. For example, at the 1800 s window, both exceed 93% in accuracy and F1-score. Feature importance analysis indicates that anomaly-based indicators such as the ARP reply count were significant for classification.
Random Forest: This model achieves accuracy and recall similar to SVM in mid-to-large time-windows, but it consumes considerably more RAM and has a much larger model size, making it unsuitable for deployment on resource-constrained IoT devices.
Decision Tree: This model attained a relatively high 92.6% accuracy at 1800 s. However, its recall was comparatively low at 60 s and 390 s, indicating a high false negative rate; the model may therefore leave ARP spoofing attacks undetected in smart systems.
KNN: This model consistently performed worse than the tree-based models on all metrics. Distance-based similarity in a high-dimensional feature space makes it ineffective for our application, and its recall did not exceed 80% even with larger time-windows, making it unsuitable for ARP spoofing detection.
The confusion matrices in Figure 2 further corroborate these conclusions. XGBoost and CatBoost maintain a reasonable trade-off between false positives and false negatives across various time-window settings.
6.2. Impact of Time-Window Selection
The selection of an appropriate time-window is a critical factor in ARP spoofing detection. Our results show:
Short time-windows (e.g., 60 s) captured transient anomalies, leading to higher recall but also increased false positives.
Long time-windows (e.g., 3000 s) resulted in smoother feature aggregation, improving precision but slightly reducing recall.
An optimal time-window (1200 s–1800 s) achieved the best balance, maximizing F1-score while minimizing misclassifications.
Notably, precision continues to increase beyond 1800 s, reaching 0.98 at 3000 s; however, this is accompanied by a drop in recall and a growing discrepancy between training and testing accuracy. This pattern reflects a form of over-aggregation, where excessively large time-windows compress multiple behaviors into a single statistical profile. The resulting loss of temporal granularity reduces the diversity of training samples and may induce overfitting-like behavior, particularly in detecting short-lived or bursty ARP spoofing events.
This suggests that an adaptive time-window mechanism may help the detector adjust dynamically to changing network conditions. However, consistent with the dataset's attack timesheet, time-windows longer than 1800 s offer diminishing returns and risk degrading detection robustness due to temporal over-aggregation.
6.3. Computational Efficiency Considerations
Because of the constraints on IoT devices, efficiency is a key factor in model selection.
Table 7, Figure 5 and Figure 6 compare all evaluated models with respect to training time, inference latency, model size, and RAM usage.
Training Time: Random Forest and CatBoost required longer training times, but since training is an offline process, this is an acceptable trade-off.
Inference Latency: Decision Tree exhibited the lowest inference latency, making it highly suitable for real-time deployment.
Model Size: Random Forest had the largest model size across all evaluated models; however, its size gradually decreased as the time-window increased, suggesting that deeper trees were pruned or simplified over time.
RAM Usage: All models showed an increase in RAM usage as the time-window expanded, eventually converging to similar levels. This reflects the growing volume of features and data being processed, which imposes constraints on edge deployment scenarios.
6.4. Comparison with Existing Research
Compared to traditional ARP spoofing detection approaches, our method presents several key advantages:
Most existing research has focused on detection without considering resource constraints. We explicitly evaluate models under single-core execution and report inference latency, RAM usage, and model size in addition to accuracy, precision, recall, and F1.
ACI-IoT-2023 originally lacked extracted ARP spoofing features. We reconstruct ARP-layer labels from PCAPs and introduce ARP-specific link-layer indicators, improving dataset usability for supervised learning and reproducible benchmarking.
We model ARP spoofing as a time-based process and choose the aggregation window to reflect typical spoofing persistence, turning window selection from ad hoc sweeps into an explainable, repeatable setting suitable for constrained devices.
The approach is lightweight, operating on link-layer headers and timing only, which complements infrastructure/heuristic defenses (e.g., DAI/SDN, entropy, or binding checks) where uniform controls or agents are unavailable.
These contributions mark a significant step toward practical ARP spoofing detection in IoT networks.
6.5. Limitations
Although the results are promising, our study has several limitations:
Dataset Scope: The ACI-IoT-2023 dataset was adopted to enhance ARP spoofing detection, but it may not cover all varieties of IoT network environments, and real-world adversarial conditions may affect the generalizability of the model. It is also likely that the dataset contains only a subset of possible ARP spoofing attack types. Future evaluation on larger, more diverse datasets would improve robustness.
Computational Constraints: The results show that machine-learning-based ARP spoofing detection is achievable under single-core execution with RAM usage below 512 MB, making it feasible for IoT applications. Inference latency varies across models, with algorithms such as KNN being notably heavier. CatBoost and XGBoost offer a good balance between efficiency and detection quality, though further optimization may be necessary for extreme edge devices.
Lack of Real-Time Adaptive Mechanisms: Our models rely on predefined time-windows for feature extraction, which might not adapt to changing network conditions. A more adaptive alternative could incorporate real-time anomaly detection or dynamically adjust the feature aggregation.
Security Issues: Attackers can attempt to evade detection by subtly perturbing network data to mislead the machine learning models. Improving resilience against such adversarial samples and investigating countermeasures is an important area for further research.
Data Reliability and Noise Sensitivity: Our windowed ARP features assume reasonable data quality, yet IoT traffic may include isolated/burst noise, drift, or dropouts that bias window statistics and classification. In LANs, device churn (join/leave), DHCP renewals, broadcast storms, port scans, varying host counts, or wireless interference can inflate ARP replies or unique MAC–IP mappings and trigger spurious requests. Related work [59,60,61] on data reliability assessment and analytics motivates this concern.
By acknowledging these limitations, we highlight the importance of further research into improving the adaptability, computational efficiency, and security of machine-learning-based ARP spoofing detection for IoT networks.
7. Conclusions
This study proposed a machine-learning-based framework for ARP spoofing detection in IoT networks. We addressed a critical limitation of ACI-IoT-2023, namely the lack of ARP-specific feature extraction and labels, by introducing a tailored feature-engineering pipeline and labeling procedure. Our empirical analysis leads to the following findings:
Effectiveness of Gradient Boosting: Across the evaluated time-windows, XGBoost and CatBoost consistently delivered the strongest accuracy, precision, recall, and F1-score, reflecting their ability to capture non-linear interactions among windowed ARP features. Confusion matrices indicate reduced false positives and false negatives compared with Decision Tree, Random Forest, and KNN.
Time-Window Selection Matters: Performance generally improves as the time window grows, but only up to a point. We observed diminishing returns and early signs of overfitting beyond roughly 1800 s. Small time-windows (e.g., 60 s) can increase recall at the expense of more false positives, whereas very large time-windows (e.g., 3000 s) may increase precision yet miss short-lived attacks. A balanced window in the range of 1200–1800 s offered the best trade-off between responsiveness and robustness in our setting.
Computational Trade-offs: Ensemble models (Random Forest, XGBoost, and CatBoost) improved detection but incurred higher memory footprints and larger model sizes than a single Decision Tree. Decision Tree remained the most lightweight with competitive accuracy, while KNN showed the slowest inference and the weakest classification performance, making it unsuitable for real-time detection.
Feasibility on Constrained Devices: All models were evaluated under a single-core CPU constraint. Peak RAM usage remained below 512 MB as measured in our emulated edge environment (QEMU, single-core ARM ≈ 1.0 GHz, 512 MB RAM, no GPU, OpenWrt). We use 512 MB as a practical benchmark for lightweight feasibility rather than a hard cap; models exceeding this footprint typically target higher-performance IoT gateways or edge nodes.
Advancing ARP Spoofing Research: The proposed ARP-oriented feature set, centered on time-windowed rates, inter-arrival statistics, MAC–IP inconsistency patterns, and reply counts, improves classification over traditional threshold-based tools (e.g., Arpwatch and ArpON), which often suffer from elevated false alarms due to fixed rules.
Enhancing ACI-IoT-2023: We augment ACI-IoT-2023 by deriving ARP-specific labels and engineered features (e.g., ARP reply count and rolling aggregates), thereby making the dataset more suitable for reproducible ARP spoofing research and comparative evaluation.
In summary, machine learning substantially strengthens ARP spoofing detection for IoT when models are evaluated jointly on detection quality and computational efficiency. Future work will focus on robustness against adversarial manipulation, online adaptation (e.g., drift-aware models), real-time mitigation integration, and validation across diverse IoT verticals and hardware profiles, including federated or privacy-preserving training at the edge.