1. Introduction
1.1. Background and Motivation
Rapid advances in Internet of Things (IoT) technologies have accelerated the digital transformation of healthcare and given rise to the Internet of Medical Things (IoMT). By integrating wearable sensors, implantable devices, patient monitoring systems, and various therapeutic tools over a network, the IoMT has begun to fundamentally transform healthcare delivery. This framework has significantly enhanced the effectiveness and operational efficiency of healthcare services by enabling critical applications such as remote patient monitoring, continuous health tracking, and advanced diagnostic mechanisms. Additionally, IoMT systems improve clinical decision-making through real-time data analysis and enable remote healthcare management [1]. As shown in Figure 1, the IoMT ecosystem consists of multiple interconnected layers, including sensors, edge/fog nodes, cloud platforms, and healthcare applications.
However, the rapid proliferation of IoMT systems has also brought about serious cybersecurity risks. A review of the literature and recent incidents reveals that, particularly during the pandemic, cyberattacks in the healthcare sector have increased significantly, with IoMT devices becoming one of the primary targets of these attacks.
Because IoMT data is highly sensitive and directly tied to patient safety and human life, security breaches are far more critical than in traditional IT systems. Data breaches, device manipulation, and service disruptions not only threaten data privacy but can also cause system failures that interrupt critical healthcare services [2].
One of the key factors complicating the security of IoMT systems is their highly heterogeneous, multi-protocol architecture. Devices from different manufacturers communicate over protocols such as TCP/IP, MQTT, and Bluetooth, each with its own data structures, traffic behaviors, and security vulnerabilities [3]. The result is large-scale, complex network data in which cyber threats are particularly difficult to detect. Additionally, the limited processing power, memory, and energy capacity of IoMT devices restrict the implementation of traditional security mechanisms, making these systems more vulnerable to cyberattacks [4].
In this context, intrusion detection systems (IDSs) play a critical role in detecting and preventing cyber threats in IoMT environments. IDSs continuously monitor network traffic to detect abnormal behavior and provide an early warning mechanism against potential attacks [5]. In particular, machine learning and deep learning-based IDS approaches offer significant advantages due to their ability to analyze patterns in complex, high-dimensional network data. However, current studies share a significant shortcoming: many of these approaches fail to adequately represent the multidimensional and multi-source nature of network traffic, which can limit model performance [6].
In particular, the “Data Distribution Shift” problem demonstrates that structural differences between IoT and IoMT environments directly impact IDS performance. Experimental studies in the literature show that models trained on general IoT datasets experience significant performance drops when tested in IoMT environments. This clearly highlights the necessity of IoMT-specific datasets and domain-specific modeling approaches [7].
Furthermore, the real-time data flow, low-latency requirements, and high criticality of IoMT environments make the adequacy of existing IDS solutions even more questionable. Issues such as high false alarm rates, limited explainability, and a lack of adaptability to new attack types reduce the reliability of these systems in clinical settings [8]. However, recent studies suggest that deep learning models based on multi-scale feature extraction and attention mechanisms may represent complex IoMT traffic structures more effectively.
In conclusion, ensuring security in IoMT ecosystems is not merely a technical requirement but a critical necessity directly linked to patient safety. Therefore, the development of next-generation IDS solutions that account for the multi-protocol and heterogeneous nature of IoMT environments and possess high accuracy, low latency, and strong generalization capabilities is of great importance. In particular, advanced deep learning architectures, domain-specific datasets, and effective data representation techniques are emerging as key determinants in the construction of future secure and sustainable healthcare systems.
1.2. Problem Definition
Cyberattack detection in IoMT-based healthcare systems is a complex and multifaceted problem, owing to the heterogeneous architecture, high-volume data flows, and critical patient-safety requirements of these systems [9,10]. The diverse devices, communication protocols, and cloud integrations within the IoMT architecture expand the attack surface, which increases the diversity of threats and makes detection more difficult. Therefore, attack detection must be approached differently from the classical anomaly detection methods used in information systems [11]. It should be treated as a learning problem that requires reliable decisions based on multi-source, real-time data. Because of the architecture and heterogeneous nature of IoMT systems, attack detection is inherently a multi-class problem [12,13]. However, many studies in the literature address it as binary classification and thus fail to reflect the true diversity of attacks. Additionally, the diversity of attack types and their constantly evolving nature limit the effectiveness of traditional methods [14].
Another significant challenge is data imbalance. The underrepresentation of rare but critical attacks prevents models from adequately learning these threats, leading to low sensitivity. Additionally, the lack of clear boundaries between normal and abnormal behavior further complicates anomaly detection.
The heterogeneous and high-dimensional nature of IoMT data complicates feature extraction and representation learning [15]. Furthermore, the large volume of data increases computational costs, while real-time requirements necessitate the design of low-latency and efficient systems. Otherwise, patient safety could be directly compromised [14,16].
Finally, data distribution shifts and varying system configurations limit the generalization ability of models. When all these challenges are considered together, it becomes evident that current IDS approaches are insufficient. Therefore, there is a clear need for next-generation solutions tailored to IoMT environments that are low-latency, high-accuracy, and generalizable.
1.3. Limitations of Existing Approaches
Although deep learning-based IDS methods achieve high accuracy rates, they face significant limitations, such as class imbalance, variations in data distribution, and high computational costs, which restrict their applicability in real-time IoMT environments. The literature on machine learning and deep learning-based attack detection in IoMT systems shows that existing approaches do not fully address the unique dynamics and constraints of the IoMT ecosystem. Consequently, they remain insufficient for real-time, heterogeneous, and resource-constrained IoMT environments [17]. These limitations fall into four key groups:
Limitations of Single-Scale Models: Most studies in the literature analyze network traffic over a fixed time window, typically using standard Convolutional Neural Network (CNN) or Recurrent Neural Network (RNN) architectures that operate at a single feature scale [18]. However, IoMT traffic exhibits a multi-scale structure, ranging from sensor-based instantaneous changes to long-term protocol interactions [5]. Single-scale models are insufficient for simultaneously capturing these hierarchical features and attack patterns spanning different time scales [13].
Data Dependency of Attention Models: Attention mechanisms used to identify complex attack types typically require large-scale, balanced datasets. In IoMT environments, however, the “class imbalance” problem, where attack data is very scarce compared to normal traffic, is a constant challenge [19]. This makes it difficult for attention models to focus on rare but critical attack signals. Instead, models overfit to the majority class, leading to a decline in anomaly detection performance [12,13].
Computational Cost of Transformer-Based Methods: Transformer architectures have high model complexity and dense parameter structures, which makes them poorly matched to the limited processing power and memory capacity of IoMT devices. Their computational cost, particularly during training, is high, which conflicts with the low-latency, real-time detection requirements of the IoMT. Additionally, they rapidly lose their ability to generalize when trained on limited domain data [20].
Protocol Heterogeneity and Data Shift: Existing IDS solutions are typically limited to specific protocols and cannot account for the hybrid nature of multi-protocol traffic (e.g., TCP/IP, MQTT, and Bluetooth) in IoMT networks. This issue is compounded by the “Data Distribution Shift” problem. Together, they lead to unexpected performance drops in real-world clinical deployments, even for models that perform well in laboratory environments [20,21].
In summary, low-computational-cost architectures are required for effective attack detection. Furthermore, these architectures must be resilient to data skew. In addition, structures capable of representing the multidimensional nature of traffic are gaining importance. However, the lack of architectures with multi-scale attention mechanisms stands out as a fundamental research gap in IoMT security.
1.4. Proposed Approach Overview
This experimental study proposes an integrated approach that combines multi-scale feature extraction, attention mechanisms, and a low-computational-cost architectural design. The MSCA-Net model developed in this study offers a more effective solution to the challenges specific to IoMT environments. The proposed model uses multi-scale depthwise separable convolutional layers to capture patterns in network traffic across different time scales, which enables hierarchical feature extraction. The resulting multi-scale features are then adaptively weighted by a squeeze-and-excitation-based attention mechanism, so that the most discriminative features are emphasized.
Furthermore, a lightweight LSTM-based sequential learning component has been integrated to reduce the model’s computational cost and meet real-time requirements. Thanks to this integrated architecture, MSCA-Net combines multi-scale representation learning, an attention mechanism, and computational efficiency. As a result, it delivers high accuracy, low latency, and strong generalization performance in IoMT environments.
1.5. Key Contributions
This study addresses the problem of multi-class attack detection in IoMT environments and offers the following original contributions:
Comprehensive Literature Review and Identification of Research Gaps: Existing studies on IoMT-based attack detection have been comprehensively examined. In particular, the inadequacy of single-scale models and the data dependency of attention-based methods have been highlighted. Additionally, the practical applicability issues of high-cost architectures have been emphasized. This analysis clearly demonstrates the need for a multi-scale, attention-based, and efficient architecture.
Next-Generation Multi-Scale Attention-Based Architecture (MSCA-Net): The proposed model integrates multi-scale depthwise separable convolutions, a channel attention mechanism, and a lightweight Long Short-Term Memory (LSTM) component. It introduces a unique deep learning architecture capable of effectively learning the hierarchical structure of IoMT traffic.
Balancing Computational Efficiency with High Performance: The use of depthwise separable convolutions significantly reduces the model’s parameter count and computational cost. Nevertheless, the model maintains high accuracy and strong representational capacity, making it suitable for resource-constrained IoMT environments.
Comprehensive and Realistic Experimental Validation: The proposed approach has been tested in multi-class scenarios using two state-of-the-art IoMT datasets with distinct data characteristics. It has been demonstrated to deliver robust and consistent performance under both large-scale and limited data conditions.
Applicability for Real-Time and Distributed Environments: Thanks to low latency, fast training time, and high generalization capability, the proposed model delivers strong performance. Accordingly, it has been demonstrated that the model offers a viable solution for real-time attack detection on IoMT gateways and edge devices.
1.6. Paper Organization
This article is organized as follows: Section 1 presents the background, motivation, problem definition, limitations of existing approaches, and the main contributions of this study. Section 2 provides a comprehensive review of the literature on cyberattack detection in IoMT environments and identifies the existing research gaps. Section 3 describes the datasets, feature engineering and preprocessing procedures, the proposed MSCA-Net architecture, training configuration, and evaluation metrics. Section 4 presents the experimental results, comparative performance analyses, and discussion of the findings. Finally, Section 5 concludes the paper by summarizing the main outcomes of the study and outlining directions for future research.
2. Related Work
2.1. Traditional Intrusion Detection Systems
Traditional intrusion detection systems have long served as the primary security mechanism in IoMT environments. These systems are primarily categorized into signature-based and anomaly-based approaches, which are also frequently combined in hybrid configurations (see Table 1).
While traditional signature-based methods are effective against known threats, they are completely inadequate against the heterogeneous nature of IoMT and rapidly evolving zero-day attacks. Anomaly-based methods can theoretically detect new attacks. However, they lose their reliability in practice due to high false positive rates, class imbalance, and data distribution skew. While hybrid approaches attempt to combine these two methods, they fall short in meeting the computational costs and real-time requirements of the IoMT. In summary, classical IDS approaches are inadequate in the face of IoMT’s massive traffic patterns, protocol heterogeneity, and complex attack types.
2.2. Deep Learning-Based Intrusion Detection
Deep learning-based intrusion detection systems automatically extract meaningful features from high-volume, complex network traffic in IoMT environments. This offers significant advantages. These approaches primarily rely on CNN, LSTM, and Gated Recurrent Unit (GRU) models. CNNs effectively capture local traffic patterns, such as packet and flow characteristics. LSTM and GRU models, on the other hand, can model sequential attack behaviors by learning temporal dependencies. Hybrid architectures such as CNN+LSTM and BiLSTM/BiGRU process both spatial and temporal information together to deliver higher performance.
These methods offer the advantages of high accuracy, automatic feature extraction, and the ability to detect complex attack types. However, class imbalance in IoMT datasets makes it difficult to learn rare attacks and can lead to overfitting. Additionally, due to data distribution shifts, models’ generalization performance may decline in different environments. The high computational cost, meanwhile, limits real-time usage on resource-constrained IoMT devices. These issues become particularly pronounced in scenarios involving large-scale traffic structures and protocol diversity.
Table 2 below summarizes deep learning-based IDS approaches in the literature, along with their advantages and limitations specific to the IoMT.
Compared to traditional machine learning methods, deep learning-based approaches offer higher accuracy. However, they cannot fully adapt to the heterogeneous, real-time, and resource-constrained nature of the IoMT. In particular, multi-scale feature extraction, attention mechanisms resilient to class imbalance, and low computational cost requirements are of critical importance. If these requirements are not met, generalization performance remains limited.
2.3. Multi-Scale, Hybrid, and Attention/Transformer-Based Architectural Approaches
Multi-scale and hybrid architectures achieve significant success by jointly modeling both the local and global characteristics of IoMT traffic. Multi-scale CNN architectures capture short-term local patterns using different kernel sizes, while models such as LSTM/GRU learn long-term temporal dependencies. Hybrid architectures (CNN + RNN/Transformer), on the other hand, represent both spatial and sequential features together.
Attention mechanisms and Transformer-based approaches enable the model to focus on distinctive features. They also allow the model to learn long-range dependencies. As a result, CNN–Transformer and attention-based hybrid models achieve high accuracy by combining multi-scale feature extraction with temporal modeling. However, these approaches are limited in resource-constrained IoMT environments due to high computational costs, a large number of parameters, and data requirements. This poses a constraint for real-time applications. Furthermore, a static architecture, class imbalance, and data distribution shifts can reduce generalization performance.
Table 3 summarizes the relevant approaches in the literature.
These approaches successfully address the complex nature of IoMT traffic, which is large-scale, heterogeneous, and time-varying. However, they are not fully suited to real-time IoMT deployment because of their high computational costs, their non-adaptive scaling, and the resource constraints of IoMT devices.
2.4. Challenges in IoMT Intrusion Detection
IoMT intrusion detection systems differ significantly from standard IoT networks. These systems face critical challenges specific to the healthcare environment. Because patient safety is directly affected, these challenges are not only technical but also ethical and of vital importance. The challenges faced by IoMT intrusion detection systems are outlined below:
Data heterogeneity: Different medical devices, such as sensors, implants, and infusion pumps, along with communication protocols, like MQTT, CoAP, BLE, and Wi-Fi, and feature types, such as numerical, categorical, and temporal, give rise to a complex structure. This situation significantly complicates the integrated feature extraction and modeling process.
Low sample count and class imbalance: The extreme scarcity of attack samples and the insufficient representation of rare attack types, such as zero-day and multi-stage attacks, limit the model’s generalization ability. This leads to high false negative rates in classical ML/DL approaches.
Real-time performance: Limited computational power, memory, and energy resources, which conflict with the requirement for millisecond-level low latency and instant detection, pose a significant constraint. This presents a major obstacle, particularly on edge devices.
Reliability: Since false negatives directly threaten patient lives, there is a requirement for high accuracy and a minimum false negative rate. This also necessitates explainability.
Resource constraints: Low power, memory, and processing capacity prevent deep models from operating in real time.
Table 4 summarizes these key challenges specific to the IoMT, their impacts, and typical solution approaches in the literature.
2.5. Research Gap and Positioning of This Study
The current IoMT attack detection approaches exhibit three fundamental limitations:
Single-scale or static multi-scale models: They cannot simultaneously represent the hierarchical structure of IoMT traffic, which spans from the packet level to the session level. CNN, LSTM/GRU, and hybrid architectures are typically limited by fixed kernel sizes or a single time scale [40,43].
Attention and Transformer-based methods: While they demonstrate high performance on large and balanced datasets, they exhibit overfitting on rare attacks due to the class imbalance common in IoMT environments. Additionally, they are not suitable for edge devices due to their high computational cost [31,41].
Lightweight models: Although they reduce computational cost, they cannot provide consistent generalization performance under protocol heterogeneity and data distribution shifts. These limitations conflict with the fundamental requirements of IoMT systems.
In particular, significant incompatibilities arise regarding low latency, high sensitivity, and operability in resource-constrained environments.
This experimental study directly addresses these gaps. The proposed MSCA-Net model combines multi-scale depthwise separable convolutions (k ∈ {3, 7, 15}), a squeeze-and-excitation-based channel attention mechanism, and a lightweight unidirectional LSTM component into a single, parameter-efficient architecture.
This integrated architecture offers three key advantages:
Multi-scale convolutions simultaneously capture local and global patterns at different temporal scales.
SE attention provides robustness against class imbalance by dynamically weighting discriminative features.
The lightweight LSTM component eliminates the high computational cost of Transformer-based approaches, enabling real-time processing.
The experimental results demonstrate that MSCA-Net achieves high performance on both large-scale multi-protocol datasets, such as CICIoMT2024, and small, imbalanced datasets, like WUSTL-EHMS-2020. Additionally, the model achieves the best average ranking (2.0).
In conclusion, the proposed model addresses the gap in the literature by balancing multi-scale representation, attention mechanisms, and computational efficiency. As a result, it offers a low-latency, generalizable, and practical IDS solution for IoMT environments.
3. Materials and Methods
3.1. Data
This study evaluates all models on two publicly available IoMT intrusion detection benchmarks. Together, they cover contrasting data regimes: a compact biometric-enriched testbed and a large-scale multi-protocol benchmark.
To ensure the validity of the reported results, the following data leakage prevention measures were applied: (1) StandardScaler was fitted exclusively on the training partition and applied without refitting to the test partition; (2) sliding window labels were assigned from the final time step only; (3) the pre-split provided by the dataset authors was used directly with no re-splitting across the temporal boundary; and (4) identifier columns, including IP and MAC addresses, were explicitly excluded before training. Additionally, a hash-based deduplication step was applied to remove near-duplicate flows from the CICIoMT2024 dataset prior to training.
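The hash-based deduplication step mentioned above can be sketched as follows. This is a minimal illustration rather than the procedure actually used in the study: the function name `deduplicate_rows`, the choice of SHA-256, and the rounding precision `decimals` (used here to treat numerically near-identical flows as duplicates) are all assumptions.

```python
import hashlib

import numpy as np


def deduplicate_rows(X, y, decimals=6):
    """Keep the first occurrence of each (rounded feature row, label) pair.

    `decimals` is a hypothetical tolerance: rows that agree after rounding
    are treated as near-duplicates.
    """
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y)
    seen = set()
    keep = []
    for idx, (row, label) in enumerate(zip(X, y)):
        # Adding 0.0 turns -0.0 into +0.0 so both hash to the same bytes.
        rounded = np.round(row, decimals) + 0.0
        key = rounded.tobytes() + str(label).encode("utf-8")
        digest = hashlib.sha256(key).hexdigest()
        if digest not in seen:
            seen.add(digest)
            keep.append(idx)
    keep = np.asarray(keep, dtype=int)
    return X[keep], y[keep]


# Toy flows: the second row differs from the first only by numeric noise.
X_demo = np.array([[1.0, 2.0], [1.0, 2.0000000001], [3.0, 4.0]])
y_demo = np.array(["benign", "benign", "DoS"])
X_unique, y_unique = deduplicate_rows(X_demo, y_demo)
```

Including the label in the hash key keeps rows with identical features but different labels; whether that is desirable depends on how conflicting labels should be handled.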
3.1.1. WUSTL-EHMS-2020
This study uses the WUSTL-EHMS-2020 dataset, a publicly available benchmark for intrusion detection in healthcare Internet-of-Things (IoT) environments released by Washington University in St. Louis [44]. The dataset was collected from an emulated hospital network incorporating diverse connected medical devices and simulating realistic electronic health-monitoring traffic. It contains labeled network-flow records with 44 features (35 network flow metrics + 8 biometric features + 1 binary label), making it well-suited for evaluating multi-class anomaly detection models in safety-critical IoT contexts.
Each record is annotated with an attack category label drawn from the set $\{\text{Normal}, \text{Spoofing}, \text{Data Alteration}\}$. Spoofing attacks impersonate legitimate devices on the network, whereas Data Alteration attacks modify the payload content of health-monitoring messages. The dataset exhibits a natural class imbalance, with normal traffic constituting the dominant class. Let $N_c$ denote the number of samples in class $c$; the imbalance ratio is defined as:
$$\mathrm{IR} = \frac{\max_{c} N_c}{\min_{c} N_c}$$
This imbalance is explicitly addressed during training through class-weighted loss (see Section 3.6).
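As an illustration of how class imbalance can be quantified and compensated, the sketch below computes an imbalance ratio and a set of class weights. It rests on two assumptions that are not taken from this section: the ratio is taken as the largest class count divided by the smallest, and the weights follow the common inverse-frequency rule $w_c = N / (C \cdot N_c)$. The weighting actually used is the one described in Section 3.6, and the label counts below are toy values, not the real dataset distribution.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Largest class count divided by smallest class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def inverse_frequency_weights(labels):
    """Weight w_c = N / (C * N_c); rarer classes receive larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy label list for illustration only.
toy_labels = ["Normal"] * 8 + ["Spoofing"] * 1 + ["Data Alteration"] * 1
toy_ratio = imbalance_ratio(toy_labels)
toy_weights = inverse_frequency_weights(toy_labels)
```

Weights of this form can be passed to a weighted cross-entropy loss so that errors on rare attack classes contribute more to the gradient.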
3.1.2. CICIoMT2024
The CICIoMT2024 dataset was produced by the Canadian Institute for Cybersecurity at the University of New Brunswick, Fredericton, New Brunswick, Canada, and is the most comprehensive public IoMT security benchmark to date [45]. The dataset was collected from a testbed of 40 IoMT devices (25 real devices and 15 simulated devices) using three healthcare protocols: Wi-Fi, MQTT, and Bluetooth. A dedicated network tap provided real-time packet duplication between the switch and Wi-Fi/MQTT devices, while a smartphone and an Ubertooth One sniffer captured BLE traffic; a Faraday cage was used during BLE experiments to ensure data fidelity.
Classification levels: The dataset supports three hierarchical classification tasks (binary, 6-class, and 19-class). This study uses the 6-class and 19-class classification settings, which are defined as follows:
The full 19-class label set consists of: benign; ARP Spoofing; Recon Ping Sweep, Recon VulScan, Recon OS Scan, Recon Port Scan; MQTT Malformed Data, MQTT DoS Connect Flood, MQTT DDoS Connect Flood, MQTT DoS Publish Flood, MQTT DDoS Publish Flood; DoS TCP, DoS ICMP, DoS SYN, DoS UDP; DDoS TCP, DDoS ICMP, DDoS SYN, and DDoS UDP. These 18 attack subtypes, grouped by category, are summarized in Table 5 below.
Features and size: The dataset is provided in tabular CSV format with 45 features per record, pre-split into training and test partitions. The Wi-Fi/MQTT partition contains approximately 4.89 million flow records (~377k training, ~98k test), which this study uses directly without re-splitting. MQTT and benign traffic make up the bulk of the data; within DDoS traffic, UDP and ICMP floods constitute the majority of instances. The macro-imbalance ratio across all 19 classes, $\mathrm{IR}_{19} = \max_{c} N_c / \min_{c} N_c$ with $N_c$ the number of samples in class $c$, is substantially larger than in WUSTL-EHMS-2020, making CICIoMT2024 a more demanding benchmark from both a class imbalance and classification granularity perspective. Distinguishing Recon and Spoofing attacks from benign traffic and separating MQTT DoS from DDoS subtypes are particularly challenging tasks due to the similarity of their underlying traffic patterns [46].
For completeness, Table 6 below contrasts the label structure of both datasets used in this study side by side.
3.2. Feature Engineering and Preprocessing
A unified preprocessing pipeline was applied to both datasets prior to model training. While the pipeline steps are identical, the specific columns excluded differ between the two datasets owing to their different collection methodologies and feature compositions.
3.2.1. Column Exclusion and Feature Selection
WUSTL-EHMS-2020: Several columns were excluded to eliminate ground-truth leakage and non-informative identifiers. The binary Label column was removed first, as it directly encodes the target class and would constitute data leakage if retained as a feature. The following additional columns were also dropped: directional and flag fields (Dir, Flgs), source and destination IP addresses (SrcAddr, DstAddr), MAC addresses (SrcMac, DstMac), and the mixed-type port field Sport. After exclusion, all remaining numeric columns were retained as the per-record feature vector.
CICIoMT2024: The dataset is provided as pre-extracted numeric flow features with no IP address, MAC address, or identifier columns. The target label column (Label for 6-class, or type for 19-class) was separated from the feature matrix prior to training. All remaining 45 numeric flow features were retained without further exclusion, giving 45 features per record.
3.2.2. Missing Value and Infinity Imputation
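Both datasets were inspected for missing and non-finite values. Missing entries (NaN) and infinite values ($\pm\infty$), which can arise from division by zero in flow rate calculations [47], were replaced with zero prior to any scaling:
$$x_{i,j} \leftarrow \begin{cases} 0, & \text{if } x_{i,j} \in \{\text{NaN}, +\infty, -\infty\} \\ x_{i,j}, & \text{otherwise} \end{cases}$$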
3.2.3. Standardization
Features were standardized to zero mean and unit variance using StandardScaler [48]. For feature $j$ and sample $i$, the transformation is:
$$z_{i,j} = \frac{x_{i,j} - \mu_j}{\sigma_j}$$
where $\mu_j$ and $\sigma_j$ are the empirical mean and standard deviation of feature $j$, computed exclusively on the training partition. The fitted scaler was then applied without refitting to the test partition, strictly preventing any form of data leakage from the test set into the normalization statistics.
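The imputation and train-only standardization steps can be sketched in NumPy as follows. This is an illustrative re-implementation, not the study's code: the helper names `impute_non_finite`, `fit_standardizer`, and `apply_standardizer` are hypothetical, and the toy arrays are made up. The population standard deviation and the fallback of 1.0 for constant features mirror the behavior of scikit-learn's `StandardScaler`.

```python
import numpy as np


def impute_non_finite(X):
    """Replace NaN, +inf and -inf entries with 0.0."""
    X = np.asarray(X, dtype=np.float64)
    return np.where(np.isfinite(X), X, 0.0)


def fit_standardizer(X_train):
    """Estimate per-feature mean and standard deviation on training data only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)  # population std (ddof=0)
    # Constant features have sigma == 0; use 1.0 so they map to zero.
    sigma = np.where(sigma == 0.0, 1.0, sigma)
    return mu, sigma


def apply_standardizer(X, mu, sigma):
    """Apply previously fitted statistics without refitting."""
    return (X - mu) / sigma


# Toy data: the infinite entry is imputed to 0 before statistics are computed.
X_train = impute_non_finite(np.array([[1.0, 10.0], [3.0, np.inf], [5.0, 20.0]]))
X_test = impute_non_finite(np.array([[3.0, 10.0]]))

mu, sigma = fit_standardizer(X_train)
Z_train = apply_standardizer(X_train, mu, sigma)
Z_test = apply_standardizer(X_test, mu, sigma)  # reuses training statistics
```

Because the test row happens to equal the training means, it maps to zero in both features, which makes the reuse of training statistics easy to verify by hand.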
3.2.4. Outlier Clipping
Network flow features in both datasets frequently exhibit heavy-tailed distributions, where legitimate extreme values and recording artefacts co-exist. To suppress the undue influence of such outliers on gradient-based optimization, all standardized feature values $z_{i,j}$ were clipped to the interval $[z_{\min}, z_{\max}]$:
$$\tilde{z}_{i,j} = \min\bigl(\max(z_{i,j},\, z_{\min}),\, z_{\max}\bigr)$$
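A clipping step of this kind is a one-liner in NumPy. In the sketch below, the bounds of -5 and +5 are hypothetical placeholders chosen only for illustration; they are not the interval used in the study.

```python
import numpy as np


def clip_standardized(Z, lower=-5.0, upper=5.0):
    """Clip standardized values to [lower, upper].

    The default bounds are illustrative assumptions, not the paper's values.
    """
    return np.clip(np.asarray(Z, dtype=np.float64), lower, upper)


# Toy standardized values with two extreme entries.
Z_demo = np.array([[-12.0, 0.3], [4.9, 27.5]])
Z_clipped = clip_standardized(Z_demo)
```

Values inside the interval pass through unchanged, so clipping after standardization only affects the tails.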
3.2.5. Label Encoding
For both datasets, string class labels were mapped to contiguous integer indices using LabelEncoder. For WUSTL-EHMS-2020, the encoder was fitted to the fixed vocabulary $\{\text{Normal}, \text{Spoofing}, \text{Data Alteration}\}$. For CICIoMT2024, separate encoders were fitted for the 6-class setting ($C = 6$) and the 19-class setting ($C = 19$), respectively. In all cases, the encoder mapping was derived from the training partition only and applied consistently to the test partition.
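The label mapping can be illustrated without any library dependency. The sketch below imitates the sorted-class ordering that scikit-learn's `LabelEncoder` produces; the helper names `fit_label_mapping` and `encode_labels` are hypothetical, and the label lists are toy examples.

```python
def fit_label_mapping(train_labels):
    """Map each distinct training label to a contiguous index (sorted order)."""
    classes = sorted(set(train_labels))
    return {name: idx for idx, name in enumerate(classes)}


def encode_labels(labels, mapping):
    """Encode labels with a fitted mapping; unseen labels raise KeyError."""
    return [mapping[name] for name in labels]


# Toy label lists for illustration only.
train_labels = ["Normal", "Spoofing", "Data Alteration", "Normal"]
test_labels = ["Spoofing", "Normal"]

label_mapping = fit_label_mapping(train_labels)   # derived from training data only
y_train_idx = encode_labels(train_labels, label_mapping)
y_test_idx = encode_labels(test_labels, label_mapping)
```

Raising an error on unseen test labels is deliberate here: it surfaces a mismatch between partitions instead of silently assigning a new index.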
3.3. Data Splitting and Sequence Construction
For WUSTL-EHMS-2020, the preprocessed dataset was partitioned into a training set (80%) and a test set (20%) using stratified random sampling with a fixed random seed of 42, ensuring that the proportion of each class was preserved across both partitions [49]. CICIoMT2024 retained the training and test partitions provided by its authors (Section 3.1.2).
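The idea behind a stratified 80/20 split can be sketched in plain NumPy. This is a simplified stand-in for a library routine such as scikit-learn's `train_test_split` with `stratify`, not a reproduction of it: the function name `stratified_split`, the per-class rounding rule, and the toy label vector are assumptions.

```python
import numpy as np


def stratified_split(y, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        rng.shuffle(idx)  # shuffle indices within the class
        n_test = int(round(test_fraction * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return (np.sort(np.array(train_idx, dtype=int)),
            np.sort(np.array(test_idx, dtype=int)))


# Toy imbalanced labels: 80 / 15 / 5 samples per class.
y_demo = np.array([0] * 80 + [1] * 15 + [2] * 5)
train_idx, test_idx = stratified_split(y_demo)
test_counts = np.bincount(y_demo[test_idx]).tolist()
```

Splitting per class guarantees that even the rarest class appears in both partitions whenever it has enough samples.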
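To enable temporal modeling, flat record arrays were converted into overlapping fixed-length sequences using a sliding window. Given the full feature matrix $\mathbf{X} \in \mathbb{R}^{N \times F}$ (with $N$ records and $F$ features) and label vector $\mathbf{y}$ of length $N$, the $i$-th sequence and its label are defined as:
$$\mathbf{X}^{(i)} = \bigl[\mathbf{x}_{iS+1}, \mathbf{x}_{iS+2}, \dots, \mathbf{x}_{iS+T}\bigr], \qquad y^{(i)} = y_{iS+T}, \qquad i = 0, 1, \dots, \left\lfloor \frac{N-T}{S} \right\rfloor$$
where $T$ is the window length and $S$ is the stride. The label of each sequence is assigned from its final time step, following the standard convention for sequence-based intrusion detection. This procedure yielded the training sequences $\mathbf{X}_{\text{train}}$ and test sequences $\mathbf{X}_{\text{test}}$, both of shape $(n, T, F)$, where $n$ is the number of sequences in the respective partition.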
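The sliding-window construction can be sketched as below. The window length of 15 matches the sequence length T = 15 quoted in Section 3.4; the stride of 5, the function name `make_sequences`, and the toy arrays are assumptions made for illustration.

```python
import numpy as np


def make_sequences(X, y, window, stride):
    """Build overlapping windows; each window is labeled by its last time step."""
    X = np.asarray(X)
    y = np.asarray(y)
    n_records = X.shape[0]
    if n_records < window:
        # Not enough records for a single window.
        return np.empty((0, window, X.shape[1]), dtype=X.dtype), y[:0]
    starts = np.arange(0, n_records - window + 1, stride)
    sequences = np.stack([X[s:s + window] for s in starts])
    labels = y[starts + window - 1]  # label of the final time step
    return sequences, labels


# Toy data: 20 records with 2 features; labels equal the record index.
X_flat = np.arange(40, dtype=np.float64).reshape(20, 2)
y_flat = np.arange(20)
seqs, seq_labels = make_sequences(X_flat, y_flat, window=15, stride=5)
```

With 20 records, a window of 15, and a stride of 5, only the start offsets 0 and 5 fit, so two sequences are produced and their labels come from records 14 and 19.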
3.4. Proposed Model: Multi-Scale Depthwise Attention Network (MSCA-Net)
We propose the Multi-Scale Depthwise Attention Network (MSCA-Net), a novel architecture (see Figure 2) designed for multi-class intrusion detection in healthcare IoT networks. The architecture integrates three complementary inductive biases (multi-scale depthwise separable convolution, squeeze-and-excitation (SE) channel attention, and lightweight recurrent sequence modeling) into a single parameter-efficient pipeline (see Algorithm 1). The kernel sizes $k \in \{3, 7, 15\}$ in the three parallel depthwise separable towers are deliberately selected to capture packet-level, burst-level, and session-level temporal patterns in IoMT network flows, respectively. Unlike architectures that apply attention within individual branches, the SE attention block in MSCA-Net is placed after multi-scale concatenation, enabling dynamic cross-scale feature weighting on a per-sample basis. A lightweight unidirectional LSTM is preferred over a Transformer encoder because self-attention has a memory cost quadratic in sequence length, which is difficult to justify for the short sequences ($T = 15$) typical of IoMT flow data. The complete forward pass is:
$$\hat{\mathbf{y}} = \mathrm{Head}\Bigl(\mathrm{LSTM}\bigl(\mathrm{SE}\bigl(\mathrm{MSDepSepExtractor}(\mathbf{X})\bigr)\bigr)\Bigr)$$
3.4.1. Depthwise Separable Convolution
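Standard convolution applies $C_{\mathrm{out}}$ filters of size $k \times C_{\mathrm{in}}$ to an input of shape $C_{\mathrm{in}} \times T$, incurring a parameter cost of $k \cdot C_{\mathrm{in}} \cdot C_{\mathrm{out}}$. Depthwise separable convolution [50] factorises this into a depthwise step (one filter per input channel) followed by a pointwise $1 \times 1$ convolution:
$$\mathrm{DSConv}(X) = \mathrm{PW}_{1 \times 1}\big(\mathrm{DW}_{k}(X)\big)$$
The parameter cost reduces to $k \cdot C_{\mathrm{in}} + C_{\mathrm{in}} \cdot C_{\mathrm{out}}$, yielding a reduction factor of approximately $k$ (for $C_{\mathrm{out}} \gg k$) relative to standard convolution. This is essential for maintaining efficiency on WUSTL-EHMS-2020’s small training set of approximately 2550 sequences.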
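As a quick check of the parameter counts above, the following sketch compares the two costs for the three kernel sizes. The channel width of 64 is an illustrative assumption and biases are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard 1D convolution (bias ignored)."""
    return k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise (k per input channel) plus pointwise 1x1 weights."""
    return k * c_in + c_in * c_out

for k in (3, 7, 15):
    std = standard_conv_params(64, 64, k)
    sep = depthwise_separable_params(64, 64, k)
    print(f"k={k}: standard={std}, separable={sep}, ratio={std / sep:.2f}")
```

The exact ratio is k·C_out / (k + C_out), so it approaches k only when the output channel count is much larger than the kernel size; at 64 channels the saving for k = 15 is closer to 12× than 15×.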
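| Algorithm 1. Forward pass of the Multi-Scale Depthwise Attention Network (MSCA-Net) |
| Input: $X \in \mathbb{R}^{B \times T \times F}$, a batch of $B$ network-flow sequences, each of length $T$ time steps with $F$ features |
| Output: $\hat{Y} \in \mathbb{R}^{B \times C}$, the logit matrix for $C$ attack categories |
| Parameters: kernel sizes $\mathcal{K} = \{3, 7, 15\}$, channels per tower $C_t$, SE reduction ratio $r$, LSTM hidden size $H$, dropout rate $p$ |
| Stage 1: Multi-Scale Depthwise Separable Feature Extraction |
| for each $k \in \mathcal{K}$ do in parallel |
| $Z_k \leftarrow \mathrm{DepSepTower}_k(X)$ ▹ Two residual blocks of (depthwise conv→pointwise conv→BN→ReLU) with 1 × 1 shortcut |
| end for |
| $Z \leftarrow \mathrm{Concat}(Z_3, Z_7, Z_{15})$ ▹ $Z \in \mathbb{R}^{B \times 3C_t \times T}$ |
| Stage 2: Squeeze-and-Excitation Channel Attention |
| $z \leftarrow \frac{1}{T} \sum_{t} Z_{:,:,t}$ ▹ $z \in \mathbb{R}^{B \times 3C_t}$ (squeeze over time) |
| $a \leftarrow \sigma\big(W_2\, \mathrm{ReLU}(W_1 z)\big)$ ▹ Excitation: $a \in (0, 1)^{B \times 3C_t}$ |
| $\tilde{Z} \leftarrow Z \odot a$ ▹ $a$ broadcast along the time axis |
| Stage 3: Temporal Sequence Modelling |
| $(h_1, \dots, h_T) \leftarrow \mathrm{LSTM}(\tilde{Z})$ ▹ Unidirectional single-layer LSTM |
| $h \leftarrow h_T$ ▹ $h \in \mathbb{R}^{B \times H}$ (final hidden state only) |
| Stage 4: Classification Head |
| $\hat{Y} \leftarrow \mathrm{Head}(h)$ |
| Return $\hat{Y}$ |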
3.4.2. Multi-Scale Depthwise Separable Feature Extractor
The first stage, MSDepSepExtractor, consists of three parallel DepSepTower modules at kernel sizes $k \in \{3, 7, 15\}$. Each tower applies two sequential depthwise separable residual blocks. For a tower with kernel size $k$ and output channel count $C_t$, the residual block output is:
$$Y = \mathrm{ReLU}\big(\mathrm{BN}(\mathrm{PW}_{1 \times 1}(\mathrm{DW}_{k}(X)))\big) + \mathcal{S}(X)$$
where $\mathcal{S}$ is a $1 \times 1$ shortcut projection applied only when input and output channel dimensions differ. The three parallel outputs $Z_3, Z_7, Z_{15} \in \mathbb{R}^{C_t \times T}$ are concatenated along the channel dimension:
$$Z = \mathrm{Concat}(Z_3, Z_7, Z_{15}) \in \mathbb{R}^{3C_t \times T}$$
The three kernel sizes correspond to the three temporal scales present in WUSTL-EHMS-2020: $k = 3$ captures fine-grained single-packet anomalies; $k = 7$ models short flow bursts spanning three to four packets; and $k = 15$ spans the full sequence window to capture session-level patterns.
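A minimal PyTorch sketch of the multi-scale extractor follows. The class names `DepSepBlock` and `MultiScaleExtractor`, the 64 channels per tower (inferred from the 192-channel concatenated map), the 20 input features, and the placement of the shortcut addition after the ReLU are all assumptions; the paper's own implementation may differ in these details.

```python
import torch
import torch.nn as nn

class DepSepBlock(nn.Module):
    """Depthwise conv -> pointwise conv -> BN -> ReLU, plus a shortcut."""
    def __init__(self, c_in, c_out, k):
        super().__init__()
        # groups=c_in gives one filter per input channel (depthwise step)
        self.depthwise = nn.Conv1d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False)
        self.pointwise = nn.Conv1d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm1d(c_out)
        # 1x1 projection only when channel counts differ, identity otherwise
        self.shortcut = nn.Conv1d(c_in, c_out, 1, bias=False) if c_in != c_out else nn.Identity()

    def forward(self, x):  # x: (batch, channels, time)
        out = torch.relu(self.bn(self.pointwise(self.depthwise(x))))
        return out + self.shortcut(x)

class MultiScaleExtractor(nn.Module):
    """Three parallel towers, concatenated along the channel dimension."""
    def __init__(self, n_features, channels=64, kernel_sizes=(3, 7, 15)):
        super().__init__()
        self.towers = nn.ModuleList([
            nn.Sequential(DepSepBlock(n_features, channels, k),
                          DepSepBlock(channels, channels, k))
            for k in kernel_sizes
        ])

    def forward(self, x):
        return torch.cat([tower(x) for tower in self.towers], dim=1)

# Sequences of shape (batch, T, F) must be transposed to (batch, F, T) first.
x = torch.randn(8, 20, 15)  # 8 sequences, 20 hypothetical features, T = 15
z = MultiScaleExtractor(n_features=20)(x)
print(z.shape)
```

With odd kernel sizes, `padding=k // 2` keeps the time dimension at T for every tower, which is what allows the three outputs to be concatenated channel-wise.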
3.4.3. Squeeze-and-Excitation Channel Attention
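Following feature extraction, an SE block [51] recalibrates the 192-channel feature map by learning which temporal scale is most discriminative for each input sample. The SE block first applies global average pooling over the time dimension:
$$z = \frac{1}{T} \sum_{t=1}^{T} Z_{:,t} \in \mathbb{R}^{192}$$
This descriptor is then passed through a two-layer bottleneck MLP with a reduction ratio $r$, producing a channel-wise attention weight vector:
$$a = \sigma\big(W_2\, \mathrm{ReLU}(W_1 z)\big)$$
where $W_1 \in \mathbb{R}^{(192/r) \times 192}$, $W_2 \in \mathbb{R}^{192 \times (192/r)}$, and $\sigma$ is the sigmoid function. The bottleneck dimension is $192/r$. The attended feature map is obtained by channel-wise multiplication:
$$\tilde{Z} = Z \odot a$$
where $\odot$ denotes element-wise multiplication with broadcasting of $a$ along the time axis.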
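The squeeze, excitation, and rescaling steps can be written out directly in NumPy for a single sample. The reduction ratio r = 16 and the random weights are assumptions for illustration only (the value 16 is the one the text quotes for the SEResNet baseline, not necessarily for MSCA-Net).

```python
import numpy as np

def se_attention(z, w1, w2):
    """z: (C, T) feature map; w1: (C // r, C); w2: (C, C // r)."""
    squeeze = z.mean(axis=1)                          # (C,) average over time
    hidden = np.maximum(w1 @ squeeze, 0.0)            # ReLU bottleneck, (C // r,)
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # sigmoid gate, (C,)
    return z * weights[:, None], weights              # broadcast along time

rng = np.random.default_rng(0)
C, T, r = 192, 15, 16  # r = 16 is assumed for this sketch
z = rng.standard_normal((C, T))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
attended, weights = se_attention(z, w1, w2)
print(attended.shape, weights.min(), weights.max())
```

Because the gate is computed from the concatenated 192-channel descriptor, each group of 64 channels (one per kernel size) can be up- or down-weighted per sample, which is the cross-scale weighting the section describes.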
3.4.4. LSTM Sequence Modeling and Classification Head
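The attended feature map $\tilde{Z}$ is fed into a single-layer unidirectional LSTM [52] with 128 hidden units. The LSTM recurrence at time step $t$ is:
$$\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$$
where $x_t$ is the SE-attended feature vector at time step $t$, $i_t$, $f_t$, and $o_t$ are the input, forget, and output gates, respectively, $g_t$ is the cell candidate, $c_t$ is the cell state, and $\sigma$ denotes the sigmoid function.
The last hidden state $h_T$ is passed to the classification head:
$$\hat{y} = \mathrm{Head}(h_T)$$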
Input-hidden weights are initialized with Xavier uniform initialization [
52] and recurrent weights with orthogonal initialization [
53].
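The initialization scheme and the use of the final hidden state can be sketched with PyTorch's built-in `nn.LSTM`. The input width of 192 and hidden size of 128 follow the text; the batch size and random input are illustrative assumptions.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=192, hidden_size=128, num_layers=1, batch_first=True)
nn.init.xavier_uniform_(lstm.weight_ih_l0)  # input-hidden weights
nn.init.orthogonal_(lstm.weight_hh_l0)      # recurrent (hidden-hidden) weights

# The SE output is (batch, channels, time); the LSTM expects (batch, time, channels).
attended = torch.randn(8, 192, 15).transpose(1, 2)
_, (h_n, _) = lstm(attended)
last_hidden = h_n[-1]                       # (batch, 128), final hidden state only
print(last_hidden.shape)
```

Only `last_hidden` would be handed to the classification head; the per-step outputs are discarded.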
A single unidirectional LSTM is preferred over a Transformer encoder [54] because the self-attention mechanism incurs $O(T^2)$ memory complexity. On WUSTL-EHMS-2020’s small training set (~2550 sequences), this increases overfitting risk and offers no asymptotic efficiency advantage for $T = 15$.
3.5. Baseline Models
Nine baseline architectures were implemented to provide a comprehensive comparative evaluation spanning the major paradigms in time series classification and sequential intrusion detection.
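ResNet1D [55]: A stack of three one-dimensional residual blocks with $3 \times 1$ convolutions (channels: 64→128→128). Each block computes:
$$Y = \mathrm{ReLU}\big(\mathcal{F}(X) + \mathcal{S}(X)\big)$$
where $\mathcal{F}$ is the convolutional path and $\mathcal{S}$ is the shortcut connection. Global average pooling precedes the linear classifier.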
Conv-BiLSTM [56]: A two-layer 1D convolutional front-end (64 channels,
) feeding a bidirectional LSTM (128 hidden units per direction). The final hidden states from both directions are concatenated:
$$h = [\overrightarrow{h}_T ; \overleftarrow{h}_1] \in \mathbb{R}^{256}$$
InceptionTime [57]: Three stacked Inception modules, each combining four parallel branches (kernels 1, 3, 5, and a max-pool branch), concatenated and normalized before the next module. Global average pooling produces the final representation.
Transformer [54]: Sinusoidal positional encoding added to a linear projection of the input, followed by two Transformer encoder layers with 4 attention heads,
, and feed-forward dimension 256. The attention mechanism is:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$
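For reference, scaled dot-product attention for a single head can be sketched in NumPy as below. The head dimension of 16 and the random inputs are hypothetical; T = 15 matches the window length used in the paper.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (T, d) arrays. Returns the (T, d) output and (T, T) weights."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
T, d = 15, 16  # d = 16 is a hypothetical per-head dimension
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape, weights.shape)
```

The (T, T) weight matrix is the source of the quadratic memory term discussed in Section 3.4.4; at T = 15 it holds only 225 entries per head.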
LSTM-FCN [58]: Two parallel branches, a single-layer LSTM (128 hidden units) and a three-layer FCN (channels: 128→256→128), are concatenated before the final classifier. This design captures both long-range temporal dependencies and local multi-scale features simultaneously.
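AttBiGRU [59]: A two-layer bidirectional GRU with additive self-attention pooling over all $T$ hidden states. The context vector is:
$$c = \sum_{t=1}^{T} \alpha_t h_t, \qquad \alpha_t = \frac{\exp\big(v^{\top} \tanh(W h_t + b)\big)}{\sum_{j=1}^{T} \exp\big(v^{\top} \tanh(W h_j + b)\big)}$$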
DilResNet [60]: Four residual blocks with exponentially increasing dilation rates
. For a dilated convolution with rate
, the receptive field after
layers is
, enabling a theoretical receptive field of up to 30 time steps at the final layer.
ConvTran [61]: A two-layer convolutional stem (
, 128 channels) extracts local patterns, whose output is passed to two Transformer encoder layers for global dependency modeling, combining local inductive bias with long-range attention.
SEResNet [51]: Three residual blocks each augmented with a channel SE gate (reduction ratio 16). The SE gate adaptively recalibrates channel responses at every depth level, making it a single-scale counterpart to MSCA-Net’s multi-scale SE design.
3.6. Training Configuration
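All models were trained under an identical configuration to ensure a fair comparison. The optimizer was AdamW with a learning rate $\eta$ and weight decay $\lambda$:
$$\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_t \right)$$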
A cosine-annealing learning rate schedule was applied after a three-epoch linear warm-up phase. At epoch $e$ (post warm-up), the effective learning rate is:
$$\eta_e = \frac{\eta}{2} \left( 1 + \cos \frac{\pi\, (e - e_w)}{E - e_w} \right)$$
where $E$ is the maximum number of epochs and $e_w = 3$ is the number of warm-up epochs. Gradient norms were clipped to 0.5 at every update step [62]. Training was terminated early when the validation weighted F1 score failed to improve by more than a threshold $\delta$ over 10 consecutive epochs.
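The warm-up and cosine-annealing schedule can be sketched as a plain function of the epoch index. The base learning rate of 1e-3, the 50-epoch budget, and annealing all the way to zero are assumptions for illustration; only the three warm-up epochs come from the text.

```python
import math

def lr_at_epoch(epoch, base_lr, max_epochs, warmup_epochs=3):
    """Linear warm-up followed by cosine annealing to zero (assumed floor)."""
    if epoch < warmup_epochs:
        # ramp linearly up to base_lr over the warm-up epochs
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, max_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

for e in (0, 2, 3, 25, 50):
    print(e, f"{lr_at_epoch(e, base_lr=1e-3, max_epochs=50):.6f}")
```

In PyTorch the same shape is commonly obtained by chaining a linear warm-up scheduler with `CosineAnnealingLR`; the closed form above is shown only to make the schedule explicit.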
To address class imbalance, class-weighted cross-entropy loss was used. Per-class weights were computed via the balanced weighting scheme:
$$w_c = \frac{N}{C \cdot n_c}$$
where $N$ is the total number of training sequences, $C$ is the number of classes, and $n_c$ is the number of training sequences in class $c$. Weights were clipped to $[w_{\min}, w_{\max}]$ to prevent excessively large gradients. Label smoothing [63] with smoothing factor $\varepsilon$ was applied to the target distribution:
$$\tilde{y}_c = (1 - \varepsilon)\, y_c + \frac{\varepsilon}{C}$$
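The balanced class weights and smoothed targets can be computed as in the sketch below. The clipping range (0.5, 10.0), the smoothing factor 0.1, and the toy label counts are hypothetical placeholders, since the values used in the paper are not recoverable from the text.

```python
import numpy as np

def balanced_class_weights(labels, n_classes, clip=None):
    """w_c = N / (C * n_c), optionally clipped to (low, high)."""
    counts = np.bincount(labels, minlength=n_classes)
    weights = len(labels) / (n_classes * counts)  # assumes every class is present
    if clip is not None:
        weights = np.clip(weights, clip[0], clip[1])
    return weights

def smooth_targets(labels, n_classes, eps):
    """Mix one-hot targets with a uniform distribution over classes."""
    one_hot = np.eye(n_classes)[labels]
    return (1.0 - eps) * one_hot + eps / n_classes

labels = np.array([0] * 90 + [1] * 9 + [2] * 1)  # toy imbalance
w = balanced_class_weights(labels, 3, clip=(0.5, 10.0))
t = smooth_targets(labels[:1], 3, eps=0.1)
print(w, t)
```

Without clipping, the single-sample class would receive a weight of about 33, which is the kind of outsized gradient contribution the clipping step is meant to prevent.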
All experiments were executed on a CUDA-enabled GPU using PyTorch 2.1.0 with CUDA 11.8 and automatic mixed-precision (AMP) training. Each model was seeded identically (seed = 42). The best checkpoint, selected by the highest validation weighted F1 score, was retained for all final evaluations.
3.7. Evaluation Metrics
Model performance was assessed on the held-out test set using the following metrics.
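Weighted F1 Score: a support-weighted average of per-class F1 scores, robust to class imbalance:
$$\mathrm{F1}_{w} = \sum_{c=1}^{C} \frac{n_c}{N}\, \mathrm{F1}_c$$
AUC-ROC: evaluated in a one-vs-rest scheme for each class, plus micro-averaged and macro-averaged variants. The micro-averaged AUC pools all class predictions:
$$\mathrm{AUC}_{\mathrm{micro}} = \mathrm{AUC}\big(\mathrm{vec}(Y),\ \mathrm{vec}(\hat{P})\big)$$
where $Y$ is the one-hot label matrix and $\hat{P}$ is the matrix of predicted class probabilities.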
Confusion Matrix: both raw count and row-normalized form were computed to reveal per-class error patterns.
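The relationship between the confusion matrix and the F1 variants can be made concrete with a short NumPy sketch. The two-class matrix is a toy example, not data from the paper, and the function name `f1_scores` is hypothetical.

```python
import numpy as np

def f1_scores(cm):
    """cm[i, j] = number of samples of true class i predicted as class j."""
    tp = np.diag(cm).astype(float)
    precision = tp / np.maximum(cm.sum(axis=0), 1)  # column sums = predicted counts
    recall = tp / np.maximum(cm.sum(axis=1), 1)     # row sums = true counts (support)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    support = cm.sum(axis=1)
    weighted = float((f1 * support).sum() / support.sum())
    macro = float(f1.mean())
    return f1, weighted, macro

cm = np.array([[90, 10],
               [5, 5]])                              # toy imbalanced example
f1, weighted, macro = f1_scores(cm)
row_normalized = cm / cm.sum(axis=1, keepdims=True)  # per-class recall on the diagonal
print(f1, weighted, macro)
```

In this toy case the weighted F1 is noticeably higher than the macro F1 because the majority class dominates the support-weighted average, which is why both are worth reporting on imbalanced data.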
4. Results and Discussion
4.1. Performance on CICIoMT2024 (Large-Scale Multi-Protocol Benchmark)
The CICIoMT2024 dataset presents the most challenging evaluation scenario, with 4.89 million flow records spanning six coarse-grained classes (benign plus five attack categories) and 19 fine-grained classes (benign plus 18 attack subtypes). This dataset tests scalability under severe class imbalance and protocol diversity (Wi-Fi, MQTT, Bluetooth).
4.1.1. Coarse-Grained Classification (Six-Class)
Table 7 presents MSCA-Net’s granular per-class metrics, revealing exceptional performance across dominant categories balanced against minority-class challenges. The model achieves near-perfect F1 scores for high-volume attacks, DDoS (0.9997), DoS (0.9982), and MQTT (0.9918), with precision–recall balance exceeding 0.98 for all. These results demonstrate robust detection of volumetric and protocol-specific attacks that constitute 89% of the test data. The benign class achieves F1 = 0.9708 with high precision (0.9821), indicating conservative classification that minimizes false alarms on legitimate medical device traffic. Reconnaissance detection remains strong (F1 = 0.9870) despite moderate sample scarcity (969 instances), validating the multi-scale architecture’s capacity to capture session-level scanning patterns through the
k = 15 tower. The critical limitation emerges in Spoofing detection (F1 = 0.5818), where extreme class imbalance (61 samples, 0.11% of data) drives asymmetric performance: 78.69% recall versus 46.15% precision. This pattern, high recall with low precision, reflects both the architectural challenge of ARP Spoofing detection and the deliberate preference for avoiding false negatives over avoiding false positives in safety-critical environments. At this prevalence, the reported recall and precision correspond to 48 true positives among roughly 104 Spoofing predictions, so each percentage point of precision gained requires eliminating only about two false positives, suggesting practical deployment should pair MSCA-Net’s high recall with secondary verification mechanisms for Spoofing alerts.
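The counts implied by the reported Spoofing metrics can be backed out with a few lines of arithmetic. The sketch uses only the support, recall, and precision quoted above; rounding to whole samples is an assumption.

```python
support, recall, precision = 61, 0.7869, 0.4615

tp = round(recall * support)              # true positives implied by recall
fp = round(tp / precision) - tp           # false positives implied by precision
f1 = 2 * precision * recall / (precision + recall)

# False positives that must be removed to gain one precision point
target = precision + 0.01
fp_to_remove = fp - (tp / target - tp)

print(f"TP={tp}, FP={fp}, F1={f1:.4f}, FP to remove per point={fp_to_remove:.1f}")
```

Because the number of true positives is so small, precision for this class is very sensitive to a handful of false alarms in either direction.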
Table 8 contextualizes these results through comparative benchmarking against nine baseline architectures. MSCA-Net achieves the highest rankings among the ten evaluated architectures across all metrics: accuracy (99.75%), weighted F1 (99.77%), and, critically, macro-F1 (92.16%). The macro-F1 advantage of 2.84 points over second-ranked ATT-BiGRU (89.32%) demonstrates superior minority-class robustness, while the 0.18-point accuracy margin translates to 101 fewer misclassifications across 56,495 test samples. Efficiency metrics reveal 2.1× faster inference than ATT-BiGRU (4.851 s versus 10.245 s), attributable to depthwise separable convolutions reducing parameters by factor
k relative to standard convolutions. The performance hierarchy exposes architectural principles: multi-scale approaches (MSCA-Net, InceptionTime) dominate single-scale methods (ResNet1D, SE-ResNet), while lightweight recurrence outperforms Transformers on short sequences. Notably, the attention-augmented recurrent baseline (ATT-BiGRU) achieves competitive accuracy but fails to match MSCA-Net’s efficiency–recall trade-off. The 15.55-point macro-F1 gap between MSCA-Net (92.16%) and Dil-ResNet (76.61%) quantifies the importance of both multi-scale design and adaptive channel weighting for imbalanced intrusion detection. These results position MSCA-Net as the strongest of the evaluated architectures for operational IoMT deployment, balancing detection efficacy with the computational constraints of edge medical devices. To confirm that these results are not inflated by near-duplicate flows, hash-based deduplication was applied to the CICIoMT2024 dataset prior to training, with accuracy remaining at 99.71% after deduplication.
Figure 3 presents raw count confusion matrices for MSCA-Net, InceptionTime, ResNet1D, and Transformer on the CICIoMT2024 six-class task, revealing absolute error distributions across attack categories. MSCA-Net exhibits the tightest diagonal concentration with minimal off-diagonal leakage: DDoS (37,326 correct), DoS (14,542), MQTT (2227), and Recon (950) show strong diagonal dominance. Notable errors include 50 benign samples misclassified as Spoofing and five Spoofing samples distributed across benign (8), DDoS (0), and Recon (0), reflecting the extreme minority challenge (61 total samples). The 11 DDoS→DoS and 33 DoS→DDoS confusions indicate inherent signature similarity between volumetric attack variants.
InceptionTime displays increased dispersion: 125 benign→Spoofing errors (2.5× MSCA-Net’s 50), 2209 DDoS→DoS misclassifications, and five Spoofing→benign errors. ResNet1D exhibits severe majority-class bias: 3249 DDoS→DoS and 849 DoS→DDoS errors demonstrate collapsed discrimination between volumetric attacks, while 151 benign→Spoofing and seven Spoofing→Recon errors indicate systematic minority-class confusion. Transformer shows a similar pattern: 2365 DDoS→DoS and 4708 DoS→DDoS errors exceed MSCA-Net’s substantially, with 110 benign→Spoofing and 10 Recon→Spoofing misclassifications. The raw counts confirm MSCA-Net’s lower absolute error across the reported categories, particularly for minority-class detection, where the three competitors produce 2.2 to 3.0 times as many benign→Spoofing errors (110 to 151 versus 50).
Figure 4 presents ROC curves comparing MSCA-Net (left) and InceptionTime (right) on the CICIoMT2024 six-class task, revealing discrimination performance across operating thresholds. MSCA-Net achieves superior micro-AUC (0.998 versus 0.990) with tighter curve clustering near the top-left corner, indicating reliable classification across most categories. Both models exhibit near-perfect AUC for DDoS (1.000) and MQTT (0.999), as these high-volume attacks possess distinctive signatures; MSCA-Net is clearly ahead on DoS (0.998 vs. 0.946), while InceptionTime is marginally ahead on Recon (0.990 vs. 0.981). Critical divergence emerges for Spoofing: MSCA-Net achieves AUC = 0.804 versus InceptionTime’s 0.991, an unexpected inversion suggesting InceptionTime’s fixed multi-scale aggregation better captures this specific minority class’s limited patterns. However, MSCA-Net’s Spoofing curve shows a steeper initial rise (TPR = 0.78 at FPR = 0.2), preferable for security deployments prioritizing detection over false alarm tolerance. The micro-AUC advantage confirms MSCA-Net’s superior overall discrimination, while the Spoofing anomaly motivates future architectural refinement for extreme minority classes.
Despite the overall strong performance, several minority attack classes exhibit substantially lower scores due to extreme class imbalance. The Spoofing class achieves a precision of 0.4615, attributable to only 61 test samples out of 56,495 (0.11% of the test set). Recon-Ping_Sweep yields an F1 score of 0.00 with only seven test samples, and MQTT-DDoS-Publish_Flood achieves an F1 of 0.1333. In safety-critical IoMT environments, the model is intentionally tuned to prioritize recall over precision for these minority classes to minimize the risk of missed attack detections.
4.1.2. Fine-Grained Classification (19-Class)
The 19-class taxonomy decomposes five attack categories into 18 individual subtypes, providing a substantially more challenging discrimination task that tests MSCA-Net’s capacity for fine-grained attack attribution essential for targeted response strategies.
Table 9 exposes specific discrimination challenges within attack families. Network-layer volumetric attacks achieve exceptional performance: TCP/IP-DDoS variants (ICMP, SYN, TCP, UDP) all exceed F1 = 0.995, with TCP_IP-DoS-TCP reaching 0.9997. These results leverage distinctive flow rate signatures easily captured across all architectural scales. Reconnaissance subtypes show graded difficulty. Port Scan (F1 = 0.9612, 792 samples) benefits from clear session structure patterns; OS Scan (F1 = 0.8763, 134 samples) degrades moderately; VulScan (F1 = 0.5970, 36 samples) approaches the usability boundary; and Ping Sweep (seven samples) achieves F1 = 0.000, representing an extreme minority class where no model learns meaningful decision boundaries. MQTT attack differentiation presents the most significant architectural challenge. MQTT-DoS-Publish_Flood achieves F1 = 0.6851 with severe precision–recall asymmetry (0.5220 precision, 0.9966 recall), indicating systematic confusion with similar flooding patterns. Conversely, MQTT-DDoS-Publish_Flood collapses to F1 = 0.1333 (1.0000 precision, 0.0714 recall), suggesting the model learns to suppress this rare subtype entirely. MQTT-DoS-Connect_Flood shows an inverse pattern (0.6587 precision, 1.0000 recall, F1 = 0.7942). These complementary failures, one subtype flooding predictions, the other starved, indicate that multi-scale convolutions cannot resolve fine-grained timing distinctions between connect-oriented and publish-oriented flooding without additional temporal features. MQTT-Malformed_Data maintains strong performance (F1 = 0.9492), as malformed packets produce distinctive feature anomalies.
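The reported F1 scores for the three MQTT flooding subtypes follow directly from their precision and recall, as the short check below illustrates. Only values quoted in the paragraph above are used.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

reported = {  # subtype: (precision, recall, reported F1)
    "MQTT-DoS-Publish_Flood": (0.5220, 0.9966, 0.6851),
    "MQTT-DDoS-Publish_Flood": (1.0000, 0.0714, 0.1333),
    "MQTT-DoS-Connect_Flood": (0.6587, 1.0000, 0.7942),
}
for name, (p, r, expected) in reported.items():
    print(f"{name}: F1 = {f1(p, r):.4f} (reported {expected})")
```

The harmonic mean is dominated by the smaller of the two terms, which is why perfect precision cannot rescue MQTT-DDoS-Publish_Flood from its 7% recall.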
ARP Spoofing achieves F1 = 0.7667, substantially improved over the six-class aggregation (0.5818), suggesting that subtype-specific training enables finer decision boundary learning despite identical sample count. This improvement validates the hierarchical classification strategy: coarse-grained detection identifies attack presence, fine-grained attribution enables targeted remediation.
Table 10 presents rankings and metrics for ten architectures on the fine-grained CICIoMT2024 19-class task, exposing dramatic performance stratification as classification granularity increases. ATT-BiGRU achieves rank one (99.03% accuracy, 98.89% weighted F1), with MSCA-Net closely trailing at rank two (98.98%, 98.85%), a narrow 0.05-point accuracy inversion from the six-class task. The 2.1× inference advantage of MSCA-Net (4.954 s versus 10.341 s) sustains practical superiority despite a marginal accuracy deficit.
Mid-tier architectures (InceptionTime, ResNet1D, Transformer) achieve 82–93% accuracy, while lower-tier designs collapse catastrophically; LSTM-FCN (60.43%), Dil-ResNet (45.67%), and SE-ResNet (43.93%) fail to discriminate the 18 attack subtypes. This 55-point accuracy spread, versus 30 points on the six-class task, demonstrates that architectural sophistication becomes critical with increased label cardinality. MSCA-Net’s macro-F1 (0.8266) trails both ATT-BiGRU (0.8416) and InceptionTime (0.8359) by less than 0.02, indicating comparable minority-subtype detection. Its training time (470.8 s), the fastest among the top three performers, maintains efficiency advantages essential for operational deployment.
Figure 5 presents per-class ROC curves comparing InceptionTime (left) and MSCA-Net (right) on the 19-class task, exposing architectural differences in class discrimination and minority-class detection. Both models achieve near-perfect AUC for network-layer attacks (TCP/IP DDoS/DoS variants: AUC > 0.998), with curves hugging the top-left corner. However, substantial divergence emerges for challenging subtypes. MSCA-Net shows better discrimination for ARP Spoofing (AUC = 0.877 vs. InceptionTime’s 0.808) but is slightly lower for MQTT-Malformed_Data (AUC = 0.944 vs. 0.986); its curves progress more smoothly, indicating more stable scores across thresholds. InceptionTime exhibits characteristic “stair-step” ROC patterns for MQTT flooding attacks, particularly MQTT-DDoS-Publish_Flood (AUC = 0.395) and MQTT-DoS-Publish_Flood (AUC = 0.998), suggesting unstable probability estimates that complicate operational threshold selection.
The most striking result appears in Recon-Ping_Sweep: both models achieve near-zero AUC (MSCA-Net 0.066, InceptionTime 0.000), confirming that a class with only seven samples precludes learnable discrimination. MSCA-Net’s micro-AUC (0.993) slightly trails InceptionTime’s (0.995), but this aggregate metric masks per-class differences. MSCA-Net maintains higher true positive rates at low false positive rates for Recon-VulScan and MQTT-DoS-Connect_Flood, which is critical for security deployments prioritizing detection over false alarm tolerance. Taken together, the ROC curves indicate that MSCA-Net’s multi-scale depthwise design achieves fine-grained discrimination comparable to InceptionTime’s fixed multi-scale aggregation while providing smoother score distributions for threshold-based alert systems.
4.2. Performance on WUSTL-EHMS-2020 (Small-Scale Biometric Dataset)
The WUSTL-EHMS-2020 dataset evaluates model efficacy under data scarcity, with only ~16,318 records (13,054 training sequences after preprocessing). This scenario tests architectural efficiency and resistance to overfitting [
44].
Table 11 reveals a ranking inversion compared to CICIoMT2024. ResNet1D achieves top performance (92.46% accuracy, 90.66% weighted F1), followed by Conv-BiLSTM (91.54%, 90.56%). MSCA-Net ranks third with 90.00% accuracy and 89.67% weighted F1, though it maintains the fastest training time (12.063 s) and lowest total runtime (14.2 s) among top performers. This pattern validates that simpler architectures with stronger inductive biases excel when training data is limited. However, MSCA-Net’s macro-F1 (0.7097) exceeds ResNet1D’s (0.6995), indicating superior minority class detection (Spoofing and Data Alteration) despite marginally lower overall accuracy. Dil-ResNet and LSTM-FCN show competitive accuracy (92.77%, 92.62%) but degraded macro-F1 (0.6367, 0.6364), revealing overfitting to majority classes. ATT-BiGRU collapses to rank 10 (85.23% accuracy), confirming that attention mechanisms require abundant data for calibration.
Figure 6 presents normalized confusion matrices for MSCA-Net, Dil-ResNet, LSTM-FCN, and ResNet1D on the three-class task. MSCA-Net exhibits moderate diagonal concentration with notable Data Alteration dispersion into Spoofing (0.15) and benign (0.12), reflecting the challenge of detecting payload content modifications. ResNet1D shows tighter diagonal peaks but excessive benign↔Spoofing confusion, indicating collapsed discrimination between device impersonation and normal traffic. Dil-ResNet displays characteristic “checkerboard” off-diagonal patterns from dilated convolution artifacts, while LSTM-FCN shows asymmetric confusion favoring benign predictions. All matrices reveal universal Data Alteration under-detection, as biometric-enriched features require precise temporal alignment that limited training data cannot support.
Figure 7 illustrates training and validation loss/accuracy curves across epochs for top-performing architectures. MSCA-Net exhibits rapid convergence within 15 epochs with minimal validation gap, confirming parameter efficiency. ResNet1D shows slower convergence but superior final validation accuracy, benefiting from residual skip connections preventing gradient degradation. Conv-BiLSTM displays oscillating validation loss indicative of bidirectional gradient instability. The figure reveals the necessity of early stopping; all models reach peak validation performance by epoch 25–35 before overfitting onset. MSCA-Net’s shallow loss valley enables aggressive early stopping, contributing to its fast training time despite architectural complexity.
Figure 8 presents ROC curves comparing MSCA-Net (left) and ResNet1D (right) on WUSTL-EHMS-2020’s three-class task, revealing critical performance differences on small-scale biometric data. Both models achieve perfect Data Alteration detection (AUC = 1.000), as payload modifications produce distinctive feature signatures. However, Spoofing discrimination exposes architectural divergence. MSCA-Net achieves AUC = 0.703 versus ResNet1D’s 0.675, confirming superior minority-class detection despite lower overall accuracy. The Spoofing curves exhibit characteristic “stair-step” patterns reflecting limited training samples (approximately 1300 sequences), with MSCA-Net showing a steeper initial rise (TPR = 0.6 at FPR = 0.2), indicating better detection at low false-positive rates. Normal class curves differ modestly (AUC = 0.843 vs. 0.812), with ResNet1D’s tighter corner hugging contributing to its accuracy advantage. Micro-AUC values (MSCA-Net 0.961, ResNet1D 0.965) mask these per-class trade-offs, validating macro-F1 as the preferred metric for imbalanced security evaluation. The comparison confirms MSCA-Net’s stronger discrimination for critical attack detection at acceptable false alarm rates.
4.3. Cross-Dataset Generalization Analysis
Table 12 presents comprehensive performance metrics across all three experimental conditions, revealing critical architectural generalization patterns for IoMT intrusion detection. The table synthesizes accuracy, weighted F1, and per-dataset rankings for ten architectures, enabling diagnostic comparison across large-scale multi-protocol traffic (CICIoMT2024 six-class and 19-class classifications) and small-scale biometric data (WUSTL-EHMS-2020). It should be noted that this analysis evaluates cross-regime consistency across different dataset conditions rather than cross-dataset transfer, where the model is trained on one dataset and tested on another; the latter remains a direction for future work.
MSCA-Net achieves the best average rank of 2.0, securing positions 1, 2, and 3 across CIC-6, CIC-19, and WUSTL, respectively. This remarkable consistency demonstrates that multi-scale depthwise separable convolutions combined with adaptive SE attention and lightweight LSTM sequencing provide optimal inductive biases spanning both data regimes. The architecture balances expressiveness for complex multi-protocol traffic with regularization preventing overfitting on limited training examples, a synthesis absent in competing designs.
The contrast with specialized architectures exposes fundamental limitations. ATT-BiGRU (average rank 4.3) dominates large-scale datasets (ranks 1–2, 99.57%/99.03% accuracy) but collapses on WUSTL (rank 10, 85.23%), indicating over-reliance on abundant training data for attention mechanism calibration. In these experiments, bidirectional GRU attention appears to degenerate toward uniform averaging once the number of training sequences drops below roughly 15,000, eroding its discriminative capacity. ResNet1D (average rank 4.3) exhibits the inverse specialization, excelling on small data (rank 1, 92.46%) but failing to scale (rank 8 on CIC-6, 76.03%), which suggests that single-scale 3 × 1 convolutions lack capacity for hierarchical temporal modeling. These opposite failure modes yield identical average ranks but render both architectures unsuitable for generalizable deployment.
InceptionTime (average rank 4.7) achieves moderate consistency (ranks 3, 3, 8) through fixed multi-scale aggregation but cannot adapt scale importance to input characteristics, limiting performance ceilings and wasting capacity on irrelevant scales when data is scarce. The five-rank gap to MSCA-Net on WUSTL specifically indicates static feature concatenation deficiencies.
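The average ranks quoted in this section can be reproduced from the per-task ranks stated in the text. Only the three models whose ranks on all three tasks are given explicitly are included.

```python
ranks = {  # (CIC-6, CIC-19, WUSTL), as stated in the text
    "MSCA-Net": (1, 2, 3),
    "ATT-BiGRU": (2, 1, 10),
    "InceptionTime": (3, 3, 8),
}
avg_rank = {model: sum(r) / len(r) for model, r in ranks.items()}
spread = {model: max(r) - min(r) for model, r in ranks.items()}

for model in ranks:
    print(f"{model}: average rank {avg_rank[model]:.1f}, spread {spread[model]}")
```

The spread (worst minus best rank) is added here as a simple consistency indicator; it is not a metric reported by the authors.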
Lower-tier architectures demonstrate progressive degradation. Conv-BiLSTM and Transformer (both 5.3) form a middle tier with moderate variance but no peak performance. SE-ResNet (9.3) and Dil-ResNet (7.7) confirm that channel attention or dilation alone cannot substitute for explicit multi-scale design, collapsing to 43–72% accuracy on large-scale tasks.
Metric sensitivity analysis reveals evaluation nuances. On CIC datasets, accuracy and weighted F1 correlate strongly (ρ > 0.99). However, on WUSTL, macro-F1 inversions occur; MSCA-Net trails ResNet1D in accuracy (−2.5 points) but leads in macro-F1 (+1.0 point), demonstrating that security-critical evaluation must prioritize per-class balance over aggregate accuracy. MSCA-Net’s superior minority-class detection (Spoofing, Data Alteration), despite an overall accuracy deficit, validates architectural suitability for safety-critical environments where missed attacks cost lives.
Operational implications are profound. Hospital networks exhibit heterogeneous data regimes: large centers generate CIC-scale traffic, and rural clinics operate at WUSTL-scale. Deploying ATT-BiGRU risks catastrophic failure at smaller sites; ResNet1D sacrifices capability at major centers. Only MSCA-Net maintains reliable performance across this spectrum, with efficiency advantages enabling rapid adaptation. The 2.0 average rank thus translates to reduced deployment risk and unified architectural standards across healthcare infrastructure scales.
From a Pareto-efficiency perspective, MSCA-Net consistently occupies the optimal region of the accuracy–runtime space across all three experimental conditions. On the six-class task, it achieves the highest accuracy (0.9975) with an inference time of 4.851 s, outperforming ATT-BiGRU (10.245 s) and Conv-BiLSTM (15.640 s), which rank second and seventh, respectively. On the 19-class task, MSCA-Net ranks second in accuracy while requiring only 4.954 s of inference time compared to the first-ranked ATT-BiGRU (10.341 s). On WUSTL-EHMS-2020, it achieves rank three with the second lowest total runtime (14.2 s). These results demonstrate that the additional architectural components do not introduce disproportionate computational overhead relative to the performance gains achieved.
4.4. Discussion
The empirical results confirm that MSCA-Net strikes an optimal balance between expressiveness and efficiency for IoMT intrusion detection. On large-scale, multi-protocol data, the architecture outperforms or matches far more parameter-heavy alternatives while running 2× faster at inference, a critical advantage for real-time edge deployment on battery-powered medical devices. On small biometric-enriched datasets, MSCA-Net remains competitive and delivers the fastest training, enabling rapid hospital-specific fine-tuning without architectural modification.
The multi-scale depthwise design proves particularly effective at capturing the hierarchical nature of IoMT traffic. Parallel towers at k ∈ {3, 7, 15} model packet-level anomalies, burst patterns, and session-level progressions, respectively, which are the temporal granularities inherent to healthcare network behavior. SE attention provides adaptive scale weighting that single-scale baselines lack, dynamically emphasizing relevant features per input sample. The lightweight unidirectional LSTM mitigates the overfitting risk observed in Transformer-based models on short sequences (T = 15), where self-attention’s quadratic complexity offers no asymptotic advantage.
Cross-dataset analysis reveals architectural specialization patterns with profound operational implications. Attention-based approaches (ATT-BiGRU) dominate abundant data but collapse when training examples fall below 15,000 sequences. Pure convolutional designs (ResNet1D) excel with limited data yet fail to scale for complex multi-protocol traffic. Only MSCA-Net’s unified design (multi-scale depthwise convolutions, adaptive channel attention, and constrained recurrence) maintains consistent top-tier performance across regimes. This generalization capability reduces deployment risk in heterogeneous healthcare environments where data availability varies dramatically between tertiary centers and rural clinics.
Class imbalance handling via weighted loss and label smoothing proves effective overall, yet extreme minority classes remain challenging. Recon-Ping_Sweep (seven samples) achieves zero F1 across all models; ARP Spoofing (61 samples) reaches only 58–77% F1 despite architectural sophistication. These edge cases represent fundamental limitations of gradient-based learning from scarce examples. Future work could incorporate synthetic minority oversampling techniques, such as SMOTE or generative augmentation, or explore meta-learning approaches that transfer knowledge from abundant attack categories to rare subtypes.
Computational efficiency metrics position MSCA-Net for practical deployment. Training in 470–752 s across datasets enables rapid iteration during hospital onboarding. Inference at 4.85–4.95 s per 56,000 sequences supports sub-millisecond per-packet processing budgets essential for real-time gateway deployment. The 2.1× speed advantage over ATT-BiGRU stems directly from depthwise separable convolutions, which reduce parameters by factor k relative to standard convolutions without sacrificing representational capacity.
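The per-sequence figures implied by the reported batch inference times can be checked with simple arithmetic. The sketch assumes the timings cover the full 56,495-sample test set mentioned earlier; it yields a batch-amortized cost per sequence, not a measured single-packet latency.

```python
def per_sequence_latency_us(total_seconds, n_sequences):
    """Batch-amortized inference cost per sequence, in microseconds."""
    return total_seconds / n_sequences * 1e6

N_TEST = 56_495  # test-set size quoted in the text (assumed to match the timings)
six_class = per_sequence_latency_us(4.851, N_TEST)
nineteen_class = per_sequence_latency_us(4.954, N_TEST)
speedup = 10.245 / 4.851  # ATT-BiGRU vs. MSCA-Net on the six-class task

print(f"{six_class:.1f} us, {nineteen_class:.1f} us, speedup {speedup:.2f}x")
```

Both values fall well under one millisecond, consistent with the claim above, though real gateway latency would also depend on batching and hardware.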
Although dedicated ablation experiments were not conducted, the comparative baseline evaluation provides surrogate evidence for each component’s contribution. SE-ResNet, which uses single-scale convolution with channel attention but no recurrence, achieves a macro-F1 of 0.7789 on the six-class task, 0.1427 below MSCA-Net (0.9216), indicating the value of multi-scale depthwise convolutions. InceptionTime, which provides multi-scale feature extraction without SE attention or LSTM recurrence, achieves a macro-F1 of 0.8485, suggesting the combined contribution of the SE attention and LSTM components. ATT-BiGRU, which offers attention-based recurrence without multi-scale convolution, drops to rank 10 on WUSTL-EHMS-2020 (
Table 11), contrasting with MSCA-Net’s rank three, which indicates that the multi-scale convolutional front-end is essential for robustness under data scarcity. Formal ablation experiments isolating each component represent an important direction for future work.
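Because this comparison relies on macro-F1 while the headline results report weighted F1, it helps to see how the two averages diverge under class imbalance. The sketch below uses hypothetical per-class F1 scores and supports (not values from the paper) to show that a single zero-F1 minority class barely moves the weighted average but pulls the macro average down sharply.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(f1_scores):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(f1_scores) / len(f1_scores)

def weighted_f1(f1_scores, supports):
    """Support-weighted mean of per-class F1: large classes dominate."""
    total = sum(supports)
    return sum(f * n for f, n in zip(f1_scores, supports)) / total

# Hypothetical per-class results: two large classes, two very small ones.
f1_scores = [0.999, 0.995, 0.70, 0.0]
supports = [50_000, 5_000, 61, 7]

print("macro F1:   ", round(macro_f1(f1_scores), 4))
print("weighted F1:", round(weighted_f1(f1_scores, supports), 4))
```

This is why a model can report a weighted F1 above 99% and a noticeably lower macro-F1 on the same task.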
Limitations motivate continued refinement. Current concatenation-based fusion treats all scales equally before SE weighting; learned fusion strategies could improve scale selection. Fixed window length (T = 15) prevents cross-window pattern detection; hierarchical attention over variable-length segments could capture extended attack progressions. Federated extensions would enable privacy-preserving adaptation across hospitals without centralizing sensitive health data. Hardware optimization through TensorRT quantization could further reduce edge inference latency for ultra-low-power medical devices.
In summary, MSCA-Net’s consistent top-tier ranking, combined with parameter and inference efficiency, positions it as a practical, deployable solution for multi-class intrusion detection in resource-constrained, safety-critical IoMT environments.
4.5. Limitations and Real-World Applicability
MSCA-Net demonstrates high performance on the CICIoMT2024 dataset, achieving 99.75% accuracy and a weighted F1 score of 99.77%. However, data emulated in a laboratory setting cannot fully reflect the distribution shifts, device heterogeneity, and multi-protocol dynamics found in real clinical environments. Many systematic reviews emphasize that most IoMT IDS solutions evaluated in laboratory settings show significant performance drops in field applications [4,14].
In this context, the CICIoMT2024 dataset offers a more realistic traffic structure, created using over 40 real medical devices and 18 different attack vectors. It therefore provides a more reliable foundation for evaluating MSCA-Net than the datasets used in similar studies, and the resulting experimental environment is more comprehensive than those of established and recent studies in the literature [9,26,28,33,35]. However, additional studies are needed to fully validate the model in real clinical environments. To this end, real-time tests on edge/fog gateways are planned. In addition, the model will be adapted to multi-hospital environments using a federated learning approach, and field validation with live patient data will be conducted. These steps will help demonstrate that MSCA-Net can maintain its high performance under real-world conditions, which is critical for patient safety.
5. Conclusions
This study presented MSCA-Net, a Multi-Scale Depthwise Attention Network specifically engineered for IoMT intrusion detection. The architecture unifies three complementary advances: (1) parallel depthwise separable convolution towers at scales k ∈ {3, 7, 15} for multi-granular traffic analysis; (2) squeeze-and-excitation channel attention applied to concatenated multi-scale features for adaptive scale selection; and (3) lightweight unidirectional LSTM sequencing to avoid Transformer’s quadratic complexity on short sequences.
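The first two components can be illustrated with a dependency-free forward pass. This is a minimal sketch under stated assumptions, not the authors' implementation: the channel widths, random weights, and SE reduction ratio are hypothetical, biases and normalization are omitted, and the LSTM stage is left out for brevity.

```python
import math
import random

def depthwise_conv1d(x, kernels):
    """x: [C][T]; kernels: [C][k] with odd k. One filter per channel, zero 'same' padding."""
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        pad = k // 2
        padded = [0.0] * pad + list(channel) + [0.0] * pad
        out.append([sum(kernel[j] * padded[t + j] for j in range(k))
                    for t in range(len(channel))])
    return out

def pointwise_conv1d(x, weights):
    """1x1 convolution: weights is [C_out][C_in], mixing channels at each time step."""
    steps = len(x[0])
    return [[sum(w * x[c][t] for c, w in enumerate(row)) for t in range(steps)]
            for row in weights]

def multi_scale_features(x, towers):
    """Run each (depthwise kernels, pointwise weights) tower; concatenate along channels."""
    features = []
    for dw_kernels, pw_weights in towers:
        features.extend(pointwise_conv1d(depthwise_conv1d(x, dw_kernels), pw_weights))
    return features

def squeeze_excite(x, w1, w2):
    """SE gating: global average pool -> FC + ReLU -> FC + sigmoid -> rescale each channel."""
    squeezed = [sum(ch) / len(ch) for ch in x]
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    return [[g * v for v in ch] for g, ch in zip(gates, x)], gates

random.seed(0)
T, C_IN, C_TOWER = 15, 2, 2  # T from the paper; channel widths are hypothetical
x = [[random.gauss(0, 1) for _ in range(T)] for _ in range(C_IN)]
towers = [([[random.gauss(0, 0.3) for _ in range(k)] for _ in range(C_IN)],
           [[random.gauss(0, 0.3) for _ in range(C_IN)] for _ in range(C_TOWER)])
          for k in (3, 7, 15)]
feats = multi_scale_features(x, towers)  # 3 towers x 2 channels = 6 channels
n = len(feats)
w1 = [[random.gauss(0, 0.3) for _ in range(n)] for _ in range(n // 2)]  # assumed ratio 2
w2 = [[random.gauss(0, 0.3) for _ in range(n // 2)] for _ in range(n)]
gated, gates = squeeze_excite(feats, w1, w2)
```

Because the SE gates are computed over the concatenated channels, channels belonging to one kernel scale can be amplified or suppressed relative to another, which is the sense in which the attention performs scale selection.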
Comprehensive evaluation across two public benchmarks demonstrates MSCA-Net’s superior generalization capability. On CICIoMT2024 (4.89 million flows spanning six coarse-grained and 19 fine-grained attack categories), the model achieves 99.75% accuracy and 99.77% weighted F1 on six-class detection (state of the art), with 98.98% accuracy and 98.85% weighted F1 on 19-class attribution. On WUSTL-EHMS-2020 (16,000 records with biometric enrichment), the model maintains 90.00% accuracy with the fastest training time (12.1 s) among competitive architectures. Its average rank of 2.0 across all experimental conditions is the best overall, ahead of nine strong baselines, including Transformer, ResNet, and attention-based recurrent networks.
Cross-dataset analysis reveals critical architectural insights. Pure attention mechanisms dominate large-scale data but collapse under scarcity; pure convolutional designs excel with limited data yet fail to scale. Only MSCA-Net’s unified approach, combining factorized multi-scale convolutions, adaptive channel weighting, and constrained recurrence, maintains consistent performance across heterogeneous data regimes. This stability reduces deployment risk in operational healthcare environments where network scale and data availability vary dramatically.
Efficiency metrics confirm practical deployability. Depthwise separable convolutions yield a 2.1× inference speed advantage over the attention-based ATT-BiGRU baseline, enabling real-time processing on resource-constrained medical gateways. Training times on the order of minutes (470–752 s) facilitate rapid hospital-specific adaptation without architectural re-engineering.
The identified limitations (extreme minority class detection, fixed temporal windows, and centralized training requirements) motivate future research directions. Synthetic oversampling, hierarchical temporal modeling, and federated learning extensions promise enhanced capability while preserving the core efficiency advantages established in this work.
In conclusion, MSCA-Net delivers the accuracy, efficiency, and generalization balance necessary for trustworthy intrusion detection in safety-critical Internet of Medical Things environments, advancing the state of the art in healthcare cybersecurity.
Limitations and Future Work
Several limitations motivate future research:
Adaptive Multi-Scale Fusion: Current concatenation-based fusion treats all scales equally before SE weighting. Learned fusion strategies (e.g., neural architecture search for kernel combinations) could improve scale selection.
Temporal Hierarchy: The fixed window length (T = 15) and stride (s = 15) prevent cross-window pattern detection. Hierarchical attention over variable-length segments could capture session-level attack progressions.
Federated Adaptation: IoMT networks exhibit device heterogeneity. Federated learning extensions of MSCA-Net would enable privacy-preserving model adaptation across hospitals without centralizing sensitive health data.
Explainability: While SE blocks provide channel importance scores, attention visualization for security operators remains underdeveloped. Class activation mapping (CAM) extensions to 1D temporal convolutions would enhance deployability.
Hardware Optimization: The current implementation uses PyTorch 2.1.0 with automatic mixed precision (AMP). TensorRT quantization and kernel fusion would further reduce edge inference latency for battery-powered medical devices.
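The windowing constraint noted under Temporal Hierarchy can be illustrated with a short sketch. With T = 15 and s = 15, consecutive windows share no records, so a pattern that straddles a window boundary is never seen whole. A smaller stride, shown here only as a hypothetical alternative and not a configuration evaluated in the paper, yields overlapping windows. The function name and the choice to drop a trailing partial window are assumptions.

```python
def sliding_windows(records, window=15, stride=15):
    """Split a sequence into fixed-length windows; a trailing partial window is dropped."""
    return [records[start:start + window]
            for start in range(0, len(records) - window + 1, stride)]

flows = list(range(45))  # stand-in for 45 consecutive flow records

# Reported setting: T = 15, s = 15 gives disjoint windows starting at 0, 15, and 30.
non_overlapping = sliding_windows(flows)

# Hypothetical alternative: stride 5 makes neighbouring windows share 10 records.
overlapping = sliding_windows(flows, stride=5)

print(len(non_overlapping), len(overlapping))
```

Overlap increases the number of windows to classify, so any such change trades detection coverage against inference cost.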
Cross-dataset transfer evaluation, where the model is trained on one IoMT dataset and tested on a fully independent one, represents an important direction for assessing robustness under real-world distribution shifts.
Furthermore, all inference timing experiments were conducted on GPU hardware; deployment studies on representative edge platforms, such as Raspberry Pi or NVIDIA Jetson, were not performed. On-device deployment and latency benchmarking on such resource-constrained IoMT edge hardware therefore remain a key direction for future validation.
A key limitation of the current study is the unresolved class imbalance challenge for extreme minority attack classes such as Spoofing, Recon-Ping_Sweep, and MQTT-DDoS-Publish_Flood. Future work should explore oversampling techniques such as SMOTE, generative data augmentation, and cost-sensitive focal loss variants to improve minority class detection performance.
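As one concrete form of the cost-sensitive focal loss mentioned here, the following sketch implements the standard multi-class focal loss, FL = -alpha_y (1 - p_y)^gamma log p_y. It is shown as a candidate for future experiments, not as part of the evaluated MSCA-Net training setup; gamma = 2 and the per-class alpha values are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def focal_loss(logits, label, alpha=None, gamma=2.0):
    """Focal loss for one sample; alpha is an optional list of per-class weights."""
    p_true = softmax(logits)[label]
    weight = 1.0 if alpha is None else alpha[label]
    # (1 - p_true) ** gamma shrinks the loss of confidently correct samples,
    # so hard (often minority-class) samples dominate the gradient.
    return -weight * (1.0 - p_true) ** gamma * math.log(p_true)

easy = focal_loss([5.0, 0.0], label=0)  # confident and correct: heavily down-weighted
hard = focal_loss([0.0, 5.0], label=0)  # confident and wrong: close to plain cross-entropy
```

With gamma = 0 and no alpha, the expression reduces to ordinary cross-entropy, which makes it straightforward to compare against the weighted loss already used.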