1. Introduction
The rapid development of computers and the Internet has ushered in an era of unprecedented convenience and connectivity for humankind, enabling digitalization and automation in both personal and business domains. However, this progress has also brought forth a surge in complex cybersecurity issues, encompassing large-scale data breaches [1,2], network attacks [3,4,5,6], malicious software [7,8], and vulnerabilities in Internet of Things (IoT) devices [9,10,11,12]. These issues pose significant threats to personal privacy, corporate secrets, and the security of critical infrastructure.
In recent years, the development of artificial intelligence has provided more advanced technological support for cyber attacks, posing greater challenges to cybersecurity. Consequently, continual efforts are needed to enhance cybersecurity capabilities and to ensure the secure and beneficial use of computers and the IoT in this digital age.
As an active defense technology, intrusion detection systems (IDSs) that can identify abnormal requests in the communication network, detect potential network threats, and generate alarms have gradually become a key technology to ensure network security. An IDS detects intrusion behavior by analyzing activities such as system logs, CPU and memory usage, system calls, file modifications, and real-time network traffic characteristics, which we will call views in this paper [13,14]. Multi-view analysis and learning can efficiently improve the accuracy and robustness of detection systems. By integrating the multiple views or perspectives of data mentioned above, multi-view learning enables a comprehensive analysis of the underlying patterns and characteristics of network intrusions. This approach leverages the diversity of information sources, allowing for a more holistic understanding of attacks and facilitating the identification of sophisticated and stealthy intrusion attempts that may be missed by single-view methods. Additionally, multi-view learning can effectively mitigate the impact of noisy or incomplete data, as information from different views can complement and cross-validate each other, leading to improved detection performance and reduced false positive rates [15,16,17]. Furthermore, the incorporation of multiple views in the learning process enhances the resilience of IDSs against evasion techniques employed by attackers, as any single-view modification or manipulation is less likely to go unnoticed. Thus, this paper utilizes a multi-view learning framework to enhance the accuracy, robustness, and adaptability of IDSs in the face of evolving cyber threats in systems. While we focus on the IT domain in this work, in the context of business-related zoning, multi-view techniques are also very applicable to combined IT/OT scenarios.
Based on the fused features, a classifier that determines whether anomalies occur is another important step in intrusion detection. The classifier uses the extracted features as input and applies a predefined algorithm or model to make predictions about the nature of the traffic. There are two kinds of classifiers: rule-based classifiers and machine learning classifiers. Rule-based classifiers utilize predefined rules or signatures to match against the extracted features and determine if an intrusion is present. They are effective in detecting known attacks but may struggle with new or unknown attack patterns. Machine learning algorithms, such as decision trees [18,19], SVMs [20,21,22], random forests [23,24,25], or neural networks [26,27], can be trained using labeled datasets to learn patterns and make predictions. These classifiers have the ability to detect both known and unknown attack patterns by learning from historical data.
Although neural networks dominate modern classification tasks, SVMs remain relevant in specific scenarios. Firstly, SVMs exhibit superior performance on small-scale datasets as they minimize structural risk rather than empirical risk, enabling robust generalization with limited labeled samples. Secondly, SVMs provide interpretable decision boundaries through kernel-induced feature mapping, which is critical for security-critical applications requiring transparent reasoning. Thirdly, SVMs offer computational efficiency in high-dimensional spaces, avoiding the heavy parameter tuning and computational costs associated with deep neural networks, making them suitable for resource-constrained edge devices in distributed networks.
However, the traditional IDS follows a structured workflow, called a pipeline system, in which feature learning extracts relevant features from the data and a classifier, such as an SVM, uses these features to decide whether an attack is present [24,28,29]. The goal of feature learning is to transform the raw data into a more informative and condensed representation that enhances the detection capabilities of the subsequent classifier. Such pipelines suffer from a feature–classifier mismatch problem: the extracted features may not be optimally aligned with the requirements of the classifier. The features learned during the feature extraction step may not fully capture the discriminatory aspects relevant to the classifier, leading to sub-optimal detection accuracy. Traditional neural network-based approaches can combine feature learning and classification into a single network by adding a softmax layer after the feature learning layers, but they require a large amount of data for parameter learning, whereas traditional machine learning approaches, such as SVMs and decision trees, perform well on limited datasets. Thus, in this paper, we propose a neural network-based SVM (NSVM) approach for intrusion detection, where an SVM layer is connected to the feature learning layers. Specifically, we use an auto-encoder to fuse multiple views and extract discriminative features from the hidden layers, and an SVM layer is connected to the hidden layers for classification. In this way, feature fusion and classification are combined into a single neural network and optimized jointly. We call the proposed approach an auto-encoder neural network SVM (AE-NSVM).
Moreover, we propose to train the models using multi-view, fusion-based federated learning. Federated learning is a decentralized machine learning approach that addresses the challenges of distributed learning and privacy protection. Federated learning trains the model on local devices, and it then transmits the encrypted model parameters to a central server. Next, the model on the central server is updated by aggregating received parameters. First, to fuse the multiple views describing the states of various aspects of hosts and networks, we utilize an auto-encoder (AE) to learn representative features; second, the fused features are fed into an SVM to classify the samples into normal or seven other kinds of attacks. Different from the traditional pipeline system that includes two separated sequential steps of feature learning and classification, we propose an AE-NSVM model wherein an SVM layer is connected to the AE hidden layer to simultaneously perform feature learning and classification. In AE-NSVM, feature learning aims to learn representative information from multiple views, as well as improve classification results by combining the two processes together. Finally, the proposed AE-NSVM models are created on multiple clients using a federated learning strategy. We implemented four kinds of AE to compare the feature learning performances.
The contributions of this paper are as follows.
Multi-view fusion: We propose an AE-based multi-view fusion approach for intrusion detection. To leverage heterogeneous intrusion data, we integrate multiple modalities to capture complementary intrusion patterns;
Joint loss optimization: We integrate the AE multi-view reconstruction loss and the SVM hinge loss into a single objective to align feature learning with the classification objective;
End-to-end feature-classifier integration: Direct connection between the AE hidden layer and SVM layer eliminates information loss in pipeline-based approaches. Experimental results show a 1.4% F1-measure improvement over separate AE+SVM frameworks;
Systematic evaluation of AE variants: Convolutional auto-encoder (CAE) is identified as optimal (F1-measure 0.781) due to its superiority in extracting local spatial correlations from high-dimensional data.
The rest of this paper is organized as follows.
Section 2 introduces the related work on current intrusion detection technologies.
Section 3 proposes an AE-NSVM approach for intrusion detection.
Section 4 presents an approach to analyze the anomaly in different clients using a federated learning strategy.
Section 5 presents the experimental results and analysis of the proposed approaches.
2. Related Work
Feature learning, as the first important step of intrusion detection, enables the automatic extraction of relevant and discriminative patterns from network data. Multi-view feature fusion further enhances the effectiveness of intrusion detection by integrating information from multiple perspectives or data sources. This approach harnesses the complementary nature of different views, such as network traffic, system logs, or host states, to create a more comprehensive and accurate understanding of the network environment. He et al. [30] proposed a multimodal approach as a multi-view technique for intrusion detection. Features at different levels are extracted from the network connection, rather than the single long feature vector used in the traditional approach, so that feature information can be processed separately and more efficiently. The authors were able to significantly improve the detection accuracy on a variety of datasets, including outdated and novel ones. However, the authors did not take into account how network traffic behavior changes over time. On the other hand, Li et al. [31] proposed a multi-view approach for spam detection in resource-constrained environments. The authors assumed a semi-supervised setting, wherein the multi-view setting was used to label other events. Although the authors considered a more realistic scenario with model updates, the dataset they used does not span a long period and thus contains no natural behavior changes. The authors in [32] proposed a semi-supervised co-training approach that exploits the multi-view nature of attacks. In this approach, the attack behavior is maintained in multiple views, and attack detection is performed using the predictions of ML models trained on the multiple views of an attack. They implemented their research in a centralized manner and utilized an active labeling procedure in which experts label unknown attacks. The researchers in [33] introduced multi-view features of MQTT data and evaluated the features using centralized ML algorithms. All of these works adopt a centralized methodology. Attota et al. [34] proposed a federated learning-based intrusion detection approach called multi-view federated learning intrusion detection (MV-FLID), which trains on multiple views of IoT network data in a decentralized format to detect, classify, and defend against attacks. The multi-view ensemble learning aspect helps in maximizing the learning efficiency of different classes of attacks.
The distributed nature of multiple clients often necessitates that the data within these clients comply with privacy protection requirements. Traditional centralized intrusion detection models require all data to be aggregated at a central data center for model training. This centralized network model structure conflicts with the inherently distributed nature of federated learning environments and fails to protect private data. Therefore, centralized intrusion detection models are no longer adequate to meet the security requirements of such distributed network structures. Federated learning matches the distributed nature of trusted regions and protects local data privacy by only passing model parameters. The distributed nature of the approach ensures that training data confidentiality is maintained on the devices, while the shared model gains from the pooled knowledge of all the devices. Nguyen et al. [35] introduced federated learning into the intrusion detection field for the first time and presented a self-learning distributed system in which the federated learning model performs intrusion detection by detecting device status. Zhao et al. [36] constructed a federated learning model based on Convolutional Neural Networks (FedACNN), and the experimental results showed that FedACNN can reach a detection accuracy of up to 99.7%. Li [37] first proposed a federated learning method based on a real industrial environment, which achieved experimental results superior to traditional machine learning algorithms. Campos et al. [38] verified that different data distributions have a significant impact on the effectiveness of federated learning. To balance sample distribution, the authors proposed a sampling method based on Shannon entropy, which achieved good experimental results.
Despite advancements in multi-view intrusion detection, three critical limitations persist in existing approaches.
Objective misalignment in sequential pipelines: Traditional multi-view frameworks (e.g., cascaded auto-encoder classifier designs) decouple feature extraction and classification into independent, sequential stages. This separation forces feature extractors to optimize reconstruction-oriented losses (e.g., MSE for auto-encoders) without direct guidance from downstream classification objectives, resulting in features that are suboptimally discriminative for intrusion detection tasks and prone to information loss during stage-wise transfer;
Insufficient multi-view fusion in federated learning: Federated learning applications in intrusion detection primarily focus on single-view data, neglecting complementary insights from heterogeneous modalities. Additionally, systematic evaluations of feature fusion models (e.g., auto-encoder variants) under distributed settings remain scarce, hindering the balance between detection accuracy and computational efficiency;
Non-end-to-end integration of hybrid models: Hybrid architectures combining neural networks (e.g., auto-encoders) and traditional classifiers (e.g., SVMs) adopt a rigid “feature extraction → classification” pipeline without cross-module joint training. Unlike end-to-end models, this design prevents the backward propagation of classification loss to the feature extractor, leaving neural network parameters optimized solely for reconstruction rather than discriminative feature learning. This disconnection of parameter updates limits the model’s ability to adapt features to intrusion detection-specific decision boundaries.
To address these limitations, this study proposes a multi-view federated learning framework integrating Auto-Encoder Neural SVM (AE-NSVM). This approach directly connects an SVM classification layer to the hidden layer of an auto-encoder (AE), enabling end-to-end joint optimization of multi-view feature fusion and intrusion classification. By unifying the reconstruction loss (for feature learning) and hinge loss (for classification) into a single objective function, the framework aligns feature representation with detection tasks, mitigating the information loss in traditional pipelines. Furthermore, the proposed method systematically evaluates four AE variants to identify optimal multi-view fusion performance, and it extends the model to federated learning, ensuring privacy preservation while enhancing detection accuracy and computational efficiency across distributed clients.
3. Proposed Approach
In this section, we describe the architectures of the proposed approaches. Mainly, we utilize AEs to fuse five views that provide different perspectives on the host activities to take advantage of the complementary information in each view, as shown in the lower part of the architecture. Multi-view fusion techniques can also help to reduce false positives and increase the overall detection rate [39,40,41]. The fused features are used for intrusion detection using an SVM classifier. We refer to these separate processes of feature fusion and classification as pipeline systems. However, the pipeline system treats feature generation and classification as distinct processes, where specialized algorithms are employed for feature generation, with no explicit optimization for classification. This dichotomy may reduce the detection performance of the system. Thus, we use a joint training strategy that combines the two processes within a single neural network architecture, facilitating the acquisition of features that are specifically optimized for classification purposes. In this paper, we introduce AE Neural SVM (AE-NSVM), an architecture that uses AEs to fuse the multiple views of the host and an SVM layer for classification. In order to effectively address the issues of data privacy and security in a distributed environment, while also enhancing the effectiveness and efficiency of machine learning, we propose to apply our AE-NSVM model to multiple clients, and federated learning is utilized to train the model parameters.
Figure 1 shows the architecture of the pipeline system of AE and SVMs. The training process includes two steps of feature fusion using an AE and SVM classification. Firstly, in the encoding part of the AE, the data from five views are embedded to vectors with fixed dimensions, and then the five vectors are concatenated together and fed into a bottleneck layer. In the decoding part, the five views are reconstructed based on the features from the hidden layer. The loss is computed by the reconstruction errors of each view and summed together. Secondly, the input data of the five views are fed into the trained AE, and the corresponding fused features are obtained from the bottleneck layer, which are then fed into the SVM for classification. The relationship between the network, layers, loss function, and the optimizer is shown in Figure 1.
To align the objective function for the feature fusion with the final classification objective, we unify feature fusion and SVM classification in a single neural network framework by replacing the SVM classifier with a non-linear support vector output layer. During training, the margin-based loss for this support vector output layer and the reconstruction loss for the auto-encoder are simultaneously computed, and the entire network is optimized using the gradient descent algorithm. The training process of the proposed AE-NSVM is shown in Figure 2.
The architecture of the proposed AE-NSVM-based federated learning approach involves training a shared global model on three clients while keeping the data on each client local. The federated learning process follows these steps: First, a global model is initialized and distributed to clients. Each client then trains the model locally using its own data. Next, the server aggregates the model updates from all clients. These steps are repeated for multiple rounds until the global model converges. Throughout the process, client data remains decentralized, ensuring privacy and security. By leveraging the collective knowledge learned from distributed datasets, federated learning enables collaborative model training while preserving data privacy.
3.1. Auto-Encoder-Based Multi-View Fusion
Multi-view fusion enables the incorporation of complementary information from multiple views and gains a more comprehensive and accurate understanding of potential threats and anomalies, leading to improved detection performance and a reduction in false positives.
The multi-view fusion process of the proposed auto-encoder (AE) is illustrated in Figure 3, which outlines the workflow from raw view input to fused feature learning and view reconstruction. The framework takes five heterogeneous views (denoted as $x_1$ to $x_5$) as input, each representing distinct aspects of the host activities (e.g., memory usage, processor metrics, disk, process behavior, and network traffic). First, each view undergoes independent encoding through view-specific encoder modules. These encoders transform raw features into compact latent representations (denoted as $z_1$ to $z_5$), capturing view-specific patterns. The latent vectors from all views are then concatenated into a unified feature vector $Z$, which is fed into a bottleneck layer to learn cross-view correlations and generate a fused feature representation $H$. To ensure the fused features retain discriminative information from all input views, the framework includes a decoder module that reconstructs each original view from the fused feature $H$. The reconstruction loss between input views and their decoded counterparts (e.g., $x_1$ and $\hat{x}_1$) drives the encoder to preserve critical information during feature fusion. This dual process of encoding–decoding not only integrates multi-view data, but also ensures the fused features are both representative and reconstructively accurate, laying a foundation for downstream intrusion detection tasks.
Figure 3 shows the feature fusion process. There are five views in the AE, and each view, denoted as $x_i$ ($i = 1, \dots, 5$), is projected into a dense vector representation, denoted as $z_i$, using an encoder function $f_i$. The encoding process can be represented as follows:
$$z_i = f_i(x_i), \quad i = 1, \dots, 5.$$
Then, the dense vectors of the five views, $z_1, \dots, z_5$, are concatenated together into a single vector representation, denoted as $Z$:
$$Z = [z_1; z_2; z_3; z_4; z_5].$$
The concatenated vector $Z$ is then projected into a hidden layer, which is represented by a transformation matrix $W$ and a bias term $b$, and this is achieved using an activation function ReLU:
$$H = \mathrm{ReLU}(W Z + b). \tag{3}$$
In the decoding part, each view is reconstructed by an embedding layer followed by an output layer:
$$\hat{z}_i = g_i(H), \qquad \hat{x}_i = o_i(\hat{z}_i),$$
where $g_i$ and $o_i$ denote the embedding layer and the output layer of the $i$-th view, respectively.
Similar to the standard auto-encoder, the quality of the auto-encoder output is evaluated by comparing the reconstructed views with their corresponding original views. The reconstruction loss, typically measured using a suitable loss function such as the mean squared error (MSE), quantifies the dissimilarity between the original views $x_i$ and their reconstructed counterparts $\hat{x}_i$:
$$L_{rec} = \sum_{i=1}^{5} L_{rec}^{(i)}, \qquad L_{rec}^{(i)} = \frac{1}{d_i} \left\| x_i - \hat{x}_i \right\|_2^2,$$
where $d_i$ is the dimension of the $i$-th view.
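The encode, concatenate, bottleneck, and decode flow described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the view sizes, layer widths, random initialisation, and the use of a single linear layer per encoder and decoder are all hypothetical choices made to keep the example short.

```python
import random

def linear(W, b, x):
    """Affine map W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(v):
    return [max(0.0, a) for a in v]

def init(rows, cols, rng):
    """Small random weight matrix and zero bias (illustrative initialisation)."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return W, [0.0] * rows

def forward(views, enc, bottleneck, dec):
    # Per-view encoding: one dense vector per view.
    z = [relu(linear(W, b, x)) for (W, b), x in zip(enc, views)]
    # Concatenation of the per-view vectors into one vector.
    Z = [a for zi in z for a in zi]
    # Bottleneck layer with ReLU gives the fused feature H.
    H = relu(linear(*bottleneck, Z))
    # Decoding: every view is reconstructed from the shared H.
    recon = [linear(W, b, H) for (W, b) in dec]
    return H, recon

def reconstruction_loss(views, recon):
    """Sum over views of the per-view mean squared error."""
    return sum(
        sum((a - r) ** 2 for a, r in zip(x, xh)) / len(x)
        for x, xh in zip(views, recon)
    )

rng = random.Random(0)
view_dims = [4, 3, 5, 2, 6]   # hypothetical sizes of the five views
latent, hidden = 3, 8          # hypothetical layer widths

enc = [init(latent, d, rng) for d in view_dims]
bottleneck = init(hidden, latent * len(view_dims), rng)
dec = [init(d, hidden, rng) for d in view_dims]

views = [[rng.random() for _ in range(d)] for d in view_dims]
H, recon = forward(views, enc, bottleneck, dec)
loss = reconstruction_loss(views, recon)
print(len(H), round(loss, 4))
```

In the pipeline system, only `H` would be passed on to the classifier once the auto-encoder has been trained on `loss`.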
3.2. SVM-Based Intrusion Detection
Support Vector Machines (SVMs) are a powerful class of machine learning algorithms used for classification tasks. The key function of SVMs is to find an optimal hyperplane that maximally separates classes or fits the regression line with the largest margin. This margin allows SVMs to be robust to noise and generalize well to unseen data. For multi-class classification, the one-vs-rest strategy is used to train multiple binary SVM classifiers, each one distinguishing between one class and the rest of the classes.
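From Equation (3), we obtain the fused feature vector $H_j \in \mathbb{R}^{D}$ of dimension $D$ for the $j$-th sample. Each sample also has a corresponding label $y_j$, where $y_j$ takes values from 1 to 8, representing the eight classes. For each class $k$ from 1 to 8, a binary SVM classifier is trained to distinguish between class $k$ and the rest of the classes. The training set for class $k$ is denoted as $\{H_j\}_{j=1}^{N}$ and the corresponding labels as $\{y_j^{(k)}\}_{j=1}^{N}$.
For the $k$-th classifier, the original labels $y_j$ are transformed into binary labels:
$$y_j^{(k)} = \begin{cases} +1, & y_j = k \\ -1, & y_j \neq k \end{cases}$$
for each class $k$. These are then used to solve the SVM optimization problem using a quadratic programming solution:
$$\min_{w_k, b_k, \xi} \; \frac{1}{2} \|w_k\|^2 + C \sum_{j=1}^{N} \xi_j \quad \text{s.t.} \quad y_j^{(k)} \left( w_k^{\top} H_j + b_k \right) \geq 1 - \xi_j, \;\; \xi_j \geq 0.$$
Here, $w_k$ represents the weight vector, $b_k$ is the bias term, $\xi_j$ are the slack variables, and $C$ is the regularization parameter that controls the trade-off between maximizing the margin and minimizing misclassifications.
To predict the class for a new input sample $H^{*}$, first calculate the decision function for each class $k$ from 1 to 8:
$$f_k(H^{*}) = w_k^{\top} H^{*} + b_k.$$
The predicted class for $H^{*}$ is the one with the highest decision value:
$$\hat{y} = \arg\max_{k \in \{1, \dots, 8\}} f_k(H^{*}).$$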
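The one-vs-rest label transformation and the highest-decision-value prediction rule can be illustrated as follows. This sketch assumes linear decision functions and uses three classes in two dimensions for brevity; the weights in `classifiers` are hypothetical, hand-picked values, and the quadratic programming step that would actually fit them is deliberately left out.

```python
def to_binary_labels(labels, k):
    """One-vs-rest relabelling: +1 for class k, -1 for every other class."""
    return [1 if y == k else -1 for y in labels]

def decision_value(w, b, h):
    """Linear decision function w . h + b for one binary classifier."""
    return sum(wi * hi for wi, hi in zip(w, h)) + b

def predict(classifiers, h):
    """Return the class whose binary classifier gives the highest decision value.

    `classifiers` maps class index k to a (weights, bias) pair.
    """
    return max(classifiers, key=lambda k: decision_value(*classifiers[k], h))

# Hypothetical, hand-picked classifiers for illustration only.
classifiers = {
    1: ([1.0, 0.0], 0.0),
    2: ([0.0, 1.0], 0.0),
    3: ([-1.0, -1.0], 0.5),
}

print(to_binary_labels([1, 2, 3, 2], 2))
print(predict(classifiers, [0.2, 0.9]))
```

With eight classes, the dictionary would simply hold eight (weights, bias) pairs, one per binary classifier.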
3.3. AE Neural SVM for Intrusion Detection
Section 3.1 and Section 3.2 describe a typical pipeline system that includes two steps of feature fusion and classification, and each step has its individual optimization function. The inconsistency of the objective functions may diminish the performance of classification. Motivated by this observation, we introduce an AE Neural SVM architecture that combines the benefits of both auto-encoder and SVM models for enhanced classification performance. Our approach involves incorporating an SVM layer following the hidden AE layer during joint training, as shown in Figure 2. By doing so, we can optimize both the reconstruction loss of the auto-encoder and the hinge loss of the SVM simultaneously. This joint training enables the auto-encoder to learn more informative and task-specific representations, potentially leading to enhanced classification performance.
Figure 2 shows the process of the proposed AE Neural SVM. By adding an SVM layer to the hidden layer of AE, the hinge loss is computed as follows:
$$L_{hinge} = \frac{1}{N} \sum_{j=1}^{N} \sum_{k=1}^{8} \max\left(0, \; 1 - y_j^{(k)} f_k(H_j)\right),$$
where $f_k(H_j)$ represents the output of the decision function for the given feature vector $H_j$ from the hidden layer of AE, and $y_j^{(k)} \in \{+1, -1\}$ is the one-vs-rest label of sample $j$ for class $k$. The loss of AE-NSVM is computed as follows:
$$L = \lambda L_{hinge} + (1 - \lambda) \sum_{i=1}^{5} L_{rec}^{(i)}, \tag{12}$$
where $\lambda$ is a scalar between 0 and 1, the optimal $\lambda$ is determined by tuning on a development set, and $L_{rec}^{(i)}$ is the reconstruction error for each view.
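To make the joint objective concrete, the sketch below computes a one-vs-rest hinge loss for a single sample and mixes it with per-view reconstruction errors. The convex weighting (the mixing scalar on the hinge term and one minus that scalar on the summed reconstruction errors) is an assumption made for illustration, since the text only states that the scalar lies between 0 and 1; the scores and error values are likewise hypothetical.

```python
def hinge_loss(scores, y):
    """One-vs-rest hinge loss for one sample.

    scores[k - 1] is the decision value of the classifier for class k,
    and y is the true class index (1-based).
    """
    total = 0.0
    for k, score in enumerate(scores, start=1):
        target = 1.0 if y == k else -1.0
        total += max(0.0, 1.0 - target * score)
    return total

def joint_loss(scores, y, rec_errors, lam):
    """Weighted sum of hinge loss and summed per-view reconstruction errors.

    The lam / (1 - lam) split is an assumed form of the weighting.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * hinge_loss(scores, y) + (1.0 - lam) * sum(rec_errors)

# Hypothetical decision values for three classes and three view errors.
scores = [2.0, -0.5, -1.5]
rec_errors = [0.1, 0.2, 0.3]
print(hinge_loss(scores, 1))
print(joint_loss(scores, 1, rec_errors, lam=0.5))
```

Setting `lam` to 0 recovers a plain auto-encoder objective, while `lam` equal to 1 trains the encoder for classification only, which is why the value is tuned on a development set.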
3.4. Theoretical Complexity Analysis
To further validate the efficiency of the proposed framework, this section theoretically compares the spatial complexity and temporal complexity of three multi-view models: the proposed AE-NSVM, the pipeline-based AE-SVM, and the baseline multi-view DNN (MV-DNN). The AE-NSVM architecture consists of V view-specific encoders and V symmetric decoders. A hinge loss layer directly connected to the bottleneck layer performs classification, enabling joint optimization of AE reconstruction loss and SVM hinge loss. The AE-SVM contains a multi-view AE that is identical to the AE part of AE-NSVM, as well as an independent SVM classifier. The MV-DNN is a multi-view deep neural network containing a multi-input layer, a concatenation layer, and a softmax layer.
3.4.1. Spatial Complexity
Spatial complexity is defined as the total number of trainable parameters, and it is derived from model components, such as encoders, decoders, fusion layers, and classifiers. Equation (13) describes the spatial complexity of AE-NSVM:
$$S_{AE\text{-}NSVM} = V \left( P_{enc} + P_{dec} \right) + B + P_{hinge}, \tag{13}$$
where $V$ denotes the number of views, $P_{enc}$ represents the parameters of one view-specific encoder, $P_{dec}$ is the parameters of one view-specific decoder, $B$ denotes the bottleneck layer parameters, and $P_{hinge}$ is the parameters of the hinge loss classifier.
Equation (14) describes the spatial complexity of AE-SVM, including parameters of the multi-view AE and independent SVM:
$$S_{AE\text{-}SVM} = V \left( P_{enc} + P_{dec} \right) + B + P_{svm}, \tag{14}$$
where $V$, $P_{enc}$, $P_{dec}$, and $B$ are defined identically to AE-NSVM, and $P_{svm}$ denotes the parameters of the independent SVM classifier.
Equation (15) describes the spatial complexity of MV-DNN, excluding decoder parameters as follows:
$$S_{MV\text{-}DNN} = V P_{dnn} + F + P_{softmax}, \tag{15}$$
where $V$ is the number of views, $P_{dnn}$ represents the parameters of one view-specific DNN encoder, $F$ denotes the concatenation layer parameters, and $P_{softmax}$ is the parameters of the softmax classifier.
From Equations (13) and (14), we can see that AE-NSVM exhibits lower spatial complexity than AE-SVM, primarily due to the end-to-end integration of classification into the AE framework. Both models share identical multi-view AE components, but their classification modules differ fundamentally. AE-NSVM replaces AE-SVM’s independent SVM with a lightweight hinge loss layer. The SVM’s large parameter count stems from its support vector coefficients, which scale with dataset size, whereas the hinge loss layer uses fixed-size linear weights, achieving parameter efficiency without sacrificing discriminative power. AE-NSVM has slightly higher spatial complexity than MV-DNN, which is attributed to its multi-view decoders. MV-DNN omits decoders entirely, relying solely on encoders and softmax layers for classification.
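The difference between a fixed-size hinge loss layer and an independent kernel SVM can be made tangible with a rough parameter count. The accounting below is an illustrative assumption rather than the paper's own figures: it treats the hinge layer as one dense layer from a D-dimensional fused feature to K classes, and it charges each one-vs-rest kernel SVM for its stored support vectors, one dual coefficient per support vector, and one bias. The feature size and support vector counts are hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def hinge_layer_params(feature_dim, num_classes):
    """Hinge loss layer modelled as one dense layer: size is fixed by D and K."""
    return dense_params(feature_dim, num_classes)

def kernel_svm_params(feature_dim, support_vectors_per_class):
    """One-vs-rest kernel SVMs: each stores its support vectors
    (feature_dim values each), one dual coefficient per support vector,
    and a single bias term."""
    return sum(s * (feature_dim + 1) + 1 for s in support_vectors_per_class)

D, K = 64, 8                      # hypothetical fused-feature size and class count
print(hinge_layer_params(D, K))
print(kernel_svm_params(D, [500] * K))
```

The first count depends only on D and K, while the second grows with the number of support vectors and therefore, typically, with the size of the training set.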
3.4.2. Temporal Complexity
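Temporal complexity is measured by floating-point operations (FLOPs) during training and inference, with $N$ as the sample size, $T$ as the training epochs, and $B$ as the batch size (in this subsection, $B$ denotes the batch size rather than the bottleneck layer parameters). Equations (16)–(18) describe the training complexity (FLOPs) of the three models:
$$C^{train}_{AE\text{-}NSVM} = T \cdot \frac{N}{B} \cdot \left( V F_{ae} + F_{bn} + F_{hinge} \right), \tag{16}$$
where $T$ is the training epochs, $N$ is the sample size, $B$ is the batch size, $F_{ae}$ is the encoder/decoder FLOPs, $F_{bn}$ denotes the bottleneck FLOPs, and $F_{hinge}$ denotes the hinge loss FLOPs.
$$C^{train}_{AE\text{-}SVM} = T \cdot \frac{N}{B} \cdot \left( V F_{ae} + F_{bn} \right) + F_{QP}, \tag{17}$$
where $F_{QP}$ denotes SVM quadratic programming complexity.
$$C^{train}_{MV\text{-}DNN} = T \cdot \frac{N}{B} \cdot \left( V F_{dnn} + F_{cat} + F_{softmax} \right), \tag{18}$$
where $F_{dnn}$ is the DNN encoder FLOPs per view, $F_{cat}$ is the concatenation layer FLOPs, and $F_{softmax}$ denotes the softmax FLOPs.
Equations (16)–(18) indicate that AE-NSVM balances efficiency and modularity, outperforming AE-SVM in large-scale scenarios (no $F_{QP}$ bottleneck) and MV-DNN in classification layer efficiency (hinge loss < softmax). Its decoder overhead is offset by end-to-end optimization, making it the most practical choice for multi-view intrusion detection.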
Inference complexity focuses on forward propagation, which is critical for real-time applications. For the three models, the inference FLOPs are derived as follows.
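During inference, AE-NSVM omits decoders, retaining only encoders, the bottleneck, and hinge loss layers. Its inference complexity is given by Equation (19):
$$C^{inf}_{AE\text{-}NSVM} = N \cdot \left( V F_{enc} + F_{bn} + F_{hinge} \right), \tag{19}$$
where $N$ is the inference sample size, $F_{enc}$ is the total FLOPs of a view-specific encoder, $F_{bn}$ is the bottleneck FLOPs, and $F_{hinge}$ is the FLOPs of the hinge loss inference.
AE-SVM requires AE feature extraction followed by independent SVM inference, leading to higher latency. Its inference complexity is expressed as Equation (20):
$$C^{inf}_{AE\text{-}SVM} = N \cdot \left( V F_{enc} + F_{bn} + F_{svm} \right), \tag{20}$$
where $N$, $F_{enc}$, and $F_{bn}$ are defined as above, and $F_{svm}$ is the FLOPs of the SVM inference.
MV-DNN performs full forward passes through DNN encoders, concatenation, and FC layers during inference. Its inference complexity is given by Equation (21):
$$C^{inf}_{MV\text{-}DNN} = N \cdot \left( V F_{dnn} + F_{cat} + F_{softmax} \right), \tag{21}$$
where $N$, $F_{dnn}$, $F_{cat}$, and $F_{softmax}$ are defined identically to training complexity.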
Compared to AE-SVM, AE-NSVM offers a critical advantage in training efficiency by eliminating the high-complexity term associated with independent SVM training (due to quadratic programming), instead integrating a lightweight end-to-end hinge loss layer that avoids this super-linear cost. Relative to MV-DNN, the decoders that AE-NSVM adds for training are omitted during deployment, so only the encoders, the bottleneck, and the hinge loss layer remain on the inference path. The hinge loss layer further accelerates inference with its linear operation, making AE-NSVM fast and scalable for real-time intrusion detection tasks.
3.5. Federated Learning
Federated learning offers the advantage of training machine learning models with decentralized data sources while preserving data privacy. By allowing participants to collaborate and contribute their local knowledge without sharing raw data, federated learning ensures confidentiality while improving model accuracy and adaptability. This approach enables scalable and efficient training, making it ideal for intrusion detection.
Figure 4 shows the proposed architecture of AE-NSVM for federated learning. Assuming there are $n$ clients denoted as $c_1, \dots, c_n$, FL includes the following five steps:
(1) A global model $G$ is initialized on the centralized server;
(2) The parameters of global model $G$ are sent to each client;
(3) Each client fine-tunes the global model on its local data to obtain the updated model $G_i$;
(4) The parameters of the updated model are sent back to the server and aggregated to form a new global model;
(5) The process from Step 2 to Step 4 is iterated until convergence or until the iteration number $T$ is reached.
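For $n$ clients, federated learning aims to optimize the following equation:
$$\min f = \frac{1}{n} \sum_{i=1}^{n} f_i(\theta_i),$$
where $f$ is the global optimization objective, $\theta_i$ is the parameters of local model $G_i$, and $f_i$ is the objective defined by the local client $c_i$.
To solve the federated optimization problem, model $G_i$ is trained on local client $c_i$ to find an optimized $\theta_i$, and then $\theta_i$ is sent to the server for aggregation with algorithm $\mathcal{A}$ (FedAvg [42] in this paper) to obtain the global parameter $\theta$:
$$\theta = \mathcal{A}(\theta_1, \dots, \theta_n).$$
The global parameter $\theta$ is then distributed to the clients as their new $\theta_i$. Each client trains its corresponding local model starting from this new $\theta_i$. The clients and server repeat these processes until $\theta_i$ converges to $\theta$ or the iteration number $T$ is reached.
$\theta_i$ converging to $\theta$ means that the value of $\theta_i$ approaches $\theta$ arbitrarily closely, which can be described by the following equation:
$$\Delta = \sum_{i=1}^{n} \left\| \theta_i - \bar{\theta} \right\| \rightarrow 0, \qquad \bar{\theta} = \frac{1}{n} \sum_{i=1}^{n} \theta_i,$$
where $\Delta$ is the overall difference between the $\theta_i$ and their average $\bar{\theta}$.
Furthermore, each objective $f_i$ is the minimization of the loss function from Equation (12) using an algorithm such as SGD (stochastic gradient descent).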
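The server-side aggregation and the repeated distribute, train, aggregate loop can be sketched as follows. This is a toy illustration under stated assumptions, not the experimental setup: parameters are flat lists of floats, `fed_avg` weights each client by its number of local samples as FedAvg does, and `local_update` is a hypothetical stand-in for local training that takes gradient steps on a simple quadratic objective instead of the AE-NSVM loss.

```python
def fed_avg(client_params, client_sizes):
    """Sample-count-weighted average of client parameter vectors."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(n * p[j] for p, n in zip(client_params, client_sizes)) / total
        for j in range(dim)
    ]

def divergence(client_params):
    """Sum of Euclidean distances between each client's parameters and their mean."""
    n = len(client_params)
    dim = len(client_params[0])
    mean = [sum(p[j] for p in client_params) / n for j in range(dim)]
    return sum(
        sum((p[j] - mean[j]) ** 2 for j in range(dim)) ** 0.5
        for p in client_params
    )

def local_update(theta, target, lr=0.5, steps=1):
    """Hypothetical local training: gradient steps on 0.5 * ||theta - target||^2."""
    for _ in range(steps):
        theta = [t - lr * (t - c) for t, c in zip(theta, target)]
    return theta

# Three toy clients whose local optima differ (hypothetical values).
targets = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]]
sizes = [100, 100, 100]

global_theta = [0.0, 0.0]                      # Step 1: initialise the global model
for _ in range(20):                            # Step 5: repeat for a fixed number of rounds
    local = [local_update(list(global_theta), t) for t in targets]  # Steps 2 and 3
    global_theta = fed_avg(local, sizes)       # Step 4: aggregate on the server

print([round(v, 3) for v in global_theta])
```

In this toy setting the global parameters settle near the average of the client optima; with real, non-identically distributed client data the behaviour depends on the number of local steps and the learning rate.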