Deep Learning-Enabled Heterogeneous Transfer Learning for Improved Network Attack Detection in Internal Networks

Introduction
In recent years, with the ubiquitous application of the internet and mobile communications [1], network attackers have had more opportunities to compromise devices and applications in order to sabotage infrastructure or steal valuable data [2]. In addition, due to system and application vulnerabilities, attack methods have become increasingly sophisticated, which poses an unprecedented challenge to the field of cybersecurity. The ability to detect and respond to these novel and evolving threats in real time has become critical for safeguarding sensitive information and ensuring the integrity of digital assets [3,4].
Network attack detection has been the subject of extensive research, yielding many approaches for identifying and categorizing malicious activities. Within this landscape, traditional techniques embedded within intrusion detection systems (IDS), including signature-based and anomaly-based methods, have played a pivotal role in detecting well-established attack patterns [5][6][7]. However, these methods have limitations, particularly when confronted with previously unseen or novel attack strategies. One notable drawback arises from their reliance on static signature databases or profiles of normal behavior. Signature-based methods require ongoing database curation, which places a substantial burden on human experts, who must identify, analyze, and catalog new attack signatures. For anomaly-based methods to maintain a high level of detection accuracy, the normal patterns must be constantly updated to encompass emerging normal behaviors in the face of new threats and evolving attacks. Moreover, the process of updating the signature database introduces an inherent time delay, which can hinder the timely detection of emerging threats and compromise network security [8]. These challenges underscore the critical need for more adaptive and proactive approaches to network attack detection.
To address this ever-growing concern, researchers have turned to machine learning techniques to extract features and build models [7,9,10]. Machine learning, particularly deep learning, has demonstrated its potential in learning intricate patterns and features from data, making it an attractive approach for network security applications [11,12]. Deep learning has achieved impressive results across domains that include computer vision, natural language processing, and speech recognition. In network detection, deep learning can be used for representation learning to automatically discover the features needed for detection, as the collected data features do not directly reflect user behaviors in networks. It can also be used to learn complicated user behavior sequences with a recurrent neural network (RNN) or the recently popular Transformer, and to learn relationships and interactions between entities in networks for anomaly detection with a graph neural network (GNN) [11]. The construction of robust and accurate classification models typically relies on substantial amounts of labeled data to effectively capture the intricacies of network attacks. However, in the context of internal networks, acquiring sufficient labeled data for training presents a significant challenge due to the sensitive nature of the data and the inherent difficulty of obtaining real-world attack samples [13]. As a result, model training often relies on insufficient data samples from outdated datasets, which significantly degrades the performance of machine learning algorithms.
To overcome this limitation, researchers have turned their attention to transfer learning as a viable solution [14][15][16][17][18][19]. Transfer learning leverages knowledge from source domains with labeled data and applies it to target domains where labeled samples may be scarce or entirely absent. This approach has shown promise in network attack detection tasks, as it facilitates the extraction of relevant knowledge from external networks to improve the detection capabilities of models operating in internal network environments.
In an internal network, where labeled data are notably scarce, constructing a robust prediction model for the detection of network attacks is a formidable challenge. It is therefore imperative to explore avenues for model enhancement, and one such approach is to integrate data from external networks by leveraging the principles of transfer learning. However, introducing data from external sources brings a significant caveat: the inherent diversity among networks often results in disparities between the feature spaces of their respective data collections.
The main hurdle in applying transfer learning to network attack detection lies in addressing the heterogeneity in feature spaces and probability distributions between source and target domains. Because communication network types, service categories, and data acquisition techniques may vary across the two domains, the feature spaces collected from these networks can differ. These differences arise from the architecture, protocols, and operational objectives of each network, all of which influence the types of data collected and the resulting feature representations. Consequently, these disparities in feature spaces pose a fundamental challenge when attempting to align distributions and extract valuable insights from datasets of distinct network domains. Specifically, the crux of the matter is how to identify an intermediate, universally applicable data representation capable of bridging the discrepancies across disparate feature spaces. In addition to bridging the feature spaces, an equally critical objective is to align the probability distributions of the datasets originating from the corresponding networks, as in most domain adaptation work [20,21]. Recent studies have highlighted the importance of aligning these distributions to ensure the effective transfer of knowledge and improve model generalization. However, few studies have focused on deep learning-enabled heterogeneous transfer learning [14], which can overcome these challenges by learning common knowledge from domains with different feature spaces.
In this paper, we propose a novel deep learning-enabled heterogeneous transfer learning model tailored explicitly for network attack detection in internal networks. In network detection, deep learning can be used for representation learning to automatically discover the features needed for detection, as the collected data features do not directly reflect user behaviors in networks [22]. By applying transfer learning, we try to align the probability distributions of the source and target domains so that the data from both domains can be used in the same model without concept drift [21,23]. The main contributions of this article are summarized as follows:

• Two feature projection networks are built for the source and target domains, transforming heterogeneous feature data into a shared, unified feature space. By learning domain-specific representations, our model effectively mitigates feature space heterogeneity and establishes a foundation for seamless knowledge transfer.
• We employ the maximum mean discrepancy (MMD) technique [24] along with the classification loss as the optimization objective for the model, which forces the alignment of probability distributions between domains during model training. One notable advantage of our proposed model is that MMD computation can leverage the samples' unconditional distribution by utilizing the vast number of unlabeled samples in the target domain, which is common for datasets collected in internal networks.
• Additionally, we apply soft classification to the unlabeled data, using the classification sub-network to compute MMD over classes, thereby aiming to align conditional distributions between domains more effectively [20].
• To validate the effectiveness and generalizability of our approach, we conduct multiple transfer learning tasks between diverse datasets, including the widely used NSL-KDD, UNSW-NB15, and CIC-IDS2017 datasets [25].
Through rigorous experimentation, we demonstrate substantial improvement in cross-domain attack detection accuracy in various learning scenarios, validating the efficacy of our proposed method. Because the proposed method eliminates the requirement for massive labeled data in the target network by transferring knowledge from heterogeneous source networks, it lays a good foundation for the application of deep transfer learning in internal network attack detection.
The remainder of this paper is organized as follows: Section 2 provides an overview of related works in the fields of network attack detection, transfer learning, and deep learning techniques. Section 3 details the methodology and architecture of our proposed model. Section 4 presents the experimental setup and evaluation results, discusses the findings, and analyzes the performance of the model. Finally, Section 5 concludes the paper with a summary of contributions and highlights potential future research directions.

Related Work
In this section, we provide an overview of the related works in the fields of network attack detection with machine learning, deep learning, and transfer learning techniques. We briefly review existing studies that have attempted to address the challenges of network attack detection in internal networks and those that have explored transfer learning to improve model performance in the presence of limited labeled data.

Machine Learning for Network Attack Detection
In response to the limitations inherent in signature-based approaches, the research community has increasingly applied machine learning techniques to analyze system logs and traffic data. The overarching goal is to construct a robust prediction model that can distinguish attacks from normal network behaviors and classify them [26].
Within machine learning-based network attack detection, particular attention has been paid to supervised learning algorithms, owing to their demonstrated capacity for achieving high accuracy. Supervised methods rely on labeled data, which they use in training to fine-tune their predictive capabilities and ultimately detect network attacks accurately [27]. In Ref. [28], the authors eliminated highly correlated features and evaluated three algorithms, i.e., SVM, artificial neural network (ANN), and AdaBoost with decision trees, on the preprocessed dataset. In particular, the AdaBoost model uses decision trees as the weak learner and updates the sample weights using the AdaBoost algorithm. Comparative analysis shows that the AdaBoost model outperforms previous methods such as ANN and SVM.
As deep learning techniques became popular in recent years, they were also applied to network intrusion detection. In Ref. [29], the authors propose to reconstruct the traffic data logs as two-dimensional image features and then apply CNN and CNN-LSTM separately to the image data to perform network intrusion detection. The results are better than those of previous IDS methods, which verifies the efficacy of adopting CNN.
However, a fundamental challenge looms over the adoption of these machine learning-based methods. Their performance is intrinsically linked to the availability of large and carefully labeled datasets, a resource that tends to be in short supply in the complex and dynamic landscape of real-world internal network environments. This scarcity is a significant hurdle for researchers and practitioners who strive to deploy effective machine learning solutions for network attack detection in practical settings. In Ref. [30], the authors proposed an unsupervised deep learning approach for insider threat detection from system logs. They trained deep neural network (DNN) and recurrent neural network (RNN) models to learn normal user behavior and detect anomalies. The models are trained in an online fashion on streaming log data so that they can adapt to changing user patterns. The models output anomaly scores in the 96th percentile, and anomalies can be explained by decomposing the score into contributions from individual features, which significantly reduces analyst workload.

Transfer Learning
Supervised techniques necessitate a substantial amount of labeled data, and gathering such data within an organization's internal network demands significant labor and time. Furthermore, because cyberattacks exhibit diverse patterns, the distribution of network behavior fluctuates, rendering pre-built models ineffective and requiring models to be repeatedly retrained with fresh labeled data. To overcome the scarcity of labeled data in the target domain, transfer learning has emerged as an effective approach. Transfer learning aims to transfer knowledge from a source domain with abundant labeled data to a target domain with limited labeled data [31,32].
Transfer learning approaches are categorized into three classes based on the nature of knowledge transfer: instance-based, model-based, and feature-based. Instance-based methods aim to use data samples from a source domain to enhance the learning process in a target domain. A notable illustration of this approach is the TrAdaBoost framework, which employs a small volume of fresh data to selectively filter out outdated data distributions [33]. This is achieved through iterative updates of sample weights, guided by the predictive errors of a basic learner. Model-based methods, on the other hand, focus on extracting deep learning model parameters that can be effectively shared across different domains. Feature-based methods aim to identify a common latent feature space where the mapped samples from each domain exhibit closely aligned probability distributions. Bukhari et al. explore this approach by selecting covariate invariant features between training and testing datasets and subsequently employing linear discriminant analysis (LDA) for dimensionality reduction [34]. Meanwhile, Pan et al. propose embedding data samples into a reproducing kernel Hilbert space (RKHS) while minimizing the maximum mean discrepancy (MMD) between domains [35]. This process leads to the derivation of a kernel matrix, which, in turn, yields low-dimensional representations of data samples using kernel-PCA. Moreover, the authors also proposed the transfer component analysis (TCA) method, which takes a unified approach to kernel learning, aiming to attain a low-rank representation by minimizing distribution disparities while preserving data variance [36]. This technique offers a comprehensive means of knowledge transfer by combining distribution distance minimization with data variance preservation.

Deep Learning for Transfer Learning
Within transfer learning, deep learning techniques have attracted intense attention as a means to harness knowledge acquired in one domain and apply it to another. This has driven extensive research into the potential of deep learning for knowledge transfer between related domains. The motivation is that the complex, hierarchical representations learned by deep neural networks can help capture the intricacies of diverse domains, thus expanding what is attainable in transfer learning.
One popular approach is to use deep neural networks, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), as feature extractors. These networks learn hierarchical representations of data, which can be fine-tuned for specific tasks in the target domain [37]. One remarkable example of model-based transfer learning is demonstrated by Long et al., who introduce adaptive layers into deep neural networks (DNNs) constructed from source domain data [21]. Subsequently, these adaptive layers are trained using target domain data. The key insight here lies in sharing the structure of the DNN and its weight parameters across domains, thus extracting common knowledge to enhance learning efficiency in the target domain.
Some studies on transfer learning have focused on domain adaptation methods, which attempt to align the source and target domains by reducing domain shifts. Domain adversarial neural networks (DANN) [38] and discrepancy-based domain adaptation [39] are examples of approaches that learn domain-invariant features and align distributions across domains. Other works have explored the use of pre-trained models for transfer learning [37]: deep neural networks trained on large-scale datasets such as ImageNet have been fine-tuned for specific tasks in the target domain, enabling efficient knowledge transfer. While these studies have demonstrated the potential of deep learning in transfer learning, few have explored its application in the context of heterogeneous transfer learning for network attack detection in internal networks.

Transfer Learning in Network Attack Detection
Despite notable advances in transfer learning techniques for network attack detection, there remains a pressing need to address feature space disparities and distribution heterogeneity, particularly within internal networks. Researchers have explored an array of transfer learning methodologies aimed at improving the effectiveness of network attack detection in various scenarios.
One notable effort was undertaken by Zhao et al., who introduced HeMap (heterogeneous mapping [40]) into network attack detection [15]. Expanding upon this idea, they devised a strategy that pre-clusters data samples across the different domains, with the primary aim of mitigating the influence of mismatched samples. However, a critical limitation lies in the cross-domain distance computation, which relies on a heterogeneous feature space transformation based on principal component analysis (PCA). This approach fails to accurately reflect the genuine distance between cross-domain samples, which leads to suboptimal results.
In a parallel vein, our previous work introduced an alternative approach. In this method, linear projection was employed for its computational simplicity to transform heterogeneous data into a shared latent space [41]. This transformation was followed by the application of maximum mean discrepancy (MMD) to align the distributions across domains. While this approach offered certain advantages, such as computational efficiency, its overall performance remained constrained by its reliance on linear projection, which could not fully capture the complex relationships within the data. More advanced techniques for finding an effective shared feature space could help overcome heterogeneity issues. In addition, distribution alignment methods tailored to the characteristics of network traffic data could better match distributions. Exploring nonlinear projections and more domain-specific alignment metrics are promising directions for improving transfer learning in this application.

System Design and Methods
In light of the existing research gaps, our proposed model addresses the feature space and distribution heterogeneity by utilizing deep learning-enabled heterogeneous transfer learning.
To illustrate the deep transfer learning model, we first define the notation for the datasets. A transfer learning domain is defined as a dataset together with its probability distribution. Thus, the source domain is defined as

$$\mathcal{D}_s = \{X_s, P_s\},$$

where the dataset $X_s$ has $d_s$-dimensional features with $C$ classes and $P_s$ is the associated probability distribution of the dataset. The dataset consists of $n_s$ data samples, which is denoted as $X_s = \{x_i^s\}_{i=1}^{n_s}$ with labels $Y_s = \{y_i^s\}_{i=1}^{n_s}$. In particular, we consider a binary classification task, wherein the class label takes values in $\{0, 1\}$. Similarly, the target domain is defined as

$$\mathcal{D}_t = \{X_t, P_t\},$$

where the dataset $X_t$ has $d_t$-dimensional features with $C$ classes and $P_t$ is the associated probability distribution of the dataset. The dimension of the feature space might differ from that of the source domain, i.e., $d_t \neq d_s$, which hinders the direct alignment of the probability distributions of the two domains. Even if the dimensions happen to be the same, there is still a large chance that the original features have different meanings in the two domains. In order to reflect the scarcity of labeled data in the target domain, the data samples in the target domain dataset are further divided into a labeled subset $X_L$ with $n_l$ samples and an unlabeled subset $X_U$ with $n_u$ samples. Note that $n_l \ll n_u$, meaning the majority of the target domain data are unlabeled. Thus, the target data samples are denoted as $X_t = [X_L; X_U]$, where $X_L = \{x_i^l\}_{i=1}^{n_l}$ has labels $Y_l = \{y_i^l\}_{i=1}^{n_l}$ and $X_U = \{x_i^u\}_{i=1}^{n_u}$ is unlabeled.

The proposed method only addresses heterogeneous transfer learning tasks with heterogeneous feature spaces; the learning task (the classes of the data) must be the same across domains. That is to say, the two datasets should share the same labels (as denoted above, both datasets have labels of $C$ classes). Heterogeneous transfer learning with both heterogeneous feature spaces and heterogeneous learning tasks is out of the scope of this article and should be investigated in future research.
To effectively tackle the lack of labeled training data, our approach makes the most of labeled datasets from the external internet. At the same time, we incorporate a small yet invaluable portion of labeled data along with the bulk of unlabeled data available from the internal network. The two heterogeneous sources of data are combined by devising a deep neural network model and formulating the problem of network attack detection as a binary classification task. In particular, by employing a feature projection network for each data source, the heterogeneous feature spaces are transformed into a common latent space. The transfer of attack detection knowledge from external networks to internal networks then becomes possible by minimizing the maximum mean discrepancy (MMD).

Network Architecture Design
To address the problems of feature space heterogeneity and probability distribution misalignment, we introduce the deep learning-enabled heterogeneous transfer learning framework, depicted in Figure 1. Due to the inherent complexity of real-world data, linear projections or shallow networks may be inadequate for capturing the intricate relationships within the data. Thus, we adopt a distinct approach by employing separate deep networks $f_s(\cdot)$ and $f_t(\cdot)$ to convert data from the source and target domains into a shared feature space, leveraging the enhanced capacity for nonlinear feature transformation offered by deep neural networks.
Then, the transformed labeled data from both the source and target domains serve as input for the classification network $f_c(\cdot)$. This ensures that the resulting model can discern patterns and make predictions on data originating from the target domain, a crucial requirement for successful transfer learning. Simultaneously, the transformed data are used to calculate the maximum mean discrepancy (MMD), a statistical metric for aligning the probability distributions across the two domains. By reducing distribution discrepancies, MMD facilitates the integration of unlabeled data from the target domain into the classification task, further enhancing the model's capability to make informed predictions based on this previously untapped data source. This dual-pronged approach combines the strengths of deep neural networks and statistical alignment techniques to make full use of the data from both domains in the classification task.

Feature Projection Networks
In order to manage the heterogeneity of the feature spaces of the source and target domains, we build two distinct feature projection networks, $f_s(\cdot)$ and $f_t(\cdot)$, one tailored to each domain. Each network's input is drawn from the corresponding dataset, i.e., $X_s$ for $f_s(\cdot)$ and $X_t = [X_L; X_U]$ for $f_t(\cdot)$, and the projected output is given by $\tilde{X}_s = f_s(X_s)$, $\tilde{X}_l = f_t(X_L)$, and $\tilde{X}_u = f_t(X_U)$. Then, the labeled projected data $[\tilde{X}_s; \tilde{X}_l]$ together with their corresponding labels $\tilde{Y} = [Y_s; Y_l]$ are sent to the classification network, where $Y_s$ and $Y_l$ are the labels of the source samples and of the labeled target samples, respectively. At the same time, all the transformed data are used to compute the maximum mean discrepancy (MMD) in order to align the probability distributions of the projected data from the two domains.
These dedicated networks both learn domain-specific representations and transform the input feature data into a unified, shared feature space. This design is intended to alleviate the detrimental effects of differences in feature spaces and thereby safeguard the model's overall performance. Each feature projection network comprises a cascade of fully connected layers, each with its own set of learnable parameters. This hierarchical arrangement allows the networks to progressively learn increasingly abstract and discriminative representations of the input data, so that the characteristics of the feature space peculiar to each domain are captured.
The output of the feature projection networks is the transformed feature data. These transformed data lie in the shared feature space alongside their counterparts from the other domain. This transformation bridges the feature space gap between the source and target domains, laying the foundation for the effective transfer of knowledge across domains.
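The projection step can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the layer widths, the ReLU activation, the initialization, and the feature dimensions `d_s`, `d_t`, and `d_shared` are all hypothetical choices made only to show that two networks with different input sizes can emit vectors in one shared space.

```python
import math
import random


def init_mlp(layer_sizes, rng):
    """Create (weights, biases) for each fully connected layer."""
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params


def forward(params, x):
    """Project one sample; ReLU after every layer except the last."""
    h = list(x)
    for idx, (weights, biases) in enumerate(params):
        h = [sum(w * v for w, v in zip(row, h)) + b
             for row, b in zip(weights, biases)]
        if idx < len(params) - 1:
            h = [max(0.0, v) for v in h]
    return h


rng = random.Random(0)
d_s, d_t, d_shared = 41, 42, 16            # hypothetical dimensions
f_s = init_mlp([d_s, 32, d_shared], rng)   # source projection network
f_t = init_mlp([d_t, 32, d_shared], rng)   # target projection network

x_source = [rng.random() for _ in range(d_s)]
x_target = [rng.random() for _ in range(d_t)]
z_source = forward(f_s, x_source)          # projected source sample
z_target = forward(f_t, x_target)          # projected target sample
```

Although the two inputs have different lengths, `z_source` and `z_target` have the same length, so they can be compared or fed to a single classifier.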

Classification Network
The transformed features from the feature projection networks are fed into the classification network, $f_c(\cdot)$, which is responsible for predicting the corresponding labels. The classification network extracts meaningful patterns from the transformed features and makes the predictions. It consists of several fully connected layers and a softmax layer, with cross-entropy loss as the classification loss function. The classification loss is formulated as

$$L_{cls} = L\big(f_c(\tilde{X}), \tilde{Y}\big),$$

where $\tilde{X} = [\tilde{X}_s; \tilde{X}_l]$ consists of the transformed labeled data samples from the two domains, $\tilde{Y} = [Y_s; Y_l]$ is the corresponding labels, $f_c(\cdot)$ is the classification network, and $L(\cdot)$ is the cross-entropy loss function. By using the transformed features, rather than the original data, as input to the classification network, the model can benefit from the aligned feature representations and generalize better in the target domain with limited labeled data.
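As a rough illustration of this loss, the sketch below computes the mean softmax cross-entropy over labeled samples pooled from both domains. The logits are made-up numbers standing in for the outputs of the classification network, and averaging over samples is an assumption; the text does not state whether the loss is summed or averaged.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class."""
    return -math.log(max(probs[label], 1e-12))


def classification_loss(logits_batch, labels):
    """Mean cross-entropy over all labeled samples (averaging is assumed)."""
    losses = [cross_entropy(softmax(z), y)
              for z, y in zip(logits_batch, labels)]
    return sum(losses) / len(losses)


# Hypothetical classifier outputs for projected labeled samples.
logits_source = [[2.0, -1.0], [-0.5, 1.5]]   # source domain
logits_target = [[0.3, 0.1]]                 # the few labeled target samples
labels_source = [0, 1]
labels_target = [1]

# Labeled data from both domains are concatenated before the loss.
loss = classification_loss(logits_source + logits_target,
                           labels_source + labels_target)
```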

Distribution Alignment
To further mitigate the challenges posed by distribution heterogeneity between the source and target domains, we incorporate maximum mean discrepancy (MMD), which is designed to align the probability distributions of data in both domains and thus enhance the model's performance in a transfer learning context. MMD is a non-parametric metric for quantifying the dissimilarity between probability distributions. Here, it measures the extent of divergence between the distributions of the transformed data originating from the source and target domains.
What sets our approach apart is the use of not only labeled samples but also the substantial pool of unlabeled data instances in the target domain. This strategy allows us to use the wealth of unlabeled data to compute the cross-domain MMD, leveraging a vast and previously untapped resource. During training, a key objective is to minimize the MMD value. This optimization criterion compels the model to align the probability distributions of the two domains. Through this alignment, the classification model is better placed to generalize to unlabeled target data samples. Aligning the distributions across domains is what allows the model to transfer knowledge and adapt to the target domain.
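The MMD is calculated as the distance between the two centroids corresponding to the two datasets, which is expressed as

$$\mathrm{MMD}(\tilde{X}_s, \tilde{X}_t) = \left\| \frac{1}{n_s}\sum_{i=1}^{n_s} \tilde{x}_i^s - \frac{1}{n_t}\sum_{i=1}^{n_t} \tilde{x}_i^t \right\|^2,$$

where $\tilde{x}_i^s \in \tilde{X}_s$ and $\tilde{x}_i^t \in \tilde{X}_t$ are the transformed samples output by the corresponding projection networks, and $n_t = n_l + n_u$ counts both the labeled and the unlabeled target samples.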
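A minimal sketch of this centroid-based MMD follows, assuming the distance is the squared Euclidean norm of the difference between the two domain means; the helper names and the toy samples are illustrative only.

```python
def centroid(samples):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]


def linear_mmd(source, target):
    """Squared Euclidean distance between the two domain centroids."""
    c_s, c_t = centroid(source), centroid(target)
    return sum((a - b) ** 2 for a, b in zip(c_s, c_t))


# Toy projected samples; the target list pools labeled and unlabeled data.
projected_source = [[0.0, 1.0], [2.0, 3.0]]
projected_target = [[1.0, 1.0], [1.0, 5.0], [4.0, 3.0]]
mmd = linear_mmd(projected_source, projected_target)
```

Here the source centroid is (1, 2) and the target centroid is (2, 3), so `mmd` evaluates to 2.0. No labels are needed, which is why unlabeled target samples can take part in this term.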
To align the distributions more effectively, the conditional distributions between domains can also be aligned by minimizing the centroid distance between corresponding classes across domains. Though only a few labeled samples are available in the target domain, we can apply a pseudo classification to the unlabeled samples by reusing the classification network $f_c(\cdot)$. By using the pseudo classification for the unlabeled data, we compute the MMD over classes, i.e.,

$$\mathrm{MMD}_c = \sum_{k=1}^{C} \left\| \frac{1}{n_s^k}\sum_{i=1}^{n_s^k} \tilde{x}_{k,i}^s - \frac{1}{n_l^k + n_u^k}\left( \sum_{i=1}^{n_l^k} \tilde{x}_{k,i}^l + \sum_{i=1}^{n_u^k} \tilde{x}_{k,i}^u \right) \right\|^2,$$

where $\tilde{x}_{k,i}^s \in \tilde{X}_s$ and $\tilde{x}_{k,i}^l \in \tilde{X}_l$ are the labeled projected data samples belonging to the $k$th class of the source and target domain, while $\tilde{x}_{k,i}^u \in \tilde{X}_u$ is a data sample of the $k$th pseudo class of the target domain. Correspondingly, $n_s^k$, $n_l^k$, and $n_u^k$ are the numbers of samples of the $k$th class. This approach helps to capture underlying similarities and differences between the classes, contributing to the overall improvement in transfer learning performance.
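The class-wise computation can be sketched as below, using hard pseudo labels for the unlabeled target samples. This is an illustrative reading rather than the authors' code: the squared Euclidean distance, the pooling of labeled and pseudo-labeled target samples into one centroid per class, and the rule of skipping classes absent from either domain are assumptions.

```python
def centroid(samples):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def classwise_mmd(source, y_source, target_l, y_l, target_u, pseudo_u,
                  num_classes=2):
    """Sum over classes of the distance between per-class centroids."""
    total = 0.0
    for k in range(num_classes):
        src_k = [x for x, y in zip(source, y_source) if y == k]
        tgt_k = [x for x, y in zip(target_l, y_l) if y == k]
        # Unlabeled samples join class k when their pseudo label is k.
        tgt_k += [x for x, y in zip(target_u, pseudo_u) if y == k]
        if src_k and tgt_k:  # skip classes missing from either domain
            total += sq_dist(centroid(src_k), centroid(tgt_k))
    return total


source = [[0.0, 0.0], [2.0, 2.0]]
y_source = [0, 1]
target_l = [[1.0, 0.0]]                  # few labeled target samples
y_l = [0]
target_u = [[1.0, 2.0], [3.0, 2.0]]      # unlabeled target samples
pseudo_u = [0, 1]                        # hypothetical classifier guesses
mmd_classes = classwise_mmd(source, y_source, target_l, y_l,
                            target_u, pseudo_u)
```

For class 0 the centroids are (0, 0) and (1, 1), giving 2; for class 1 they are (2, 2) and (3, 2), giving 1; so `mmd_classes` is 3.0.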
Instead of assigning hard labels according to the pseudo classification result, soft labeling assigns a probability distribution over classes to each unlabeled sample, which results in a more stable iteration process and avoids negative transfer. At the initial stage, the untrained classification network makes random guesses for unlabeled data, so minimizing the conditional distribution distance has little effect. As the iterations proceed, the accuracy of the classification network on unlabeled samples improves, which strengthens the minimization of the conditional distribution distance. This is why the soft labeling scheme is preferable to hard labeling. Furthermore, we can introduce a weight for the soft labels that increases over the iterations, i.e., $w_r = r/R$, where $R$ is the total number of iteration stages and $r$ is the current stage number. With soft labeling and the iteration weight introduced into the conditional distribution distance, the MMD over classes is rewritten so that each unlabeled target sample contributes to every class centroid in proportion to its predicted class probability, scaled by $w_r$.

The Optimization Objective of the Transfer Learning Network
The overall loss function of our proposed model consists of the classification loss and the MMD-based distribution alignment loss (including the MMD loss for both the unconditional and conditional distributions). The optimization objective is to jointly minimize the classification loss on labeled data and the distribution distance in terms of MMD. Therefore, the optimization objective can be expressed as

$$\min_{f_s, f_t, f_c} \; L_{cls} + \alpha\,\big(\mathrm{MMD} + \mathrm{MMD}_c\big),$$

where $L_{cls}$ is the classification loss on the labeled data, $\mathrm{MMD}$ and $\mathrm{MMD}_c$ are the unconditional and class-conditional distribution distances, and $\alpha$ is a coefficient that adjusts the relative importance of the classification accuracy on labeled data and the distribution alignment across domains. During training with stochastic gradient descent-based methods, we iteratively update the parameters of the feature projection networks and the classification network.
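The soft-label weighting and the combined objective can be sketched as follows. The exact form of the soft-label centroid is an assumption: here labeled target samples of class k count with weight 1 and each unlabeled sample counts with weight w_r times its predicted probability for class k. Applying a single coefficient `alpha` to the sum of both MMD terms is likewise assumed. All function names are hypothetical.

```python
def ramp_weight(r, R):
    """Soft-label weight w_r = r / R, growing with the iteration stage."""
    return r / R


def soft_class_centroid(labeled, labels, unlabeled, probs, k, w_r):
    """Target centroid of class k with down-weighted soft labels.

    Labeled samples of class k have weight 1; each unlabeled sample has
    weight w_r * p(k | x). Returns None if the class has no mass.
    """
    dim = len((labeled or unlabeled)[0])
    acc, mass = [0.0] * dim, 0.0
    for x, y in zip(labeled, labels):
        if y == k:
            acc = [a + v for a, v in zip(acc, x)]
            mass += 1.0
    for x, p in zip(unlabeled, probs):
        weight = w_r * p[k]
        acc = [a + weight * v for a, v in zip(acc, x)]
        mass += weight
    return [a / mass for a in acc] if mass > 0 else None


def total_loss(cls_loss, mmd, mmd_classes, alpha):
    """Classification loss plus alpha-weighted alignment terms."""
    return cls_loss + alpha * (mmd + mmd_classes)


w_r = ramp_weight(5, 10)                     # halfway through training
centroid_k0 = soft_class_centroid(
    labeled=[[0.0, 0.0]], labels=[0],
    unlabeled=[[2.0, 2.0]], probs=[[0.5, 0.5]],
    k=0, w_r=w_r)
objective = total_loss(cls_loss=0.5, mmd=2.0, mmd_classes=3.0, alpha=0.1)
```

Early in training `w_r` is close to zero, so unreliable pseudo labels barely move the class centroids; by the final stage they contribute with their full predicted probability.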

Performance Evaluation
In this section, we describe the experimental setup and evaluation process used to assess the effectiveness of our proposed deep learning-enabled heterogeneous transfer learning model for network attack detection. We conduct multiple transfer learning tasks between diverse datasets and present the results to validate the model's performance in various learning scenarios.

Datasets
We perform our experiments on three widely used and publicly available network intrusion detection datasets: NSL-KDD, UNSW-NB15, and CIC-IDS2017.

1. NSL-KDD (NSL-KDD Cup'99 Dataset) is a widely used dataset in the field of network intrusion detection and security. It is an improved version of the original KDD Cup '99 dataset, designed to address some shortcomings of the latter, such as redundancy and unrealistic traffic patterns. NSL-KDD contains a large collection of network traffic data, including both normal traffic and various types of malicious activities (i.e., DoS, Probe, R2L, and U2R attacks), making it a valuable resource for training and evaluating intrusion detection systems.

| Dataset | Attack type | Description |
|---|---|---|
| NSL-KDD | DoS | Involves overwhelming a network or system to disrupt its services. |
| NSL-KDD | Probe | Attackers attempt to gather information about the target network without direct exploitation. |
| NSL-KDD | U2R | Attackers exploit vulnerabilities to gain unauthorized access and escalate privileges. |
| NSL-KDD | R2L | Attackers attempt to connect to a local system remotely without proper credentials. |
| UNSW-NB15 | Fuzzers | Aimed at testing software vulnerabilities through unexpected inputs. |
| UNSW-NB15 | Analysis | Techniques for gathering information about target systems. |
| UNSW-NB15 | Backdoors | Unauthorized access methods left by attackers. |
| UNSW-NB15 | DoS | Flooding a system to disrupt its services. |
| UNSW-NB15 | Exploits | Attacks exploiting known vulnerabilities. |
| UNSW-NB15 | Generic | General or unspecified attacks. |
| UNSW-NB15 | Reconnaissance | Preparing for future attacks by gathering information. |
| UNSW-NB15 | Shellcode | Malicious code for executing arbitrary commands. |
| CIC-IDS2017 | DoS | Flooding the target with traffic to disrupt services. |
| CIC-IDS2017 | PortScan | Scanning target ports to find potential vulnerabilities. |
| CIC-IDS2017 | DDoS | Distributed denial of service attacks from multiple sources. |
| CIC-IDS2017 | Patator | Brute-force attacks against SSH and FTP services. |
| CIC-IDS2017 | Web Attack | Attacks targeting web applications and services. |
| CIC-IDS2017 | Botnet | Activities related to botnets, including command and control traffic. |
| CIC-IDS2017 | Infiltration | Unauthorized access and data exfiltration attempts. |

Transfer Learning Tasks
We implement our proposed model using the PyTorch deep learning framework. We perform two groups of transfer learning tasks to demonstrate the efficacy of our model in different learning task settings and to compare it with related transfer learning methods.

UNSW-NB15 to CIC-IDS2017 Transfer Learning
To demonstrate the efficacy of the proposed transfer learning framework, we apply it to multiple scenarios with two datasets. Specifically, we use the UNSW-NB15 dataset as the source domain for training and transfer the knowledge to the CIC-IDS2017 dataset as the target domain under different attack scenarios. Based on a grid search over the hyper-parameters, we set the number of hidden layers of the feature transform networks for the source and target domains to 2 and 3, respectively, the dimension of the common feature space to 128, and the number of hidden layers of the classification network to 4. ReLU is used as the activation function for all hidden layers. The training process includes four epochs.
According to the attack characteristics summarized in Table 1, we divide the data samples of similar attack types in the source and target datasets into several groups to simulate multi-scenario cross-domain transfer learning. In each group of the source and target datasets, data labeled as normal/benign are negative samples, while all others are positive samples. The transfer learning settings are summarized in Table 2. We employ several standard evaluation metrics for binary classification, including accuracy, precision, recall, and F1-score. We present the results of our experiments in Table 3, including the performance metrics for each transfer learning task (identified by the TaskID (#)). In the various transfer learning scenarios, the prediction accuracy is consistently high, i.e., achieving 98%, and the recall reaches as high as 93%. On the other hand, the precision and F1-score only reach a medium level (above 43% and 59%, respectively). The performance metrics are defined in terms of the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) as

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP},
$$

$$
\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
$$

Given these definitions, we can conclude that the method works well for attack detection, i.e., it detects most of the true positive samples, whereas a relatively large share of the samples flagged as attacks are in fact normal samples (false positives). A likely reason for the many false positives is that the datasets are extremely imbalanced, that is to say, attack instances make up only a minor portion of the communication sessions in those datasets. To mitigate this issue, we may up-sample the attack instances to reduce the imbalance. In addition, we can increase the weight of attack instances during the training stage.
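The interplay between class imbalance and the four metrics can be made concrete with a small worked example. The sketch below uses hypothetical confusion counts (not taken from Table 3) chosen only to resemble the reported pattern, and the inverse-frequency weighting shown is one common heuristic for re-weighting attack instances, not necessarily the scheme the authors would adopt.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def inverse_frequency_weights(class_counts):
    """Per-class loss weights N / (K * n_k): rarer classes get larger weights."""
    n_total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {name: n_total / (n_classes * count)
            for name, count in class_counts.items()}


# Hypothetical test set: 100 attack sessions among 10,000 sessions in total.
metrics = binary_metrics(tp=93, fp=120, tn=9780, fn=7)
weights = inverse_frequency_weights({"normal": 9900, "attack": 100})
```

With these hypothetical counts, accuracy is about 0.987 and recall is 0.93, yet precision is only about 0.437 and F1 about 0.594, even though just 120 of the 9,900 normal sessions (about 1.2%) are misclassified. When attacks are rare, even a small fraction of misclassified normal traffic can outnumber the true detections.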
While tuning the hyper-parameters, we compared the results obtained with several variants of the ReLU activation function, i.e., Leaky ReLU, Parametric ReLU, and ELU, to determine whether the model is affected by the "dead neuron" problem caused by negative inputs. The results are shown in Figure 2.
Although Leaky ReLU, Parametric ReLU, and ELU introduce mechanisms that avoid zero gradients for negative inputs, the results show that using ReLU does not degrade the model's performance. Hence, in this experiment, it is safe to use ReLU as the activation function. However, for wider application of the learning model, these modified ReLU variants usually offer better stability in terms of the model's performance.
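The difference between these activation functions for negative inputs can be illustrated with scalar definitions. This is a minimal sketch of the standard formulas; the default slope of 0.01 and α = 1.0 are common conventions assumed here, not values reported in the experiments. Parametric ReLU has the same form as Leaky ReLU, except that the slope is learned during training.

```python
import math


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Zero gradient for x <= 0: a unit stuck there stops learning ("dead neuron").
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x


def leaky_relu_grad(x, slope=0.01):
    # A small but non-zero gradient survives for negative inputs.
    return 1.0 if x > 0 else slope


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def elu_grad(x, alpha=1.0):
    # Smoothly decaying, strictly positive gradient for negative inputs.
    return 1.0 if x > 0 else alpha * math.exp(x)
```

For a negative pre-activation such as `x = -2.0`, `relu_grad` returns zero while the two variants return a positive gradient, which is the mechanism by which they keep such units trainable.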

NSL-KDD to UNSW-NB15 Transfer Learning
In this task, we train the model on the NSL-KDD dataset (source domain) with abundant labeled samples and transfer the knowledge to the UNSW-NB15 dataset (target domain) with limited labeled data. In contrast to the previous experiments, because the NSL-KDD dataset is relatively small, we configure only one hidden layer for the source and target feature transform networks, and the classification network has three hidden layers. The dimension of the common feature space, i.e., the number of units in the output layer of the feature transform networks, is set to 256. The training process includes three epochs, and each epoch contains multiple iterations, the number of which depends on the batch size (set to 1024) and the number of samples in the dataset.
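The relation between dataset size, batch size, and the number of iterations per epoch can be sketched as follows. The sample count used here is hypothetical, and whether the last incomplete batch is kept or dropped is an assumption, since the text does not say.

```python
import math


def iterations_per_epoch(n_samples, batch_size, drop_last=False):
    """Number of mini-batch updates in one pass over the data."""
    if drop_last:
        return n_samples // batch_size  # discard the final partial batch
    return math.ceil(n_samples / batch_size)  # keep the final partial batch


# Hypothetical dataset of 100,000 samples with the batch size from the text.
per_epoch = iterations_per_epoch(100_000, 1024)
total_updates = 3 * per_epoch  # three epochs, as in this experiment
```

Under these assumptions, a larger dataset or a smaller batch size increases the number of parameter updates carried out within the same number of epochs.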
To validate the proposed method (denoted as dhetl), we compare it with several methods mentioned in Ref. [41], including: (a) The hemap method, which employs linear projection to transform the different source and target feature spaces into a shared latent space while concurrently minimizing projection errors and sample distances across domains. (b) The hetl method [15], which is similar to hemap but clusters the target data before each iteration. (c) The base approach, which trains directly on source domain data and then applies the trained model to predict target domain data, after both source and target data have been transformed into a shared feature space through principal component analysis (PCA). (d) The hemmd method [41], which is similar to hemap but minimizes the cross-domain distribution distance measured by MMD.
In total, seven transfer learning tasks are constructed. In each task, data samples belonging to the normal class and one attack class are selected from the NSL-KDD and UNSW-NB15 datasets to represent the source and target domain data, respectively. To compare with existing methods, we use accuracy, a standard evaluation metric for binary classification tasks. The results are shown in Table 4. From the results, we can see that the proposed dhetl has the highest prediction accuracy in the given transfer learning scenarios. Apart from dhetl, hemmd has the highest accuracy among the other methods, as analyzed in [41]. Compared with hemmd, the main improvement of dhetl is attributed to the nonlinear projection of the feature spaces to the common space, which is more expressive than the linear projection in hemmd. In addition, dhetl trains the network by optimizing the classification loss, which is directly related to the learning task, whereas hemap optimizes a projection loss. Furthermore, the compared methods utilize only part of the data in the dataset to optimize their models, while the proposed method uses all available data for training. These results indicate that the proposed method outperforms the compared methods in the given scenarios.

Conclusions
In this paper, we have proposed a deep learning-enabled heterogeneous transfer learning model for network attack detection in internal networks. Through feature transformation and a training procedure that minimizes the classification loss and aligns the probability distributions, we obtain a model that achieves the highest detection accuracy among the compared methods.
We also find that, although the model achieves high detection accuracy for attack instances, the mis-detection rate for normal instances remains at a moderate level. The reason might be that both the source and target datasets are highly imbalanced. Hence, in future work, we will investigate ways to address this imbalance in order to further enhance the performance of the proposed deep learning-enabled heterogeneous transfer learning model.

Figure 1. Deep network architecture for heterogeneous transfer learning.

Figure 2. Influence of ReLU activation functions on model training.
DDoS, and Port Scans. It offers a wide variety of network traffic scenarios, including both benign and malicious traffic, across different network protocols. This dataset is particularly valuable for researchers and practitioners working on cybersecurity, as it helps in the development and assessment of effective intrusion detection and prevention mechanisms. We summarize the main attack types in each dataset in Table 1 so that we can choose similar attack types from two datasets to simulate the cross-domain transfer learning task.

Table 1. Summary of the attack types in each dataset.

Table 3. Performance metrics for transfer learning from UNSW-NB15 to CIC-IDS2017.

Table 4. The accuracy of cross-domain network attack detection.