End-to-End Network Intrusion Detection Based on Contrastive Learning

The network intrusion detection system (NIDS) plays a crucial role as a security measure in addressing the increasing number of network threats. The majority of current research relies on feature-ready datasets that heavily depend on feature engineering. Conversely, the increasing complexity of network traffic and the ongoing evolution of attack techniques lead to a diminishing distinction between benign and malicious network behaviors. In this paper, we propose a novel end-to-end intrusion detection framework based on a contrastive learning approach. We design a hierarchical Convolutional Neural Network (CNN) and Gated Recurrent Unit (GRU) model to facilitate the automated extraction of spatiotemporal features from raw traffic data. The integration of contrastive learning amplifies the distinction between benign and malicious network traffic in the representation space. The proposed method exhibits enhanced detection capabilities for unknown attacks in comparison to the approaches trained using the cross-entropy loss function. Experiments are carried out on the public datasets CIC-IDS2017 and CSE-CIC-IDS2018, demonstrating that our method can attain a detection accuracy of 99.9% for known attacks, thus achieving state-of-the-art performance. For unknown attacks, a weighted recall rate of 95% can be achieved.


Introduction
The past two decades have witnessed the rapid development of network technology.Especially in recent years, the application of 5G technology has landed, making it possible to connect everything.In 2020, the number of active Internet of Things (IoT) devices worldwide reached 15.1 billion, and this number is expected to grow to 29.42 billion by 2030 (https://www.statista.com/statistics/1101442/iot-number-of-connected-devicesworldwide/,accessed on 21 March 2024).Accordingly, cyber threats are becoming more serious.The network intrusion detection system (NIDS) is one of the most important types of security devices in cyberspace and has been widely used in safeguarding various information systems.A NIDS continuously monitors incoming and outgoing network traffic, analyzing it to detect cyberattacks [1].
In general, network intrusion detection methods can be broadly classified into two main categories: signature-based IDS and anomaly-based IDS [2][3][4][5].Signature-based IDS involves the generation of signatures for known attacks, which are then compared against incoming data instances [6][7][8].These methods offer the advantages of high detection precision and a low false alarm rate; however, they may be limited in their ability to detect novel attacks.On the other hand, anomaly-based IDS focuses on detecting deviations from expected behaviors by constructing models to assess benign behavior and issuing warnings when a given instance deviates significantly from typical behavior beyond a predefined threshold.
Anomaly-based methods have garnered increased attention due to their superior capabilities in detecting unknown attacks.A significant trend in anomaly-based IDS is the utilization of machine learning for its robust representation capabilities.As network traffic becomes increasingly complex, a primary challenge lies in designing a general and stable representation model that can effectively distinguish anomalies from benign traffic.
Firstly, relying on feature-ready datasets limits the generalizability and feasibility of machine learning-based approaches, which often require feature engineering for typical attacks.Traditional machine learning models, such as support vector machines, decision trees, and K-nearest neighbors, are trained to acquire detection capabilities using these features [9][10][11].Deep learning has also been applied to construct IDSs in recent years.However, many methods still depend on feature-ready datasets [5,[12][13][14].The advantages of building models for feature-ready data are semantically interpretable and easy to train.However, the cost of feature engineering also significantly increases with the number of attack types.The designed features also lose some information.Therefore, some methods attempt to construct automatic feature extraction models [15,16] for the raw traffic.We prefer automatically extracting features from raw traffic because such methods can fully leverage the powerful feature extraction capabilities of deep learning and are more feasible for new types of attacks.
Secondly, the boundary between benign and malicious traffic is becoming increasingly difficult to determine.Attackers are continuously upgrading malware to alter their communication behaviors in order to bypass detection.New attack variants will likely be challenging to detect using previous features, necessitating new feature engineering.Deep learning-based approaches also encounter the issue of adversarial samples.Attackers can treat NIDS as a black-box system and continuously query and adjust the attack traffic features based on feedback until they successfully bypass the NIDS [17,18].In order to enhance the capability of IDS to differentiate between benign and malicious traffic, it is crucial to develop models that accentuate the distinction between benign and malicious samples.
In this paper, we propose CHCG, a contrastive learning-based end-to-end intrusion detection framework with hierarchical Convolutional Neural Networks (CNN) and Gate Recurrent Unit (GRU) networks to extract features from raw traffic.CHCG has the capability to directly process network traffic PCAP data, eliminating the need for supplementary feature engineering.The raw bytes in the packet sequence are used as the input.The bytes of a single packet are converted into a two-dimensional matrix, from which the spatial feature vector is then extracted using CNN.The spatial feature vectors of all packets in the sequence are composed into a vector sequence.The temporal features are subsequently extracted using GRU to produce a representation that encompasses both spatial and temporal features.Contrastive learning encourages the model to encode benign traffic into a compact cluster within the representation space and amplifies the distinction between benign and malicious representations.This, in turn, helps widen the margin of the decision boundary.Finally, the cosine similarity between the representations of the testing and benign traffic is used as the basis for detection.Experiments show that the proposed method exhibits state-of-the-art performance and high accuracy against unknown attacks.Our main contributions are summarized as follows: 1. We design a hierarchical model that combines CNN and GRU networks for the autonomous extraction of spatiotemporal features from raw traffic data.Furthermore, an embedding layer is integrated to enhance the representation dimension of the protocol header bytes, which contain more important information.2. We propose an intrusion detection framework based on contrastive learning.Contrastive loss can offer a wider decision boundary margin compared to cross-entropy loss.Using cosine similarity with benign traffic representations as the basis of detection can lead to higher accuracy and a lower false alarm rate.Moreover, unknown attacks can be detected.3. We introduce the granularity of IP pairs for packet collection.Compared with flow-level, IP pair granularity can present inter-flow characteristics, which has ad-vantages for discovering attack traffic that looks normal in a single flow, such as Dos and PortScan.4. The proposed framework undergoes evaluation using widely utilized datasets, namely, CIC-IDS2017 [19] and CSE-CIC-IDS2018 [20].The results demonstrate that our approach can attain a detection accuracy of 99.9% for known attacks, thus achieving state-of-the-art performance.For unknown attacks, a weighted recall rate of 95% can be achieved.
The rest of paper is organized as follows: Section 2 introduces related works.Section 3 describes the methodology used in our research.Section 4 demonstrates the experiments using CIC-IDS2017 and CSE-CIC-IDS2018 datasets.Section 5 concludes this paper.

Related Work
This section presents the related works on intrusion detection in recent years and mainly focuses on deep learning methods.
Some studies have concentrated on the development of novel features or algorithms for feature selection.Mirsky et al. proposed Kitsune, an online network IDS [21].Kitsune utilizes a set of autoencoder neural networks with unsupervised learning to differentiate between benign and malicious network traffic.When packets are captured, Kitsune extracts their statistical features and reconstructs them with several auto-encoders.The reconstruction error distinguishes between normal and malicious traffic patterns.Similarly, Li et al. proposed an autoencoder-based IDS and applied feature selection and feature grouping algorithms [12].The reconstruction error of autoencoders was used to detect anomalies.However, they only considered the CSE-CIC-IDS2018 dataset, so the evaluation was insufficient.Xu et al. proposed an IDS with GRU networks, multilayer perception (MLP), and softmax module, achieving an overall detection rate 99.42% on the KDD 99 dataset and 99.31% on the NSL-KDD dataset [22].Alzubi et al. combined CNN and LSTM to identify hidden patterns of data, achieving improved accuracy on the CSE-CIC-IDS2018 dataset [14].Wang et al. designed a feature selection approach and generated 143 features for encrypted traffic.They also proposed a detection framework with long short-term memory (LSTM), ResNet, and XGBoost [23].The quality of features dramatically influences the performance of the aforementioned methods, indicating that new attacks may be difficult to detect.
End-to-end methods are becoming a hotspot for fully leveraging deep neural networks, which directly learn features from raw traffic data.Wang et al. proposed an encrypted classification method using one-dimensional CNN [24].The raw bytes of a session or flow are input into the network for automatic feature extraction.They validated the method using the ISCX VPN-nonVPN dataset and demonstrated its effectiveness.Their other work combines CNN and LSTM to learn the packet-level spatial features and the flow-level temporal features [15].The hierarchical spatiotemporal features have better performance in characterizing network traffic behaviors.This network structure is also utilized in other studies.Lin et al. proposed TSCRNN, an encrypted traffic classification scheme that extracts abstract spatial features using CNN and temporal features using stacked bidirectional LSTM [25].Differently, they divided all raw bytes by 255 for normalization, which is more efficient than the one-hot encoding in Wang et al.'s work.Shapira et al. proposed FlowPic [26], transforming packet sizes into grayscale images and utilizing CNN for classification.FlowPic has an advantage in processing speed, but the amount of information extracted from the packet is insufficient for handling complex traffic.PBCNN [16] also adopts a hierarchical structure that utilizes CNN to extract abstract features of individual packets and packets within a flow.Generally, a hierarchical feature extraction structure is considered an appropriate choice for representing raw traffic data.
In recent years, contrastive learning has led to significant advances in self-supervised representational learning [27][28][29].The basic idea is to pull the anchor closer to positive samples in the embedding space and push it away from negative samples.The positive samples are typically data augmentations of the anchor, while the negative samples are randomly selected samples from the minibatch, excluding the anchor and its augmentations.
The lack of labels may cause samples of the same class in the minibatch to be incorrectly regarded as negative pairs.Consequently, supervised contrastive learning [30] redefines positive samples as those from the same class and performs better than cross-entropy loss-based methods.In this paper, we apply supervised contrastive learning to develop a representation model for network traffic using labels.Some works have also proposed IDSs based on contrastive learning.FeCo [31] is a federated-contrastive-learning framework that aims to address users' privacy concerns and develop an accurate model for the benign class.Lopez-Martin et al. [32] represented features and labels within the same vector space.The study aimed to minimize the distance between each label prototype and its sample-class samples.However, the two methods are not end-to-end, and the tested datasets are featureready.Yue et al. [33] proposed a heuristic method to generate traffic augmentations and combined contrastive loss and cross-entropy loss to enhance the model training.However, they only tested the performance on raw traffic using a private dataset and still relied on statistical features from public datasets.
Table 1 provides a summary of the related works discussed in the paper.The proposed method can be executed in an end-to-end manner, eliminating the need for feature engineering.An additional benefit is the ability to detect unknown attacks.

The proposed CHCG
Yes Yes

Methodology
This paper proposes a network intrusion detection framework based on contrastive learning and a hierarchical representation model.The framework aims to solve the following problems: • How can we automate the extraction of general effective features from raw traffic data without the need for complex feature engineering?Extracting useful features from raw traffic is challenging because network behavior typically involves numerous packets, some of which may be encrypted.Considering the intra-and inter-features of packets, we design a hierarchical structure.We utilize CNN to learn the spatial features of a single packet and GRU networks to learn the temporal features of sequential packets.

•
How can a network intrusion detection model be trained to ensure a significant margin between the decision boundary and the training points?It has been shown that cross-entropy loss results in a much smaller margin between the decision boundary and the training points than optimal.As a result, the trained IDS is sensitive to noisy labels, susceptible to adversarial sample attacks, and challenging to detect unknown attack traffic.In this paper, we employ contrastive loss to represent benign traffic more compactly in the embedding space and separate attack traffic from benign traffic, thereby widening the margin of the decision boundary of the trained model.As a result, the trained model becomes more robust in the presence of noisy labels and adversarial samples, enabling the detection of unknown attacks.

Framework
Most machine learning-based IDS methods detect network anomalies by utilizing features extracted from flows, even though some flows may persist for a considerable duration.Therefore, these methods are more suitable for offline detection.In order to detect malicious traffic at an early stage, we aim to identify a specific number of packets as soon as they are captured.CHCG automates the extraction of raw traffic data features using CNN and GRU models without the need to wait for the flow to finish.This strategy avoids the cost of waiting for the end of a TCP connection and extracting statistical features.The framework of CHCG is shown in Figure 1

Capture Packets
Data formats and feature distributions vary significantly across different granularities of network flows.Dainotti et al. [35] summarized five common granularities of network flows, including TCP connections, flows, bidirectional flows, services, and hosts.A flow is typically characterized by the same 5-tuple (source IP, source port, destination IP, destination port, and transport layer protocol) over a period of time.A bidirectional flow is one in which the source and destination of the 5-tuple are interchangeable, and it is also known as a session.Flows and sessions are the most common granularities for network traffic analysis and are typically selected in most studies.The method proposed in this paper uses sessions as one of the detection granularities.
Another granularity used in this paper is IP pairs, also known as host pairs.Dainotti et al. [35] classified host granularity as the predominant traffic generated by a host, under the assumption that bidirectional traffic to and from the host can be observed.We refine this granularity by using the IP address pairs of both communicating parties as the analysis granularity, where source and destination IPs can be exchanged.Thus, it is more relevant than using one host's IP as the object of detection.Another advantage is that the granularity of IP pairs contains multiple sessions, which can reflect inter-flow characteristics.Especially for PortScan and DoS attacks, their intra-flow characteristics are closer to normal traffic, but the inter-flow characteristics exhibit clear anomalies.
Figure 2 illustrates the benefits of IP pair granularity in detecting PortScan attacks.Looking at just a single flow, such characteristics may also appear in benign traffic.When analyzing multiple flows together, anomalies can be detected immediately.Therefore, IP pair granularity can serve as a complement to flow and session granularity.It offers advantages in detecting attacks with distinct inter-flow characteristics.For both granularities, the captured packets are organized sequentially, and the time interval between packets should not be too long.If no new data packets are received within a certain period, such as 2 min, the packet collection process of a session or an IP pair should be halted, and the cache should be cleared to avoid unnecessary occupation of storage.It is reasonable to infer that prolonged waiting signifies the termination of previous communication.Assuming there are n collected packets from the same 5-tuple or IP pair, where p i represents the i-th packet, the packet sequence is represented as P = [p 1 , p 2 , . . ., p i , . . ., p n ].

Segment Packets According to a Fixed Window Size
Now, the collected packets are segmented based on a fixed window size, denoted as w.The objective of segmentation is twofold: to timely identify incoming traffic for alleviating storage constraints and to align the input for the deep learning model.For the session granularity, we select the first w packets within a session.
For the IP pair granularity, the detection window slide over the packet sequence.The sequence within a window can be denoted as P (w) = [p i , p i+1 , . . ., p i+w ].When the sequence has been processed, the window slides to the next position.We define s as the slide step and typically set its value to half of the window size.Before generating a new sequence, we check whether the number of remaining packets is more than s.If not, it is believed that the remaining data packets are too small to warrant further detection.If the number is greater than s, a new window sequence is generated, and empty packets are used to make up the shortfall when the remaining packets are insufficient to form a window.Then we obtain the new sequence P (w) ′ .

.3. Extract Packet Data
Raw traffic data usually refers to all the bytes contained in a packet.For PCAPformatted traffic, the bytes in each packet can be easily separated.These bytes contain the protocol headers and payloads for each layer of the network protocol and cover all communication information.Many studies have selected raw packet data as inputs and demonstrated satisfactory performance [15, 25,36].The advantages include lower costs for feature engineering and faster data processing speed.To minimize distractions, we remove the bytes associated with the MAC address and IP address.It is evident that such private information typically varies with each attack; otherwise, an address filter can effectively block the attack.Moreover, most artificial datasets have a predetermined attack trace or are anonymized due to privacy concerns.
Therefore, our approach takes the first l bytes from the packet, excluding the MAC and IP addresses, as the input for feature learning.If the packet has insufficient bytes, it will be padded with 0 × 00.Finally, we obtain the processed result, denoted as P (w) l .We sampled approximately 2% of the PCAPs from CSE-CIC-IDS2018 and analyzed the distribution of packet lengths, as illustrated in Figure 3.All the packets are within 1500 bytes, which is the Maximum Transmission Unit (MTU) of Ethernet.The percentage of 1500 bytes is 23.75%.Therefore, we set l as 1480 after removing IP and MAC addresses.

Feature Extraction
In this part, we aim to learn the representation of spatiotemporal features of the packet sequence.We first utilize CNN to learn the spatial features of each packet and obtain a feature vector.The output vectors of packets in the detection window form a sequence of vectors, serving as the input for GRU networks to learn temporal features.The output of the representation model is an m-dimensional vector, denoted as v.

Inputs
After data processing, we obtain P (w) l , the object of feature extraction.It can be formalized as: where a i represents the trimmed bytes in packet p i and b j represents j-th byte in a i .

Normalization
Prior to being fed into the model, it is essential to normalize all data.One-hot encoding is utilized in some studies [15,16].One byte is encoded into a 256-dimensional vector, and its effectiveness is empirically validated.Nevertheless, the increase in data dimensionality also incurs additional computational expenses.Some works [16,33] just treat a byte as a numerical value within the range of [0, 255] and normalize it by dividing it by 255.The packet header bytes contain important flags, lengths, or other information and are not always expressed in exactly one byte.For example, in the IPv4 protocol, two bytes are used for total length, and in the TCP protocol, two bytes are used for port numbers.Thus, simply considering one byte as a number within the range of [0, 255] poses challenges in learning the representative packet features.
Here, we employ a more fine-grained approach to normalization by integrating an embedding layer to encode the header bytes into a higher-dimensional space and dividing the other bytes by 255.Then, we concatenate the two parts.The embedding dimension can be selected from 1 to 256, and we choose 16 as a compromise.For a TCP packet, the header bytes typically include 14 bytes for the Ethernet protocol, 20 bytes for the IPv4 protocol, and 20 bytes for the TCP protocol.Therefore, the dimension of the embedding vector for the head bytes is 544 ((54 − 20) × 16).We concatenate all normalized values and reshape them into a 45 × 45 matrix, padding with some zeros.
Finally, the dimension of a i is adjusted to 45 × 45, i.e., a i ∈ R 45×45 .

Model Structures
We use 2D-CNN and GRU-based RNN to learn the spatiotemporal representation of P l .The hierarchical model structure is shown in Figure 4.In this paper, we utilize three CNN models: a basic model with only two convolutional layers, a relatively deeper model with four convolutional layers, and ResNet18, a classic model in the field of computer vision.
CNN converts the raw data of a single packet into a vector, and a vector sequence can be formed after all packets in the window have been processed.The sequence is input into the two-layer GRU networks.The output of the entire model is a 128-dimensional vector representing the spatiotemporal features of the packets in the detection window.

Packet Spatial Feature Extraction
CNN has demonstrated outstanding performance in the field of computer vision and has subsequently been expanded to various domains.In comparison to deep neural networks (DNN), CNN decreases the quantity of weights and computational complexity, thereby enhancing the efficiency of training and prediction.We utilize CNN for the extraction of packet spatial features.To evaluate the impact of the depth of CNN, three CNN models are employed.The first model consists of two standard convolutional blocks, which include convolution, batch normalization, rectified linear unit (ReLU), and maxpooling operations.The second model exhibits increased depth by incorporating four convolutional blocks and eliminating two maxpool layers.The third model utilized in this study is the classical ResNet18 model, which has been adapted to meet the specific requirements by modifying its input and fully connected layers.Subsequently, the vector generated by the CNN will be flattened and its dimensionality reduced to 256 through a linear layer.

Temporal Feature Extraction
RNN plays an important role in learning temporal features.LSTM [37] and GRU [38] are the most commonly used RNN units.Compared to LSTM, GRU has a simpler structure, resulting in lower computational costs.However, GRU can achieve performance as good as LSTM most of the time.Two GRU blocks are used in our model, each consisting of a 256-unit GRU layer and a dropout layer.
After learning the temporal feature, we add a flatten layer to normalize and reshape the feature.Afterward, we utilize another linear layer to decrease the dimension to 128.Finally, we obtain the spatiotemporal feature vector v ∈ R 128 of the windowed packet sequence.

Contrastive Loss
Our goal is to train a representation model so that the representation vectors of the same class traffic fall into a compact cluster, while the representation vectors of different class traffic are far from each other.We assume that the training dataset has been split into minibatches with the size of N, and each instance in the batch has been projected to a feature vector v i ∈ R 128 .The corresponding label is y i ∈ 0, 1, where 0 indicates a benign traffic instance and 1 indicates an attack instance.The supervised contrastive loss function used in our method is shown as Equation (4).
Here, τ ∈ R + represents a scalar temperature parameter, and L ij is the loss of the positive pair (v i , v j ).1(.) is the indicator function, and it equals 1 if the condition is satisfied.
A positive pair is defined as a vector pair with the same label.Then, we sum up all positive pairs to calculate the batch loss L. By minimizing the loss function, we aim to maximize the similarity among representations of the same class and minimize the similarity among representations from different classes.

Detecting
The idea is to calculate the similarity between the representation of the test instance v test and the average vector v of benign representations to determine whether the test instance is benign or not.
First, we generate all the benign representations.Then, we use density-based spatial clustering of applications with noise (DBSCAN) algorithm to remove outliers.Although most benign traffic representations are in a small cluster, some instances are still close to the boundary of intrusion traffic.It is reasonable to remove outliers for higher precision.DBSCAN is widely used in many real-world applications due to its simplicity, efficiency, and robustness [39].For the rest of the vectors, we calculate the average vector v by Equation (5).
where ∥ • ∥ 2 denotes the L2-norm function.The cosine similarity estimator is employed to quantify the similarity between the test vector and the average benign vector.
The similarity sim ranges from 0 to 1, and a value close to 1 indicates high similarity.Finally, we can use a threshold t to determine the category of the test instance.t needs to be manually selected according to specific requirements.A higher threshold leads to more precise detection results, but it also allows more malicious traffic to bypass detection.Conversely, a lower threshold increases the recall rate but also results in a higher false positive rate (FPR).

Dataset and Experimental Setup
We implemented CHCG on the PyTorch platform [40].We ran all the experiments on a computer with an Intel(R) Xeon(R) Gold 6248R CPU, a NVIDIA GeForce RTX 3090 GPU and Windows 10 OS.The datasets used in the experiments include CIC-IDS2017 [19] and CSE-CIC-IDS2018 [20], which are relatively lately published and have been frequently used by recent works.
The CIC-IDS2017 dataset was created by the Canadian Institute of Cybersecurity in 2017 and introduced by Sharafaldin et al. [19].They tried to generate realistic background traffic and used B-Profile system to profile the abstract behavior of human interactions.The naturalistic benign traffic provides a better assessment of the performance in the real world.They implemented several common attacks including BruteForce FTP, BruteForce SSH, DoS, Heartbleed, WebAttack, Infiltration, Botnet, and DDoS.CSE-CIC-IDS2018 was created with a similar method but in a more complicated network topology.The two datasets both provide raw PCAP data, up to 48.8 GB and 444.5 GB, respectively.
Liu et al. analyzed the errors [41] present in both datasets.They analyzed the inconsistency of the dataset labels with the actual attack behavior and provided a modified dataset.However, what they provide is still feature-ready data that cannot be directly used in the experiments.In fact, we split and labeled the PCAP data based on the attack paths, attack times, and other information from official documents, rather than the CSV files provided.According to our analysis, we found some issues that are consistent with the findings of Liu (2022) [41].For example, in CSE-CIC-IDS2018, the execution description of BruteForce FTP can not be matched with the PCAP data, so we removed it from the attack list.Infiltration was divided into two steps: first, the attacker compromises an internal host, and then uses that host to launch a network scanning attack.We extracted the traffic from the latter part as Infiltration.Liu et al. also raises the issue of empty payloads, which we did not distinguish and categorized them all as attacks.
Overall, the dataset we generated by extracting from the original PCAP file eliminates most of the problems in the feature-ready CSV dataset.While both datasets still suffer from the problem that some of the traffic does not exactly match the real attack process, it is reasonable to use them to evaluate our method.
The details of the CIC-IDS2017 and CSE-CIC-IDS2018 datasets at the IP pair granularity are summarized in Tables 2 and 3, including the numbers of total, train, and test samples.Obviously, there is a significant imbalance in both CIC-IDS2017 and CSE-CIC-IDS2018, which is consistent with the real world.The proportion of malicious traffic is much smaller than that of benign traffic, and the proportion of different types of malicious traffic also varies significantly.
It is reasonable to assume that we collected a mass of benign traffic and several categories of malicious traffic.The collected traffic was then labeled to train the model.In the experiments, we divided the attacks into known and unknown attacks based on the volume of their sample numbers.For CIC-IDS2017, we split the benign samples and known attack samples into the training and testing sets with a 1:1 ratio.The known attacks include BruteForce-FTP, BruteForce-SSH, PortScan, DDoS LOIC, DoS Hulk, and DoS GoldenEye.Other attacks were considered unknown and were only used during the testing phase.For CSE-CIC-IDS2018, the number of samples for benign and certain attacks is excessive, and it was unnecessary to use all the samples.The ratio of division between the training set and testing set is also 1:1.However, in the training and test sets, only 1,000,000 benign samples were randomly selected, and the maximum number of samples per attack was limited to 100,000.The known attacks include DDoS-HOIC, DDoS LOIC-HTTP, Botnet, SSH-BruteForce, DoS-Hulk, DDoS-LOIC-UDP, and Infiltration.
To evaluate the performance of the proposed CHCG, we used accuracy (AC), precision (PR), recall (RC), and F1-score (F1) as the metrics.Concerning intrusion detection tasks, all results correspond to the following four outcomes: (1) TP (true positive): malicious samples correctly identified as malicious; (2) TN (true negative): benign samples correctly identified as benign; (3) FP (false positive): benign samples incorrectly identified as malicious; (4) FN (false negative): malicious samples incorrectly identified as benign.Therefore, the formulas for the above metrics are defined as follows.
These metrics are heavily influenced by data imbalances.Due to the high proportion of benign traffic, even with high accuracy, a significant number of benign samples may be incorrectly classified as malicious, which results in low precision.For attacks involving a small number of samples, they have little impact on the overall metrics.Therefore, macro-averages and weighted-averages of the above metrics were used to evaluate the overall detection performance.Unknown-averages were used to evaluate the detection effectiveness of unknown attacks.Taking precision as an example, assuming there are n m types of attacks and n u types of unknown attacks, and each type of attack has M i samples, macro-average precision, weighted-average precision, and unknown-average precision are calculated as follows: Each attack contributes equally to macro-averages and is not affected by the sample size, so classes with low scores have a greater impact.Weighted-averages increase the impact of sample size, while unknown-averages are similar but only for unknown attacks.

Detection Results on Different Datasets
We trained and tested the three models (different in the CNN module) separately on CIC-IDS2017 and CSE-CIC-IDS2018.We segmented the PCAP data at the IP pair granularity, and each sequence has 20 packets.In addition, detecting unknown attacks is a key challenge.We evaluated the method's ability to detect unknown attacks by evaluating its effectiveness in detecting attack classes that were not part of the training set.
The detection results for the CIC-IDS2017 dataset are presented in Table 4.We present the recall, accuracy, and F1-score.Recall reflects CHCG's detection ability for each attack category, accuracy measures the detection effect on all samples, and the F1-score combines precision and recall.The proposed method demonstrates an improved overall detection effect on CIC-IDS2017, achieving recall and accuracy rates above 99% for most categories.Specifically, CHCG exhibits high F1-scores for unknown attacks such as WebAttack-BruteForce, WebAttack-XSS, and WebAttack-SQL Injection, but it performs poorly in detecting Infiltration and Botnet.We find that the infiltration contains only 39 flows, each with a long duration and a high volume of packets.This increases the difficulty of distinguishing benign streams after the detection window is split.The utilization of ResNet18 drastically improved the detection of Infiltration and Botnet, achieving recall rates of 88.67% and 91.85%.This indicates that deeper CNN modules can augment the feature extraction ability and improve the accuracy of detection.The low F1-score can be attributed to the significant proportion of benign samples.Despite achieving an overall detection accuracy of 99.9%, the number of misclassified benign samples remains significant, which results in a low detection precision for specific attacks.The weighted F1-score, which takes into account the effect of the number of samples, is at a higher level (above 97% using ResNet18).
The detection results for the CSE-CIC-IDS2018 dataset are shown in Table 5.Except for SQL injection, CHCG shows a high F1-score for both known and unknown attacks.The weighted accuracy reaches 99.9%, indicating a strong detection capability for both benign and malicious samples.Unknown attacks can also be effectively detected.The models achieve a recall of 100%, which means that all unknown attack samples were detected, with an unknown-F1-score of up to 96.77%.The enhanced detection of unknown attacks in the experiments can be attributed to two primary factors.Firstly, there exist certain resemblances among DoS subclasses, which allow the attacks in the training set to contribute to acquiring knowledge about the characteristics of unknown attacks.Secondly, the introduction of contrastive loss in this study effectively segregates benign samples from malicious samples in the representation space.This process not only generates a more condensed cluster of benign representations but also ensures that representations of unknown samples are distanced from the core of the cluster, facilitating effective recognition.

Detection Results with Different Hyperparameters
There are two essential hyperparameters in data processing and training, the detection window size w and the scaling temperature factor τ.

Detection Results with Different Window Size W
The window size plays a crucial role in determining the frequency of detection; a smaller w results in faster detection frequency and reduced packet storage pressure on the IDS.However, this may lead to increased computational pressure on the model and difficulty in effectively representing some attacks.Conversely, a larger w leads to higher storage pressure and potential introduction of excessive noise, thereby increasing the challenge in representation.
We used a 2-Layer CNN + GRU model to test the performance on different window size of 10, 20, and 30, and CSE-CIC-IDS2018 was chosen as the dataset.Table 6 shows the test results.It can be seen that when w is 30, the detection performance of CHCG is the worst, and unknown-average-F1 is about 3% lower than w = 10.When the parameter w is set to 10 and 20, CHCG can obtain better detection results.Both known and unknown attacks can achieve high recall rates and F1-scores.Figure 5 illustrates the distribution of sample representation vectors visualized after PCA dimensionality reduction.When w is set to 30, BruteForce-XSS and BruteForce-Web show dispersion and are partially mixed with benign traffic, so a certain number of malicious samples are classified as benign.When w is taken as 10 and 20, there is a significant difference in the distribution of benign and attack traffic, making the detection of malicious traffic easy.Contrastive loss plays a key role in separating benign and malicious samples, and the large margin between the two makes it easy to detect unknown attacks as well.

Detection Results with Different τ
As indicated in preceding sections, the scaling factor serves the purpose of equilibrating the intra-class and inter-class distances [42].A smaller scaling factor leads to a quicker change in the contrastive loss, an increased emphasis on hard samples, a higher penalty for ambiguous samples near the boundary, and a finer delineation of the class boundary.A larger scaling factor results in a slower rate of change in the contrastive loss, a reduced penalty for ambiguous samples near the boundary, and a smoother class boundary.We tested the CHCG with temperature factors of 0.03, 0.05, 0.07, 0.1, 0.15, 0.2, and 0.5 on CSE-CIC-IDS2018.As shown in Table 7, the performance of the model remains relatively stable as τ increases, indicating that the proposed model has a strong representation capability.However, in the process of experiments, we observed that the distributions of the representations after dimensional reduction exhibit a clear trend.The clusters are sparsest at τ = 0.03 and become progressively more compact as τ increases, which is consistent with the findings of the work [42].We recommend setting τ to 0.1 for optimal performance.

Ablation Study
CHCG is structured as a hierarchical architecture combining CNN and GRU with an Embedding layer aimed at improving the encoding of protocol header bytes.This section tests the Embedding layer used and the hierarchical CNN and GRU structures.

Embedding Layer
Experiments were conducted on the CSE-CIC-IDS2018 dataset using consistent parameter configurations after excluding the Embedding layer.The results of the experiments are displayed in the Table 8.After removing the Embedding layer, the detection capability of some attacks is significantly reduced, including DDoS-LOIC-UDP, Brute Force-XSS, and Brute Force-Web.Unknown-average recalls decreased by 1.12%, and Weighted-average F1-score decreased by 1.54%.The experimental results show that the performance of CHCG is improved by the addition of the Embedding layer.
This shows that it is necessary to enhance the representation of protocol header bytes.Header bytes contain important flags, lengths, or other information, and it is unreasonable to simply treat a byte as a number in the range [0,255].In addition, there is no clear semantic connection between the values corresponding to bytes, and similar values do not necessarily mean that they represent similar information.For example, the two bytes representing port numbers, 80 and 81, are far apart semantically, although the difference is very small numerically.
The complexity of network protocols presents challenges in evaluating the Embedding layer.Other network protocols, such as UDP and ICMP, may not always meet the practical requirements for the number of embedding bytes.In addition, there are optional fields in the IP and TCP protocol headers, which contribute to the expansion of the protocol header bytes.The utilization of encryption protocols further increases the number of protocol header bytes, making it challenging to find a unified embedding approach.The evaluation carried out in this paper mainly uses a standard TCP protocol packet as a template.It selects 34 bytes for embedding, excluding the MAC address and IP address.However, there may be a certain degree of non-adaptation for other protocols.

The Hierarchical CNN and GRU Model
Experiments were conducted on the CSE-CIC-IDS2018 dataset using a consistent parameter configuration with the CNN or GRU module removed.We used the encoder of a vanilla autoencoder to replace the CNN module, which consists of two linear layers.The experimental results are displayed in the Table 9.After using linear layer instead of CNN, the recall rate of malicious traffic remains high, but the detection effect of DDoS LOIC in known attacks is reduced to some extent.In addition, the F1-score also decreased to a certain extent, indicating that there were more misscores for benign samples.The group with the GRU removed had a larger drop in recall for unknown attacks, down 2.5%.However, the decrease of F1-score is small, indicating that the detection ability of benign samples is more accurate.The experiment verifies the practicability of the CNN and gated cycle unit (GRU) in enhancing the feature description ability of the model.

Conclusions
In this paper, we propose CHCG, a contrastive learning-based end-to-end intrusion detection framework with hierarchical CNN and GRU networks to extract features from raw traffic.The network architecture combines hierarchical CNN and GRU to autonomously extract spatiotemporal features from sequences of packets, eliminating the need for manual feature engineering.The inclusion of an embedding layer enhances the encoding of protocol header bytes, as empirically demonstrated to significantly improve detection precision and F1 scores.Contrastive learning improves the model's capacity to encode benign network traffic into a compact cluster while also amplifying the differentiation between benign and malicious representations.This, in turn, aids in the expansion of the margins of the decision boundary.The proposed framework is assessed utilizing commonly employed datasets, namely, CIC-IDS2017 and CSE-CIC-IDS2018.The experiments demonstrate that our method can attain a detection accuracy of 99.9% for known attacks, thus achieving state-of-the-art performance.Our method also achieved a recall rate of over 95% for all unknown attack samples.Our future work will focus on achieving a fine-grained classification of malicious network traffic while also maintaining the capability to identify unknown attacks.

Figure 2 .
Figure 2.The packets generated by the PortScan attack are displayed at the granularity of flow and IP pair, respectively.Each flow starts with a TCP SYN packet and then receives an ACK or RST, which can also occur in benign flows.However, at the IP pair granularity, a large number of consecutive SYN packets appear within a short period of time, indicating a clear anomaly.

Figure 3 .
Figure 3. CSE-CIC-IDS2018 packet length distribution ( 2% random sampling).The x-axis coordinates represent intervals with packet lengths ranging up to 50.Each packet number bar contains the minimum value of the interval, not the maximum value.

Figure 4 .
Figure 4.The hierarchical model structure of CHCG.

Table 2 .
Details of windowed sequences in CIC-IDS2017 for w = 20.

Table 6 .
Detection results with different window size on CSE-CIC-IDS2018.

Table 8 .
Performance on CSE-CIC-IDS2018 with and without the Embedding layer.

Table 9 .
Results of CNN and GRU ablation experiments.