2.1. Network Intrusion Detection Systems
NIDSs aim to enhance a network’s cyber defence by monitoring traffic and classifying it as either authorised benign traffic or unauthorised malicious traffic, which can then be escalated to the CSOC. Traditionally, this has been achieved using signature-based detection systems, in which domain experts manually craft features that uniquely identify specific classes of malicious traffic. Potential attacks are detected by matching observed traffic against these predefined signatures. While this approach typically yields high-precision detection, the expert effort required to develop and maintain signatures renders such systems difficult to scale.
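As a minimal illustration of signature matching, the sketch below checks a payload against a small set of hand-written patterns. The signature names and regular expressions are hypothetical examples, not rules taken from any deployed NIDS.

```python
import re

# Hypothetical, hand-crafted signatures: attack label -> payload pattern.
SIGNATURES = {
    "sql_injection": re.compile(r"(?i)union\s+select"),
    "path_traversal": re.compile(r"\.\./\.\./"),
}

def match_signatures(payload):
    """Return the labels of every signature that matches the payload."""
    return [label for label, pattern in SIGNATURES.items() if pattern.search(payload)]
```

Every new attack class requires a new entry in the signature table, which is the maintenance burden described above.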
In response to the limitations of signature-based detection, supervised machine learning has emerged as a scalable solution for NIDSs. These methods leverage large labelled datasets of network traffic to infer decision boundaries between benign and malicious behaviour, enabling effective classification. Gradient-free models exemplify some of the simplest approaches to supervised machine learning in NIDSs. For example, decision trees (DTs) learn to classify traffic by recursively partitioning the feature space using interpretable, rule-based splits. Random Forests (RFs) extend this approach by aggregating multiple decision trees trained on randomised feature subsets, improving robustness and reducing overfitting. Support Vector Machines (SVMs) perform classification by learning a maximum-margin hyperplane that separates benign and malicious samples in a high-dimensional feature space. Collectively, these models have demonstrated effectiveness in intrusion detection tasks using tabular network traffic data.
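The rule-based splits used by decision trees can be illustrated with a single-split tree (a decision stump). The sketch below is a simplified assumption: it searches one hypothetical feature for the threshold that minimises the misclassification count, whereas practical decision trees typically use impurity criteria and recurse over many features.

```python
def fit_stump(values, labels):
    """Find the threshold t for the rule 'value > t => malicious (label 1)'.

    Returns (threshold, number of training errors) for the best split.
    """
    best = None
    for t in sorted(set(values)):
        errors = sum((v > t) != bool(y) for v, y in zip(values, labels))
        if best is None or errors < best[1]:
            best = (t, errors)
    return best

# Hypothetical feature (e.g. packets per second) with labels 0 = benign, 1 = malicious.
rates = [10, 20, 30, 400, 500]
labels = [0, 0, 0, 1, 1]
threshold, errors = fit_stump(rates, labels)
```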
Similarly, deep learning approaches have also found success in NIDSs. Initially, deep belief networks [2] produced competitive performance. However, these were later supplanted by Multi-Layer Perceptrons (MLPs), which are now more commonly used. While alternative network architectures have been proposed, they are not well suited to tabular data. For example, Convolutional Neural Networks (CNNs) [3] are sensitive to the order of tabular features in the input vector and do not model the relationship between all features in early layers. Furthermore, an equivalent network can be found using a regularised feedforward neural network. Similarly, Long Short-Term Memory (LSTM) networks, initially developed for time series data, have been employed in NIDSs [4]. However, their inherent design for time series analysis means that tabular features must be processed sequentially, which can diminish performance and lead to inefficient training for tabular tasks.
While machine learning models have achieved state-of-the-art (SOTA) performance in network intrusion detection tasks and enabled more scalable NIDSs, they are not without limitations. In particular, both gradient-free and deep learning approaches typically rely on large volumes of labelled training data to achieve robust generalisation. This dependency introduces a critical vulnerability window in operational settings: newly deployed networks or emerging zero-day attacks lack sufficient historical data for model training, leaving systems exposed during the interval between network deployment or attack emergence and the accumulation of sufficient labelled data.
2.2. Anomaly Detection
In an attempt to overcome the reliance of supervised machine learning models on large labelled datasets, anomaly detectors instead aim to learn the distribution (or a distributional likelihood proxy) of benign traffic, allowing malicious traffic to be identified as outliers at test time. As they are trained exclusively on benign traffic, anomaly detectors can, in principle, detect both known and zero-day attacks without requiring prior knowledge of specific malicious classes.
Statistical anomaly detection leverages distance metrics or probabilistic models to identify deviations from normal network behaviour, thereby flagging potentially malicious traffic. Distance-based methods construct a reference vector from a set of tabular features using summary statistics such as the mean or median, or more generally a learned centroid or distribution. Test samples are then assigned an anomaly score based on their Minkowski distance from this reference vector [5]. Extensions of this approach generalise to the Frobenius and Grassmannian distances [6]. Moving away from global statistics, neighbourhood distance metrics such as the nearest-neighbour distance and the local outlier factor assign anomaly scores by estimating local densities [7]. Discriminative statistical methods can also be adapted for anomaly detection; for example, one-class SVMs [8] and isolation forests [9] are extensions of SVMs and tree-based models, respectively.
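The distance-based scoring described above can be sketched as follows. This is a minimal example under stated assumptions: the reference vector is the column-wise mean of benign samples, the order `p` of the Minkowski distance defaults to 2 (Euclidean), and the toy feature values are hypothetical.

```python
from statistics import mean

def fit_reference(benign):
    """Column-wise mean of the benign feature vectors (the reference vector)."""
    return [mean(column) for column in zip(*benign)]

def minkowski(u, v, p=2.0):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

def anomaly_score(x, reference, p=2.0):
    """Larger scores indicate samples further from typical benign traffic."""
    return minkowski(x, reference, p)

# Hypothetical benign training samples with two tabular features.
benign = [[1.0, 2.0], [3.0, 2.0], [2.0, 2.0]]
reference = fit_reference(benign)
```

A sample is flagged when its score exceeds a threshold, which in practice would be calibrated on held-out benign traffic.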
While classical statistical approaches are computationally efficient and interpretable, deep learning-based anomaly detectors are better suited to capturing complex, non-linear patterns in high-dimensional data. Reconstruction-based approaches train a neural network to reproduce its input from a compressed or corrupted representation, with the assumption that models trained on benign data will reconstruct normal traffic accurately, while large reconstruction errors will be produced for anomalous samples. Autoencoders implement this paradigm by learning a low-dimensional latent representation through a bottleneck layer [10,11]. Conversely, sparse autoencoders favour the use of regularisation over a bottleneck layer [12]. Variations of the autoencoder, such as DAE-LR [13] and DUAD [14], aim to improve anomaly detection performance through techniques such as additional regularisation and iterative data filtering. However, DUAD is unlikely to be applicable to a few-shot learning task because it relies on iterative data filtering, which is ill-suited to a regime in which data is scarce.
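The reconstruction-error principle can be illustrated with a deliberately simplified stand-in for an autoencoder. In the sketch below, the encoder and decoder are a fixed linear projection onto one hypothetical unit-norm direction `W` (a one-dimensional bottleneck); a real autoencoder would learn non-linear weights from benign data, and the threshold rule shown is only one possible choice.

```python
def encode(x, w):
    """Project x onto the single bottleneck dimension."""
    return sum(a * b for a, b in zip(x, w))

def decode(z, w):
    """Map the bottleneck value back to the input space."""
    return [z * b for b in w]

def reconstruction_error(x, w):
    """Squared Euclidean distance between x and its reconstruction."""
    x_hat = decode(encode(x, w), w)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def is_anomalous(x, w, threshold):
    return reconstruction_error(x, w) > threshold

W = [0.6, 0.8]                      # hypothetical direction, fixed rather than learned
benign = [[3.0, 4.0], [6.0, 8.0]]   # toy benign samples lying along W
# Assumed rule: threshold just above the largest benign reconstruction error.
threshold = max(reconstruction_error(x, W) for x in benign) + 1e-6
```

Samples that lie along the benign direction are reconstructed almost exactly, whereas a sample orthogonal to it, such as `[4.0, -3.0]`, is reconstructed poorly and is flagged.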
Generative deep learning models such as Generative Adversarial Networks (GANs) [15] and Variational Autoencoders (VAEs) [16] aim to directly model the distribution of the benign input traffic. These learned distributions can then be employed for anomaly detection either by generating synthetic samples to train discriminative classifiers, or by approximating likelihood estimates under the learned distribution.
Finally, hybrid approaches aim to achieve performance gains by parameterising statistical techniques. Deep Autoencoding Gaussian Mixture Models (DAGMMs) [17] combine an autoencoder with a Gaussian mixture model (GMM), where the autoencoder learns a low-dimensional representation of benign data and the GMM estimates the density of this representation to assign anomaly scores. Similarly, AutoSVMs [18] apply a one-class SVM in the autoencoder’s representation space. Deep Support Vector Data Description (Deep SVDD) [19] parameterises distance-based methods by learning a representation in which benign traffic is clustered around a centroid in representation space.
While anomaly detectors are able to detect malicious traffic without labelled training examples, they typically exhibit poor classification performance, resulting in false-positive rates which are far too high for use in practical systems [20].
2.3. Contrastive Learning
Contrastive learning has recently emerged as a promising approach for mitigating the vulnerability window faced by network intrusion detection systems after the establishment of a new network or emergence of a zero-day attack, owing to its ability to train effective classifiers from limited labelled data. Unlike conventional supervised learning approaches that rely on absolute class labels, contrastive learning leverages relative similarity relationships between samples to learn discriminative feature representations that can generalise beyond the classes observed during training.
The simplest and most widely studied contrastive learning architecture is the Siamese network. As illustrated in Figure 1, the Siamese network consists of two identical neural networks with shared weights that process a pair of input samples in parallel.
Training relies on a binary similarity label, which indicates whether the pair of input samples is considered similar or dissimilar. In supervised contrastive learning, sample pairs drawn from the same class distribution are regarded as similar, whilst sample pairs drawn from differing class distributions are regarded as dissimilar. The contrastive loss function, defined in Equation (1), minimises a distance metric over the output space, typically the Euclidean distance, between the embeddings of similar sample pairs, while maximising the distance, up to a margin, between the embeddings of dissimilar pairs. Here, the output vectors of the neural network are referred to as embeddings; the network parameters are omitted from the notation for simplicity.
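The behaviour of the contrastive loss on a single pair of embeddings can be sketched as below. The names `z1`, `z2`, `similar` and `margin` are illustrative assumptions rather than the notation of Equation (1), and the squared form without scaling constants is one common variant.

```python
import math

def contrastive_loss(z1, z2, similar, margin=1.0):
    """Contrastive loss for one pair of embeddings.

    Assumed convention: `similar` is True when both samples come from the
    same class distribution. Similar pairs are pulled together; dissimilar
    pairs are pushed apart until they are at least `margin` away.
    """
    d = math.dist(z1, z2)  # Euclidean distance between the embeddings
    if similar:
        return d ** 2
    return max(0.0, margin - d) ** 2
```

Dissimilar pairs that are already further apart than the margin contribute zero loss, so training effort concentrates on pairs that violate the constraint.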
As the contrastive loss directly optimises a geometric distance between sample embeddings, the output of the Siamese network can be interpreted as a learned geometric embedding space. In this space, the relative positions of samples encode semantic similarity, where embeddings of samples drawn from the same class distribution are encouraged to lie close together, while embeddings of samples drawn from different class distributions are separated by at least a margin. In an optimal embedded space, the smallest inter-class distance is larger than the greatest intra-class distance.
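The optimality condition stated above can be checked directly on a set of labelled embeddings. The helper below is an illustrative sketch (the function name is hypothetical) that compares the smallest inter-class distance with the greatest intra-class distance using the Euclidean metric.

```python
import math
from itertools import combinations

def is_well_separated(embeddings, labels):
    """True if the smallest inter-class distance exceeds the largest
    intra-class distance over all pairs of embeddings."""
    intra, inter = [], []
    pairs = combinations(list(zip(embeddings, labels)), 2)
    for (zi, yi), (zj, yj) in pairs:
        (intra if yi == yj else inter).append(math.dist(zi, zj))
    # Both kinds of pair must exist for the comparison to be meaningful.
    return bool(intra) and bool(inter) and min(inter) > max(intra)
```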
Once the Siamese network is trained, the resulting embedding space can be exploited for downstream tasks such as classification or similarity-based inference by comparing the distance of test samples to training embeddings of each class. This approach has been found to be robust when the Siamese network is trained on a limited training dataset. Previous works have applied Siamese networks to network intrusion detection, including few-shot learning, demonstrating their potential for detecting previously unseen attacks [21,22]. However, these approaches relied on randomly sampled training and inference pairs, which can lead to slow convergence and sub-optimal classification performance. Subsequent work has explored alternative modalities, such as image-based representations of network traffic [23], or other tasks such as zero-shot intrusion detection [24].
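Distance-based inference in a learned embedding space can be sketched with a k-nearest-neighbour vote, which is one possible decision rule among several. The embeddings below are hypothetical two-dimensional points standing in for the outputs of a trained network.

```python
import math
from collections import Counter

def knn_predict(query, train_embeddings, train_labels, k=3):
    """Label a query embedding by majority vote among its k nearest
    training embeddings under the Euclidean distance."""
    ranked = sorted(
        zip(train_embeddings, train_labels),
        key=lambda pair: math.dist(query, pair[0]),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical embeddings of a few labelled training samples.
train_z = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["benign", "benign", "benign", "attack", "attack", "attack"]
```

Because only distances to stored embeddings are needed, new classes can be added at inference time by embedding a handful of labelled examples, without retraining the network.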
Triplet networks process an anchor sample, a positive sample drawn from the same class distribution, and a negative sample drawn from a different class distribution using identical networks with shared weights. Training is guided by the triplet loss function, which enforces that the anchor is closer to the positive sample than to the negative sample by at least a margin. By focusing on relative distance relationships rather than absolute similarity, triplet networks impose a less restrictive learning criterion that better preserves intra-class variability while maintaining inter-class separation. This formulation has been shown to produce more flexible and discriminative embeddings in other domains, such as face recognition [25].
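The triplet constraint described above can be written as a hinge on the difference between two distances. The sketch below assumes unsquared Euclidean distances and illustrative argument names; some formulations use squared distances instead.

```python
import math

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss for one (anchor, positive, negative) triple of embeddings.

    The loss is zero once the negative is further from the anchor than the
    positive by at least `margin`.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

Unlike the contrastive loss, the positive pair is not required to collapse to zero distance; only the ordering of the two distances is constrained.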
A key limitation of Siamese networks in the context of network intrusion detection is their relatively strict learning constraint. By encouraging all samples from the same class to collapse toward a single point in the embedding space, Siamese networks may struggle to capture the significant intra-class variability present in network traffic, where benign behaviour and attack patterns often exhibit complex and partially overlapping distributions. In contrast, triplet networks optimise a relative distance constraint, requiring only that an anchor is closer to a positive sample than to a negative sample by a margin. This is a less restrictive objective, allowing the learned representation to preserve greater intra-class variability while still maintaining inter-class separation. This property is particularly beneficial in the few-shot intrusion detection setting considered in this work. Furthermore, the use of online triplet mining allows the model to exploit a larger number of informative training relationships than methods based on fixed pre-sampled pairs, which further contributes to improved performance.
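Online triplet mining selects triplets from the embeddings of the current batch rather than from a fixed pre-sampled list. The sketch below shows the batch-hard strategy as one common example; it is an assumption made for illustration, since the mining strategy is not specified in this section.

```python
import math

def batch_hard_triplets(embeddings, labels):
    """For each anchor in the batch, select the furthest same-class sample
    (hardest positive) and the nearest other-class sample (hardest negative).

    Returns a list of (anchor, positive, negative) index triples.
    """
    triplets = []
    for a, (z_a, y_a) in enumerate(zip(embeddings, labels)):
        positives = [i for i, y in enumerate(labels) if y == y_a and i != a]
        negatives = [i for i, y in enumerate(labels) if y != y_a]
        if not positives or not negatives:
            continue  # this anchor cannot form a valid triplet in this batch
        p = max(positives, key=lambda i: math.dist(z_a, embeddings[i]))
        n = min(negatives, key=lambda i: math.dist(z_a, embeddings[i]))
        triplets.append((a, p, n))
    return triplets

# Hypothetical one-dimensional embeddings for a batch of five samples.
batch_z = [[0.0], [1.0], [3.0], [4.0], [10.0]]
batch_y = [0, 0, 0, 1, 1]
```

Because the triplets are recomputed from the current embeddings at every step, the selected relationships track the state of the model during training.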
In the context of intrusion detection, triplet-based contrastive learning has been explored for several adjacent purposes. RENOIR [26] employed a triplet loss to accelerate convergence by constructing triplets using autoencoder reconstructions of benign and malicious traffic. Other work has used triplet losses for knowledge distillation, reducing model size while preserving classification performance for deployment in resource-constrained environments [27]. These studies demonstrate the potential of triplet networks to improve representation learning efficiency and robustness in network security tasks.
More recently, a number of studies have explored few-shot learning for network intrusion detection using alternative representation learning and meta-learning paradigms. A mutual centralised learning framework has been proposed to model bidirectional relationships between support and query samples, demonstrating strong performance in both binary and multiclass settings [28]. Prototypical capsule networks with attention mechanisms have been introduced to improve minority-class discrimination, while adaptive feature fusion strategies have been proposed to enhance representation quality in prototype-based few-shot learning [29,30]. In addition, model-agnostic meta-learning approaches have been applied to intrusion detection to enable rapid adaptation to new attack classes, and class-incremental few-shot learning frameworks have been explored to allow intrusion detection systems to continuously incorporate newly observed attack types [31,32].
These studies highlight the growing interest in few-shot and limited-data intrusion detection; however, they primarily focus on prototypical, meta-learning, or incremental learning formulations. In contrast, this work builds on prior work that applied Siamese networks to few-shot intrusion detection [22] by extending the contrastive learning framework to triplet networks. Additionally, several improvements are proposed, including the use of online triplet mining and a k-nearest-neighbour (KNN) classifier. By leveraging the increased flexibility of triplet-based learning, the proposed approach aims to better capture the complex structure of network traffic data and improve generalisation performance when only limited labelled samples are available.