Fast and Accurate Multi-Task Learning for Encrypted Network Traffic Classification

Abstract: The classification of encrypted traffic plays a crucial role in network management and security. As encrypted network traffic becomes increasingly complicated and challenging to analyze, there is a growing need for more efficient and comprehensive analytical approaches. We propose a network traffic classification method that uses multi-task learning to train multiple tasks simultaneously within a single model. To validate the proposed method, we conducted experiments on the ISCX 2016 VPN/Non-VPN dataset, which comprises three tasks. The proposed method outperformed the majority of existing methods, with 99.29%, 97.38%, and 96.89% accuracy on the three tasks (i.e., encapsulation, category, and application classification, respectively). It was also more efficient than existing methods other than lightweight models. These results show that the proposed approach achieves accurate and efficient multi-task classification of encrypted traffic.


Introduction
The advancement of science and technology and ultra-high-speed networks is accompanied by the rise of various applications. With the advancement of modern network technologies such as cloud computing and edge computing, research on efficient network management has been actively conducted. Within this research, network traffic classification is one of the key factors for efficient network management [1][2][3][4][5].
Traffic classification methods encompass traditional, signature-based, learning-based, and transformer-based approaches [3][4][5][6][7][8]. Traditional methods rely on port-based and payload-based techniques. Port-based classification uses source and destination ports, offering simplicity and low computational cost, but it faces limitations with dynamic ports. Payload-based classification utilizes fixed payload content, providing simplicity and high performance, but it is ineffective on encrypted traffic and struggles to adapt to new protocols [9]. Signature-based methods classify traffic based on specific patterns or signatures, demonstrating high performance for defined signatures. However, they face challenges in adapting to changing patterns and encrypted traffic. Overall, network traffic classification research plays a key role in enhancing efficient network management amid evolving technological landscapes.
With recent advances in AI and related technologies, most studies now use learning-based methods. Learning-based methods utilize machine learning (ML) and deep learning (DL) algorithms to learn and classify traffic. Models are trained on large amounts of traffic data to identify specific patterns or trends, which are then used to predict or classify new traffic. Because they learn these patterns directly from data, many studies have adopted learning-based methods, and they have improved performance in many areas.
Transformer-based methods are among the more recent deep learning techniques to emerge, applying structures that have performed particularly well in natural language processing (NLP) to traffic classification [33][34][35][36]. The self-attention mechanism of the transformer effectively learns the global dependencies of sequence data, which has shown promising performance in a variety of applications. For instance, the field of NLP has witnessed a notable advancement with the introduction of bidirectional encoder representations from transformers (BERT) pre-training models [35,36]. BERT has demonstrated high performance in many fields and can be effectively applied to downstream tasks by learning unbiased representations of relationships and structure from unlabeled data. In line with this trend, many studies in the field of network traffic classification have applied transformer-based methods. These methods have shown higher performance than traditional learning-based methods.
With the growing concerns regarding personal privacy and security, most applications now utilize encrypted traffic [37][38][39]. As encrypted communications protect payload content, traditional traffic classification methods have become inapplicable. Researchers use publicly available encrypted traffic datasets such as ISCX 2016 VPN/Non-VPN [40] for encrypted traffic classification studies. In these studies, public datasets are mainly divided into those for intrusion detection systems (IDS) and those for application classification, each of which is in turn divided into specific tasks. For example, the ISCX 2016 VPN/Non-VPN dataset, which is often used for application classification studies, consists of three tasks: encapsulation, category, and application.
Traffic classification methods are categorized into single-task learning (STL) and multi-task learning (MTL) based on the target data task. STL focuses on training a model for a specific task, enhancing performance by learning task-specific features and patterns. However, this optimized model may have limited applicability to other tasks. On the other hand, MTL involves training a model on multiple related tasks, utilizing shared representations to improve overall performance. MTL shares common low-level features across tasks while incorporating task-specific high-level features. This approach is valuable for diverse yet interrelated tasks, leading to more efficient and effective learning [41][42][43].
Most network traffic classification research has traditionally used STL, and while classification performance has improved, there are some limitations to applying traditional STL. First, the evolving complexity of networks, including intricate network traffic patterns, new network environments, applications, and encryption technologies, has challenged the applicability of traditional STL. Second, STL requires training a separate model for each task, which is time and resource intensive. Third, malicious activity on the network is becoming increasingly sophisticated. Attackers are adept at evading or defeating traditional security methods, so detection requires analysis that is more detailed, more diverse, and broader than in traditional research. Therefore, it is essential to study traffic classification with MTL, which can address these limitations by analyzing network traffic more comprehensively and in more depth than STL.
In this paper, we propose a multi-task classification method for encrypted traffic that uses DistilBERT [36], a distilled variant of the transformer-based BERT model. This approach performs traffic classification for several tasks with a single training run. Our contributions can be summarized as follows:
• We adopt a multi-task learning (MTL) approach for encrypted traffic classification, leveraging the DistilBERT model. The proposed method is based on a model that can handle multiple classification tasks simultaneously. It allows for a thorough and detailed analysis of encrypted network traffic, addressing the complexity of various tasks within a unified training framework.
• To validate our proposed method, we conducted verification experiments, focusing on three specific tasks using the ISCX 2016 VPN/Non-VPN dataset. We compared our approach with other methods, assessing classification accuracy and efficiency. In terms of classification accuracy, we obtained accuracies ranging from 96.89% to 99.29% across the three tasks, outperforming the majority of existing methods. In terms of model efficiency, our approach showed favorable per-sample processing time compared to existing models. Through our experimental results, we validate that our proposed method, employing multi-task classification for encrypted traffic, is effective in terms of both classification performance and efficiency.
• We applied weight adjustments (class weight, task weight) within the model to address data imbalance and varying task difficulty. Through additional experiments, we validated the impact of both weights on performance improvement. This underscores the effectiveness of our approach in diverse scenarios, enhancing its applicability across various situations.
The remainder of this paper is organized as follows. Section 2 describes the related work, and Section 3 provides a detailed explanation of the proposed method. Section 4 presents experiments on the ISCX 2016 VPN/Non-VPN dataset, including a multi-task classification experiment, and Section 5 discusses several issues. Finally, Section 6 concludes the paper and outlines future research directions.

Appl. Sci. 2024, 14, x FOR PEER REVIEW 3 of 23

Overview of the Network Traffic Classification
Network traffic classification research is the study of analyzing the traffic generated by computer communications, which is essential for the effective management, monitoring, and security of computer networks. As shown in Figure 1, network traffic classification research is broadly classified according to the field of research, methodology, classification level, and data units processed. First, in terms of research areas, it consists of various subfields, including application classification [10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26], malicious traffic detection [29][30][31][32], user behavior profiling [27][28][29][30], and web fingerprinting [44][45][46], of which application classification and malicious traffic detection are the most widely studied. Second, in terms of methodologies, port-based and payload-based methods have traditionally been widely used. Port-based classification categorizes traffic based on known port numbers, which is unreliable because many applications use dynamic ports. Payload-based methods classify applications based on fixed payload content. Signature-based methods extend the mechanisms of payload-based methods to various traffic characteristics, defining common statistical, header, and behavioral characteristics of traffic as signatures and classifying based on them. Both payload-based and signature-based methods perform poorly on encrypted traffic. To overcome these limitations, learning-based methods using machine learning and deep learning are the most actively studied, and recently, methods using transformer models have also been proposed. Third, in terms of classification level, it consists of the following levels: application classification, which distinguishes each application; service classification, which categorizes the detailed features, services, and behaviors of the application; application type classification, which categorizes the characteristics of the application such as Chat or File Transfer; and encryption classification, which categorizes the presence or absence of encryption. Fourth, in terms of data units, traffic is handled as unidirectional or bidirectional flows, packets, or bursts. A flow is a set of packets that share the same 5-tuple (source IP address, destination IP address, source port, destination port, and transport protocol) in the packet header, and a burst is a set of time-adjacent network packets originating from either the request or the response in a single-session flow [34].
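The flow definition above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the `Packet` record, the `flow_key` helper, and the sample addresses are hypothetical and are not taken from the dataset or from any tool used in this paper. The sketch groups packets into bidirectional flows by sorting the two endpoints, so that both directions of a conversation map to the same key.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical, simplified packet record used only for this sketch."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str  # "TCP" or "UDP"
    payload: bytes


def flow_key(pkt: Packet) -> tuple:
    """Build a direction-independent key from the 5-tuple.

    Sorting the two (ip, port) endpoints makes A->B and B->A
    packets fall into the same bidirectional flow.
    """
    first, second = sorted([(pkt.src_ip, pkt.src_port), (pkt.dst_ip, pkt.dst_port)])
    return (first, second, pkt.protocol)


def group_bidirectional_flows(packets):
    """Group packets into bidirectional flows, preserving arrival order."""
    flows = defaultdict(list)
    for pkt in packets:
        flows[flow_key(pkt)].append(pkt)
    return dict(flows)


packets = [
    Packet("10.0.0.1", "10.0.0.2", 50000, 443, "TCP", b"request"),
    Packet("10.0.0.2", "10.0.0.1", 443, 50000, "TCP", b"response"),
    Packet("10.0.0.1", "10.0.0.3", 50001, 53, "UDP", b"query"),
]
flows = group_bidirectional_flows(packets)
for key, pkts in flows.items():
    print(key, len(pkts))
```

A unidirectional flow would use the unsorted 5-tuple as the key instead, so the request and the response would land in two separate flows.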
As mentioned before, we propose a multi-task classification method for encrypted traffic using DistilBERT to perform encapsulation, application type, and application classification on ISCX 2016 VPN/Non-VPN data. In Figure 1, the green-colored parts represent the four aspects of our proposed method.
In [10], Lotfollahi et al. introduced Deep Packet, a system that utilizes a stacked autoencoder and CNN. They achieved an F1 score of 98% for application identification on the ISCX 2016 VPN/Non-VPN dataset. Wang et al. [11] introduced a method that converts packets into images and processes them using a 1D-CNN, which showed promising results on ISCX 2016 VPN/Non-VPN. In [12], Zou et al. present an encrypted network traffic classification approach using CNNs and LSTM networks. The authors of [13] proposed a fusion of CNNs and RNNs for service recognition in IoT traffic. The authors of [14] used naïve Bayes, C4.5 decision trees, Bayesian networks, and naïve Bayes trees, and performed a comprehensive analysis comparing the performance of these algorithms using 22 features extracted from network flows. In [15], the authors introduced the flow sequence network (FS-NET) for encrypted traffic classification. FS-NET utilizes both RNNs and a multi-layer encoder-decoder structure. In [16], the authors proposed FlowPic, a classification method that converts consecutive packet sizes in a flow into a two-dimensional gray image and uses CNNs for classification. While FlowPic is simple and performs well, it is not suitable for real-time traffic classification because it requires the capture of traffic over a long period of time. The authors also note that it is not applicable to classifying some encrypted traffic. In [17], the authors proposed TSCRNN, which automatically extracts features for efficient traffic classification based on spatiotemporal features. To validate the proposed method, the authors conducted experiments on ISCX Tor 2016 data and obtained high accuracy. In [18], the authors proposed MIMETIC, which exploits traffic data heterogeneity by learning both intra- and inter-modality dependencies to overcome performance limitations. MIMETIC outperforms single-modality DL-based and state-of-the-art ML-based mobile traffic classifiers. In [19], the authors propose an improved DAGSVM classification method by focusing on the error accumulation of the traditional DAGSVM algorithm. Experimental results show that the proposed method has higher classification accuracy than traditional DAGSVM while having an acceptable time cost. The studies in [39] and [47] focus on lightweight models rather than classification performance. While most studies primarily emphasize performance, these two studies highlight the importance of lightweight approaches for handling large-scale traffic data.
In recent years, there has been a surge in research centered on transformer architectures characterized by self-attention and multi-headed attention mechanisms. Transformer-structured models mainly utilize the BERT model, which has shown strong performance in the NLP field, but recently, research has also been conducted using the masked autoencoder (MAE), which is used in the CV field [33][34][35].
In [34], the authors proposed ET-BERT, a novel approach inspired by transformer architectures. It presents a new pre-training method designed for encrypted traffic classification and, after fine-tuning, achieves an accuracy of over 97%. In [21], the authors propose a method called PERT (payload encoding representation from transformer) utilizing dynamic word embedding. PERT outperforms other methodologies on publicly available encrypted traffic datasets and on captured Android HTTPS traffic. In [22], the authors propose the BFCN model, which combines BERT and CNN models to derive global traffic features with a pre-trained BERT model and byte-level local traffic features with a CNN model. The experimental results show F1 scores of 99.11% and 99.41% in the traffic service and application identification tasks on the ISCX 2016 VPN/Non-VPN dataset, respectively. In [23], similar to [22], a pre-trained BERT model and a bidirectional LSTM are applied together, with an accuracy of about 99%. In [33], the authors utilize DistilBERT to perform encrypted traffic classification research. They introduce comparative learning to enhance classification speed without degrading performance. Our study is similar to [33], but whereas [33] focuses on STL, our study specifically targets MTL. We apply MTL to simultaneously learn three tasks on a single model, resulting in superior performance.
The studies in [24,25] both utilize MAE for traffic classification research. The authors propose a pre-training model for MAE that introduces a mask patch model, a self-supervised learning pre-training task, to capture unbiased representations from bursts of varying lengths and patterns. Experimental results show that the proposed system achieves an accuracy of 98%, together with improved classification speed, memory efficiency, and robustness across a wide range of network traffic types.

Overview of the Multi-Task Learning
The advent of deep learning has led to significant performance improvements in CV and NLP, as well as network traffic classification. The typical approach is to learn these tasks in isolation, where a separate neural network is trained for each individual task [15][16][17][18][19][20][21][22][23][24][25]. Nevertheless, deep learning-based methods suffer from a number of limitations in terms of time and memory. Recently, research has been conducted on MTL techniques, which have shown promising results in terms of performance, computational efficiency, and/or memory efficiency [41][42][43]. MTL is the joint handling of multiple tasks through a learned shared representation. In [41], the author introduces hard parameter sharing and soft parameter sharing and discusses techniques such as deep relationship networks and fully adaptive feature sharing. In [42], the authors investigate various aspects of MTL. They first provide a definition of MTL and then categorize supervised MTL models into five main approaches and discuss their characteristics. The authors note that outlier tasks that are unrelated to other tasks are known to degrade the performance of all tasks when learned jointly, and they present this as a challenge. In [43], the authors present an overview of architectural and optimization-based strategies for MTL within the scope of deep neural networks. They also introduce how to set weights for each task in MTL. In summary, MTL leverages useful information from multiple related tasks with the goal of improving the generalization performance of every task. MTL is efficient in terms of performance, time, and memory as it can handle multiple tasks using a single model. However, it is important to consider the correlation between tasks, the structure of the model, and optimization because certain tasks can degrade the performance of others.
With the rising interest in MTL, there is a gradual increase in research applying MTL to traffic classification studies [48][49][50]. In [48], the authors claim to be the first to apply MTL in network traffic classification research and utilize CNNs to perform malware detection. In [49], the authors employ three time-series features and utilize a CNN for multi-task classification on the QUIC and ISCX 2016 VPN/Non-VPN datasets. The classification task is configured slightly differently compared to previous studies, and the detection performance appears relatively low, with an accuracy range of 82-92%. In [50], the authors perform multi-task classification using a transformer and a 1D-CNN, achieving an accuracy of 97-98% on the ISCX 2016 VPN/Non-VPN dataset. Their study is similar to ours, but we demonstrate accuracy exceeding 99% across all three tasks. Additionally, we evaluate the efficiency of multi-task classification, an aspect not addressed in their work.

Model Architecture
The entire system structure consists of three sub-systems (i.e., data preprocessing, byte tokenizing, and multi-task classification) and is shown in Figure 2. Data preprocessing converts raw traffic data into the input format required by the DistilBERT model, producing byte-separated data as the output. Byte tokenizing takes the data from the previous module as the input and performs tokenization for each byte. Multi-task classification takes the tokenized data as the input, performs embedding, runs it through the DistilBERT model, and predicts a label for each task.

Data Preprocessing
(1) Target Dataset: Many network traffic datasets have long been publicly available, and encrypted traffic datasets are now the most common among them. Of the several encrypted traffic datasets available, we use the ISCX 2016 VPN/Non-VPN dataset [40], which is the most popular in this research area. This publicly available dataset was captured from real traffic and is provided in raw pcap format, covering traffic from various applications. Because it has been used in many previous studies, it allows experimental results to be compared and interpreted across studies. The dataset is labeled for three tasks (i.e., encapsulation, category, and application), and separate classification studies are typically performed for each label. Table 1 shows information about the classes for each task. Encapsulation refers to the presence or absence of encryption on the target traffic and consists of two classes: VPN and Non-VPN. Category refers to the nature of the application and consists of six classes, excluding web browsing. Application indicates the application used and consists of sixteen classes.
(2) Preprocessing: We perform the following preprocessing. First, we convert the packet-level pcap file to flow-level. We segment the capture files into bidirectional flows using the SplitCap tool. Second, we remove irrelevant flows from the converted flow file. The ISCX 2016 VPN/Non-VPN dataset contains approximately 309 K flows in total. However, as noted in [51], the dataset contains many irrelevant flows. For example, it includes traffic that is not application-specific, such as NBSS, LLMNR, and DNS, as well as flows with a disrupted three-way handshake. Through the preprocessing steps outlined in [51], a total of 29,195 flows were identified. We performed further analysis and found that these flows still included a specific group characterized by UDP, a destination IP of 255.255.255.255, and a consistent inclusion of the string "Beacon~" in the payload. These flows were considered non-essential for the research objectives; therefore, we removed them from the converted flow data. After the first and second steps, we finally obtained 8763 flows.
Third, we performed zero-padding and flow splicing on the converted data. Considering the subsequent byte tokenization process, we extract 63 bytes from each of the first eight packets in the flow. In this process, if a packet has fewer than 63 bytes, we perform zero-padding. If the packet has more than 63 bytes, we perform splicing, that is, we keep only its first 63 bytes.
Based on other research [33,34] and experiments under various configurations, we chose 63 as the optimal byte value. The 63 bytes are composed of (1) the IP header, (2) the TCP or UDP header, and (3) the payload, depending on the network layer and data. The IP header is 20 bytes in both cases, but the TCP and UDP headers are 20 and 8 bytes long, respectively, so the payload that follows would start at a different offset. Therefore, the UDP header is extended to 20 bytes by zero-padding at its end. We also zero-pad flows that are shorter than the required length for the entire flow, and in the case of UDP, the additional padding of the UDP header is applied as well. Finally, we remove the Ethernet header and mask the IP addresses and ports with zeros. These fields are masked because they carry strong identifying information and can therefore bias the model. Figure 3 shows the distribution of bidirectional flows by class for the pre-processed data. In Figure 3, we can see that all three tasks suffer from data imbalance between classes, which we address in Section 3.2.1.
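The third preprocessing step can be sketched in Python as follows. This is a minimal illustration rather than our actual implementation, and it rests on several assumptions that are not stated in the text: each packet is given as raw bytes starting at the IPv4 header (the Ethernet header has already been removed), the IPv4 header carries no options (20 bytes), the protocol is known for the flow, and flows with fewer than eight packets are completed with all-zero packets. The function names `preprocess_packet` and `preprocess_flow` are hypothetical.

```python
PACKET_LEN = 63        # bytes kept per packet
NUM_PACKETS = 8        # packets kept per flow
IP_HEADER_LEN = 20     # IPv4 header without options (assumption)
TCP_HEADER_LEN = 20
UDP_HEADER_LEN = 8


def preprocess_packet(ip_packet: bytes, protocol: str) -> bytes:
    """Mask identifiers, align the UDP header, then cut or pad to 63 bytes."""
    data = bytearray(ip_packet)
    # Mask source and destination IP addresses (bytes 12..19 of the IPv4 header).
    data[12:20] = bytes(8)
    # Mask source and destination ports (first 4 bytes of the TCP/UDP header).
    data[IP_HEADER_LEN:IP_HEADER_LEN + 4] = bytes(4)
    if protocol == "UDP":
        # Extend the 8-byte UDP header to 20 bytes with zeros at its end,
        # so the payload starts at the same offset as for TCP.
        end = IP_HEADER_LEN + UDP_HEADER_LEN
        data[end:end] = bytes(TCP_HEADER_LEN - UDP_HEADER_LEN)
    data = data[:PACKET_LEN]                    # splicing: keep the first 63 bytes
    data.extend(bytes(PACKET_LEN - len(data)))  # zero-padding for short packets
    return bytes(data)


def preprocess_flow(packets, protocol: str) -> bytes:
    """Return 8 x 63 = 504 bytes for one flow."""
    rows = [preprocess_packet(p, protocol) for p in packets[:NUM_PACKETS]]
    rows += [bytes(PACKET_LEN)] * (NUM_PACKETS - len(rows))  # pad missing packets
    return b"".join(rows)


# Synthetic UDP packet: 20-byte IP header, 8-byte UDP header, 7-byte payload.
udp_packet = bytes([1]) * 20 + bytes([2]) * 8 + b"payload"
flow_bytes = preprocess_flow([udp_packet], "UDP")
print(len(flow_bytes))
```

With this layout, the payload of both TCP and UDP packets begins at byte 40 of each 63-byte row, which is the alignment that the UDP header padding is meant to provide.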

Byte Tokenizing
Byte tokenizing is the process of separating preprocessed data into bytes and converting the separated bytes into tokens. There are two parts to this process. First, we split the preprocessed data into bytes to use as the input. Second, we convert the extracted bytes into tokens. In this process, it is crucial to determine the number of tokens used to represent the data. If the number of tokens is too high, it may increase the data processing load, while too few tokens can result in the loss of essential information for classification, leading to performance degradation. Additionally, because BERT can handle a maximum of 512 tokens, selecting an appropriate number of tokens is essential. After experimenting with various combinations, we ultimately chose 63 bytes from each of the first eight packets, which yields 504 byte tokens and a total of 506 tokens once the two special tokens [CLS] and [SEP] are included. We present a performance comparison based on input shape in Section 5.3.
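The token budget described above can be checked with a few lines of Python. The sketch below is illustrative only: representing each byte as its decimal string and the helper name `byte_tokenize` are assumptions made for this example, and the mapping from tokens to vocabulary ids is deliberately left out.

```python
MAX_BERT_TOKENS = 512   # maximum sequence length BERT-style models accept
BYTES_PER_PACKET = 63
PACKETS_PER_FLOW = 8


def byte_tokenize(flow_bytes: bytes) -> list:
    """Turn each byte into one token and wrap the sequence in [CLS] ... [SEP]."""
    tokens = ["[CLS]"] + [str(b) for b in flow_bytes] + ["[SEP]"]
    if len(tokens) > MAX_BERT_TOKENS:
        raise ValueError(f"{len(tokens)} tokens exceed the {MAX_BERT_TOKENS}-token limit")
    return tokens


# A preprocessed flow holds 63 bytes for each of 8 packets (all zeros here).
flow_bytes = bytes(BYTES_PER_PACKET * PACKETS_PER_FLOW)
tokens = byte_tokenize(flow_bytes)
print(len(tokens))  # 506 = 8 * 63 + 2
```

Under the same scheme, 64 bytes per packet would give 8 x 64 + 2 = 514 tokens and exceed the limit, so 63 is the largest per-packet length that fits when eight packets are used.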

Multi-Task Classification
BERT is an NLP model that utilizes a transformer-based architecture and excels in bidirectionally understanding context within sentences. It encompasses two phases: pre-training and fine-tuning. In the pre-training phase, BERT is trained on extensive amounts of unlabeled data with two objectives: next sentence prediction (NSP) and masked language modeling (MLM). In NSP, the model learns to predict whether a sentence follows another sentence in the input text, enhancing its grasp of discourse-level context. In MLM, certain words in the input sentences are randomly masked, and the model is trained to predict these masked words, fostering a bidirectional understanding of context at the word level. In the fine-tuning phase, the pre-trained BERT model is further refined for specific tasks, such as text classification or question answering, optimizing the process for each task. In network traffic classification research, a large amount of unlabeled traffic is collected in a pre-training phase to learn the structure and relationships within the traffic. Each downstream classification task is then performed in a fine-tuning phase. In [33], pre-training was performed using about 30 GB of unlabeled traffic data, and five tests were performed with fine-tuning.
Our proposed method does not utilize an additional pre-training model and directly uses the fine-tuning model of DistilBERT.This is because in the field of network traffic classification, the pre-training process has several limitations.First, the traffic structure is very diverse and extensive, but the input dimensions of the BERT model are limited.Second, the temporal and spatial features in the packet header are ignored, resulting in performance degradation.These limitations make it difficult for the model to fully learn the characteristics of different network traffic.Third, the pre-training process is computationally intensive, requiring substantial time, memory overhead, and high-performance hardware due to the utilization of extensive traffic data.In addition, we perform byte-level tokenizing as in [51].As the authors of [51] note, the values derived from the previous traffic preprocessing and byte tokenizing are represented as integers between 0 and 255, allowing us to directly fine tune the DistilBERT [36] model, which is explicitly provided as "distilbert-base-uncased".
The output layer uses [CLS] as the final sequence representation for downstream task classification.The [CLS] token output may be converted into a class probability based on the task.MTL predicts multiple task labels from [CLS] tokens, with approaches such as hard parameter sharing (tasks share all parameters) and soft parameter sharing (tasks have their own parameters, sharing some).Hard parameter sharing is efficient with shared parameters, suitable for related tasks, while soft parameter sharing allows task specialization for tasks with diverse characteristics.
Therefore, it is important to consider the relevance and nature of the task within the target dataset and choose the appropriate method.As mentioned before, we target three different tasks in the ISCX 2016 VPN/Non-VPN dataset, and all three tasks are related to each other as they perform task-specific classification on the same data.Therefore, we utilized the hard parameter sharing for MTL, and Figure 4 shows the proposed MTL structure.
Figure 4 is organized into shared layers and task specific layers, where the model and different parameter sets are shared in the shared layer, and the task-specific layers are used to classify and derive results for each task.The shared layers include the embedding layer and the transformer encoding layer used by the DistilBERT model.specialization for tasks with diverse characteristics.
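To make the split between shared and task-specific layers concrete, the following toy sketch shows hard parameter sharing in plain Python. It is only an illustration under stated assumptions: a small random embedding table and one projection stand in for the DistilBERT embedding and transformer encoder layers, mean pooling stands in for the [CLS] representation, and the class name `HardSharedModel` and all sizes are hypothetical.

```python
import random
from typing import Dict, List

random.seed(0)


def linear(in_dim: int, out_dim: int) -> List[List[float]]:
    """Create a small random weight matrix of shape (out_dim, in_dim)."""
    return [[random.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def matvec(w: List[List[float]], x: List[float]) -> List[float]:
    """Multiply matrix w by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


class HardSharedModel:
    """Shared encoder plus one output head per task (hard parameter sharing)."""

    def __init__(self, vocab: int, hidden: int, task_classes: Dict[str, int]):
        # Shared layers: toy stand-ins for the embedding and encoder layers.
        self.embedding = linear(hidden, vocab)  # one row of size `hidden` per token id
        self.shared = linear(hidden, hidden)
        # Task-specific layers: one linear classifier per task.
        self.heads = {task: linear(hidden, n) for task, n in task_classes.items()}

    def encode(self, token_ids: List[int]) -> List[float]:
        # Mean-pool token embeddings as a stand-in for the [CLS] representation.
        dim = len(self.embedding[0])
        pooled = [sum(self.embedding[t][d] for t in token_ids) / len(token_ids)
                  for d in range(dim)]
        return [max(0.0, v) for v in matvec(self.shared, pooled)]

    def forward(self, token_ids: List[int]) -> Dict[str, List[float]]:
        rep = self.encode(token_ids)  # computed once, reused by every head
        return {task: matvec(w, rep) for task, w in self.heads.items()}


model = HardSharedModel(
    vocab=259, hidden=8,
    task_classes={"encapsulation": 2, "category": 6, "application": 16},
)
logits = model.forward([1, 25, 6, 4, 2])
print({task: len(v) for task, v in logits.items()})
```

Because the shared representation is computed once per input, adding a task only adds one small output head rather than a second full model.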

Weight Adjustment

Class Weight for Imbalanced Data
As shown in Figure 3, the data are heavily imbalanced. Data imbalance is a significant challenge that constrains the performance of ML models, particularly when the samples of the minority class are insufficient [52,53]. To address this issue, common practices involve undersampling and oversampling techniques. However, these methods come with risks of underfitting and overfitting, respectively, potentially limiting the generalization ability of the model.
In recent research, weighted classes have been recognized as one approach to addressing data imbalance [33]. Weighted classes can significantly reduce the bias in the data; thus, we utilize a method for calculating class weights. Equation (1) indicates the method for calculating the normalized weights for each class. In Equation (1), $W_{ki}$ is the weight of class $i$ in task $k$, $C_{ki}$ is the number of samples with class label $i$ in task $k$, $k$ indicates the target task, and $j$ indicates the class label. These weights are used to adjust the training of the model, taking into consideration the imbalance within each class, thereby helping to enhance the overall model performance.
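As a rough illustration of normalized class weights, the sketch below uses inverse class frequency normalized so that the weights of one task sum to 1. This specific formula is an assumption made for the example; it is one common choice and is not confirmed to be the form of Equation (1). The function name `class_weights` is hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def class_weights(labels: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Inverse-frequency weights for one task, normalized to sum to 1 (assumed form)."""
    counts = Counter(labels)                             # samples per class
    inverse = {c: 1.0 / n for c, n in counts.items()}    # rarer class -> larger value
    total = sum(inverse.values())
    return {c: v / total for c, v in inverse.items()}


# Toy label list for the encapsulation task: 100 VPN and 300 NonVPN samples.
labels = ["VPN"] * 100 + ["NonVPN"] * 300
weights = class_weights(labels)
print(weights)
```

With this choice the minority class receives the larger weight, so its errors contribute more to the loss during training.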

Task Weight for Loss Calculation
In typical DL, loss is a metric that represents the difference between the model's predictions and the actual target. Minimizing this difference allows the model to learn the desired outcome more effectively. Loss is calculated through an objective function (loss function), most commonly cross-entropy or mean squared error. In multi-task classification, the loss is different for each task, so it is necessary to calculate the loss for each task and combine the losses effectively to obtain the final loss. Equations (2) and (3) indicate the method for accumulating losses in multi-task classification. In Equation (2), $y'_i$ is the model's predicted value, $y_i$ is the actual value, and $f_i$ is the objective function for task $i$. After the loss for each task is calculated, the losses are combined to obtain the final loss. In Equation (3), Total Loss is the final loss, which is the aggregate of the losses from each task, $N$ is the number of tasks, and $\alpha_i$ is a weight that represents the relative importance of each task.
In MTL, performance and learning time can vary due to differences in the difficulty of each task. Typically, easier tasks converge quickly to achieve high accuracy, while more difficult tasks face complications in convergence and require more extensive training. Allocating equal weights to all tasks in MTL may not be appropriate, as it could lead to higher weights for easier tasks, diminishing the model's learning capacity for difficult tasks. Therefore, in MTL, it is essential to consider the difficulty of each task and assign appropriate weights. Equation (4) illustrates a method for determining the weights for each task in light of their respective difficulties.
In Equation (4), $E_i$ represents the minimum number of epochs required to converge to performance β. β is measured by accuracy and can be dynamically adjusted. However, continuous weight adjustments may decrease the model's stability and increase the risk of overfitting to specific tasks. Therefore, we set β to 90% through various experiments. For example, assuming that there are four tasks and it takes 5 epochs in task #1, 10 epochs in task #2, 15 epochs in task #3, and 20 epochs in task #4 to achieve 90% accuracy each, the weights are set to 0.1 (5/50), 0.2 (10/50), 0.3 (15/50), and 0.4 (20/50), respectively.
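Based on the symbol definitions in the paragraph above, the per-task loss and the weighted total loss can be written as follows. This is a reconstruction from the surrounding description, and the name $\mathrm{Loss}_i$ for the per-task loss is an assumption; the notation in the published equations may differ.

```latex
\begin{align}
\mathrm{Loss}_i &= f_i\left(y'_i,\, y_i\right) \tag{2} \\
\mathrm{Total\ Loss} &= \sum_{i=1}^{N} \alpha_i \,\mathrm{Loss}_i \tag{3}
\end{align}
```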
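The worked example above implies that each task weight is the task's epoch count divided by the sum of the epoch counts over all tasks. The sketch below reproduces that arithmetic and combines it with the weighted sum of task losses; the function names and the toy loss values are hypothetical.

```python
from typing import List


def task_weights(epochs_to_beta: List[int]) -> List[float]:
    """Weight each task by its share of the epochs needed to reach accuracy beta."""
    total = sum(epochs_to_beta)
    return [e / total for e in epochs_to_beta]


def total_loss(task_losses: List[float], alphas: List[float]) -> float:
    """Weighted sum of the per-task losses."""
    return sum(a * loss for a, loss in zip(alphas, task_losses))


# Epochs needed to reach 90% accuracy in the four-task example from the text.
alphas = task_weights([5, 10, 15, 20])
print(alphas)  # [0.1, 0.2, 0.3, 0.4]

# Toy per-task losses (made up for illustration).
print(total_loss([0.2, 0.4, 0.6, 0.8], alphas))
```

Slower-converging tasks receive larger weights, which shifts the gradient signal toward the tasks that are harder to learn.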

Evaluation Environment Setup
The proposed method was implemented using Python 3.10.9 and PyTorch 2.0.1 with CUDA 11.8. All experiments were performed on a Linux Ubuntu 20.04.6 LTS server with a 24-core Intel(R) Core(TM) i9-10920X CPU (3.50 GHz) and an NVIDIA GeForce RTX 4090 GPU (24 GB memory). We set the optimal parameters for the model through various experiments: the learning rate was set to $2 \times 10^{-5}$, the batch size to 16, and the dropout ratio to 0.1, and AdamW was used as the optimizer. Each dataset is divided into a training set and a testing set at a ratio of 7:3. We randomly selected 500 samples from each task (6 categories, 16 applications in total) and entered them into the dataset; however, if the number of samples for some applications (e.g., Gmail, SFTP within the application classification) was less than 500, we selected all samples for that application.
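The sampling rule and the 7:3 split described above can be sketched as follows. This is an illustrative sketch only: the record layout `(sample id, class label)`, the function names, the fixed seeds, and the class name `app_large` are assumptions, and the authors' actual sampling code is not shown in the text.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, str]  # (sample id, class label); hypothetical record layout


def cap_per_class(samples: List[Sample], cap: int = 500, seed: int = 0) -> List[Sample]:
    """Randomly keep at most `cap` samples per class; keep all if fewer exist."""
    rng = random.Random(seed)
    by_class: Dict[str, List[Sample]] = defaultdict(list)
    for sample in samples:
        by_class[sample[1]].append(sample)
    selected: List[Sample] = []
    for label in sorted(by_class):
        group = by_class[label]
        selected.extend(group if len(group) <= cap else rng.sample(group, cap))
    return selected


def split_train_test(samples: List[Sample], train_parts: int = 7,
                     test_parts: int = 3, seed: int = 0):
    """Shuffle and split into training and testing sets at train_parts:test_parts."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]


# Toy data: one large class and one class with fewer than 500 samples.
data = [(f"a{i}", "app_large") for i in range(1200)] + [(f"g{i}", "Gmail") for i in range(80)]
subset = cap_per_class(data)
train, test = split_train_test(subset)
print(len(subset), len(train), len(test))  # 580 406 174
```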

Evaluation Metrics
When evaluating the performance of a model, the choice of evaluation metrics is important. We utilized four evaluation metrics that have been used in several studies: accuracy, recall, precision, and F1 score. Equations (5)-(8) show the method for calculating these metrics. True positive (TP) is when the model correctly classifies something as positive, and true negative (TN) is when the model correctly classifies something as negative. False positive (FP) is when the model incorrectly classifies something as positive when it was negative, and false negative (FN) is when the model incorrectly classifies something as negative when it was positive.
As previously mentioned, the ISCX 2016 VPN/Non-VPN data are highly imbalanced between classes. To account for the potential bias in the results due to the imbalance between the different categories of data, we used the macro average [36]. The macro average calculates the average value of precision, recall, accuracy, and F1 scores for each category to provide a more comprehensive and unbiased assessment across all categories.
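The metrics above can be computed from a confusion matrix as in the following sketch, which uses the standard definitions of precision, recall, and F1 per class and then averages them with equal weight per class (macro average). Accuracy is computed here as overall correct predictions divided by all predictions, which is an assumption about how it is reported; the function name and the toy matrix are hypothetical.

```python
from typing import Dict, List


def macro_metrics(confusion: List[List[int]]) -> Dict[str, float]:
    """confusion[t][p] = number of samples with true class t predicted as class p."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(n))
    precisions, recalls, f1s = [], [], []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[t][c] for t in range(n)) - tp  # predicted c, true class differs
        fn = sum(confusion[c]) - tp                       # true c, predicted otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n,  # macro average: every class counts equally
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy two-class confusion matrix (rows: true class, columns: predicted class).
metrics = macro_metrics([[8, 2], [1, 9]])
print(metrics)
```

Because each class contributes equally to the macro average, a poorly classified minority class lowers the score as much as a poorly classified majority class would.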

Evaluation Result
In this section, we describe our experiments and results to validate the proposed method. We present the classification performance of our proposed model in Section 4.3.1 and conduct a performance comparison with other models in Section 4.3.2. We validate the efficiency of our proposed method in Section 4.3.3 and provide further discussion in Section 5.
Figure 5 illustrates the confusion matrix detailing accuracy for each task, with one subfigure, (a), (b), and (c), per task. While the majority of classes within each task demonstrate a high accuracy exceeding 95%, AimChat and ICQChat in Figure 5c exhibit relatively lower accuracy. These applications, designed for online chatting and offering various services like voice and video calls, share common traits. The similarities between these applications make it difficult to distinguish their traffic patterns accurately, leading to decreased classification accuracy.
Figure 6 shows the learning curve for the three tasks in training and testing. In Figure 6, the losses represent the total losses for the three tasks, with the training and testing losses gradually decreasing.

Comparison with Other Model
To validate the performance of our proposed method, we compare its performance with various state-of-the-art methods in encrypted network traffic classification. For accurate performance validation, it is essential to compare methodologies using the same dataset with identical preprocessing methods in a consistent environment. However, direct comparisons of different methodologies are often impractical due to various constraints. Therefore, we took the performance reported by each methodology and used it for the comparison. The methods are categorized into the following: (1) statistical feature-based, (2) ML- and DL-based, and (3) pretraining-based, and a total of 17 methodologies are compared.
The proposed method achieves about 96~98% accuracy on tasks #2 and #3, outperforming most of the existing research methods. Although several methodologies exhibit slightly better performance (i.e., accuracy higher by 0.69-1.67% in task #2 and by 1.31-2.98% in task #3), it is noteworthy that the existing approaches are designed for STL-based single-task classification, while the proposed method is capable of classifying three tasks simultaneously. Through this multi-task classification, the proposed method not only maintains high performance but also proves to be more efficient than conventional approaches in handling the classification of multiple tasks concurrently.

Performance of the Efficiency
The proposed method utilizes MTL to perform multi-task classification on the ISCX VPN/Non-VPN 2016 dataset. The goal is to achieve high performance by simultaneously handling various classification tasks. The efficiency of the model refers to its ability to quickly adapt to downstream tasks. To evaluate this efficiency, we compared the proposed method with other approaches and measured the processing speed. However, the interpretation of the model's efficiency may vary depending on hardware performance and data. Therefore, maintaining the same experimental environment and dataset is crucial for a fair comparison. Since it is difficult to reproduce these conditions exactly, we compare our results to those presented in other studies [8,33,34,39]. Table 5 shows the results of the fine-tuning efficiency evaluation. In Table 5, ST represents the processing time for one sample and PT represents the processing time for one packet. Among the four models, ET-BERT and XENTC are general classification models, while MATEC and FastTraffic are models designed for lightweight purposes. All of them perform single-task classification.
From an ST perspective, ET-BERT yields a range of 8.30~9.61 ms. The lightweight models, MATEC and FastTraffic, yield 2.10 and 0.25 ms, respectively. The proposed method achieves higher efficiency than ET-BERT with an execution time of 1.27 ms, but it is less efficient than FastTraffic. Nevertheless, considering the results in Tables 3 and 4, the proposed method demonstrates 4.65~27.68% higher accuracy compared to MATEC and FastTraffic. Furthermore, the proposed method is more efficient than MATEC as it can learn the three tasks simultaneously.
From a PT perspective, ET-BERT yields 155.7 ms, XENTC produces 15.1 ms, MATEC results in 1.3 ms, and FastTraffic yields 0.59 ms. The proposed method achieves a PT of 30.7 ms, which is faster than ET-BERT but slower than XENTC, MATEC, and FastTraffic. The proposed method seems to exhibit relatively high PT since it processes eight packets within the flow. However, similar to the ST perspective, the proposed method demonstrates high efficiency considering both accuracy and multi-task classification.
The efficiency of a model is highly influenced by hardware performance and data structure, and there is typically a trade-off between model performance and efficiency. Considering this trade-off, evaluating the balance between model performance and efficiency becomes crucial. Further discussion is needed based on additional experimental results to better understand this trade-off and assess the overall performance and efficiency of the model. Therefore, in the future, we plan to enhance the model to achieve higher efficiency while maintaining its classification performance.

Discussions
In this paper, we demonstrate high performance and efficiency by performing multi-task classification on encrypted traffic. In this section, we provide some detailed discussion of the proposed method.

Effect of Class Weight in Data Imbalance
We applied class weights to address the class imbalance in the ISCX 2016 VPN/Non-VPN data in Section 3.2.1. A class weight reflects the proportion of each class in the data and can reduce the imbalance between classes. Figure 7 shows the distribution of data for each class before and after applying class weights. In Figure 7, the x-axis represents the classes per task and is organized the same as in Figure 3. For example, in Figure 7a, "1" and "2" represent {VPN, NonVPN}, respectively, and in Figure 7b, "1~6" represent {Chat, Email, File Transfer, P2P, Streaming, VoIP}. Comparing 'Before' and 'After' in Figure 7, we can see that the imbalance between the classes is significantly reduced. However, in Figure 7c, some imbalance remains because the minority classes have too few samples. These limitations will be tackled in the future with additional weighting and sampling techniques.

Performance Based on Weight Adjustment
In order to address both data imbalance issues and variations in difficulty across tasks, we applied weight adjustments during the experiments. The weight adjustment is implemented in two aspects: class weights and task weights. Class weights were introduced to mitigate data imbalance problems, while task weights were designed to prevent biased learning, particularly when there were significant differences in difficulty among tasks. The proper utilization of these two weights is crucial, especially in scenarios where specific tasks converge rapidly; failure to handle this appropriately may lead to biased learning. Figure 8 shows the test accuracy curve for the weight adjustment.
Figure 8a shows the results without weight adjustment, yielding an accuracy of 99% on task #1, 90~91% on task #2, and 89~90% on task #3. Figure 8b shows the results with only class weights applied, yielding an accuracy of 98% on task #1 and 93~94% on tasks #2 and #3. Figure 8c shows the results with only task weights applied, yielding an accuracy of 93% on task #1 and 94~95% on tasks #2 and #3. In Figure 8a,b, rapid convergence is observed in task #1, while Figure 8c tends to exhibit initially similar convergence speeds across the tasks, later showing higher performance in tasks #2 and #3 than in task #1. Figure 8d shows the results with both class and task weights applied, yielding an accuracy of 98~99% on task #1 and 96~97% on tasks #2 and #3. These experiments show that applying both class weights and task weights leads to higher performance. Therefore, it can be concluded that weight adjustment plays a crucial role in enhancing performance.

Performance Based on Input Shape
In Section 3.1.1, we described that we conducted several experiments with various input shapes based on the number of packets and bytes in the flow. Through these experiments, we set the optimal shape as 8 packets and 63 bytes. In this section, we compare the performance based on different input shapes. The input shape can be variably defined, and we set the range of packet counts to 4~8 and byte counts to 60~70, taking into account the handshake process and header (IP, TCP/UDP) byte sizes within encrypted communication.
As mentioned earlier, considering BERT's maximum input token limit of 512, we excluded cases where the total token count (packet count × byte count) exceeds 512.
Table A1 in Appendix A indicates the performance of the proposed method based on input shapes. Because multiple experiments were required to cover the input shapes, each experiment was conducted for 20 epochs with the same experimental setup. We selected task #3, the most challenging of the three tasks, as the target task. In Table A1, the highest performance is observed with (8,63). Therefore, we selected the optimal input shape as 8 packets and 63 bytes.
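The candidate input shapes described in this section can be enumerated with a short sketch. It covers packet counts 4~8 and byte counts 60~70 and discards shapes that exceed the 512-token limit. Reserving two positions for [CLS] and [SEP] when checking the limit is an assumption made here; the text states the criterion as packet count × byte count, so the `special_tokens` parameter can be set to 0 to follow that wording literally.

```python
from typing import List, Tuple

MAX_TOKENS = 512  # BERT input limit stated in the text


def candidate_shapes(special_tokens: int = 2) -> List[Tuple[int, int]]:
    """List (packets, bytes) shapes whose token count fits within MAX_TOKENS."""
    shapes = []
    for packets in range(4, 9):        # packet counts 4..8
        for n_bytes in range(60, 71):  # byte counts 60..70
            if packets * n_bytes + special_tokens <= MAX_TOKENS:
                shapes.append((packets, n_bytes))
    return shapes


shapes = candidate_shapes()
largest_for_8 = max(s for s in shapes if s[0] == 8)
print(len(shapes), largest_for_8)  # 48 (8, 63)
```

Under this assumption, (8, 63) is the largest shape that still fits with eight packets, which matches the 506-token input used by the proposed method.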

Conclusions
Network traffic classification has been studied for a long time, and recently, a lot of research has been conducted on encrypted traffic. Most studies perform single-task classification, with DL- and transformer-based methods performing well. However, there are limitations in their efficiency and effectiveness given the increasingly diverse and complicated nature of traffic.
In this paper, we proposed multi-task classification using DistilBERT. The proposed method can learn multiple tasks in one model with one training run. We applied a weight adjustment to improve the performance of our proposed method. The weight adjustment consists of class weights and task weights. Class weights mitigate the problem of data imbalance, and task weights prevent biased learning due to the difference in difficulty between tasks in multi-task classification.
To evaluate the proposed method, we conducted experiments in terms of accuracy and efficiency. In terms of accuracy, the proposed approach achieves 96.89-99.29% accuracy on three tasks, showing higher performance compared to most existing methods. Furthermore, in terms of efficiency, it outperforms ET-BERT. While the proposed method exhibits lower efficiency compared to FastTraffic and MATEC, which focus on lightweight design, it achieves a significantly higher accuracy, ranging from 4.65 to 27.68% higher than the two mentioned methods. We discussed the performance impact of class weights and weight adjustment in Section 5. In addition, we validated the decision to select 8 packets and 63 bytes based on performance experiments with input data shapes (in Appendix A, Table A1). This input shape covers the packets generated during the handshake process of TLS, which is the most widely utilized protocol today, and these handshake packets typically remain unencrypted. Therefore, we believe the method performs well even on encrypted traffic.
However, the proposed method has some limitations. First, although the proposed method demonstrated high performance on the ISCX 2016 VPN/Non-VPN dataset, validation was only conducted on this specific dataset. As the ISCX 2016 VPN/Non-VPN dataset comprises a small amount of data, leveraging AI models may yield high performance. Therefore, additional validation experiments on other datasets such as ISCX Tor are necessary to verify the performance of the proposed method. Second, as mentioned earlier, efficiency can vary depending on hardware performance and the dataset. In this paper, we evaluated the method using results presented in other studies; however, for a precise assessment, consistent experimental conditions and preprocessed datasets are necessary. Third, as previously mentioned, the proposed method utilizes eight packets in the flow, which results in a relatively high time to process a single packet. Fourth, the proposed method applies class weights to address the problem of imbalanced data. Although the class weights alleviate the problem of imbalanced data to some extent, the classes are still unevenly distributed. Nevertheless, the proposed method can perform three tasks with one training run and shows high performance and efficiency.
In future research, we plan to perform multi-task classification using diverse datasets. We will assess the effectiveness of our proposed method using identical experimental setups and preprocessed datasets for evaluation. Additionally, we plan to improve the model architecture and preprocessing methods to further enhance the performance and efficiency of the proposed method, including PT.

Figure 1. Overview of the network traffic classification.

Figure 2. Architecture of the proposed method.
Encapsulation refers to the presence or absence of encryption on the target traffic and consists of two classes: VPN and Non-VPN. Category refers to the nature of the application and consists of six classes, excluding web browsing. Application indicates the application used and consists of sixteen classes.

Figure 3. Data composition of the pre-processed data for the three tasks: (a) encapsulation task (two classes), (b) category task (six classes), (c) application task (sixteen classes).

Figure 4. Structure of the proposed MTL.

Figure 6. Learning curve for the training and testing.

Figure 7. Distribution of data before and after applying class weights. (a) Encapsulation task (two classes), (b) category task (six classes), (c) application task (sixteen classes).

Figure 8. Test accuracy curve for weight adjustment. (a) No weight, (b) class weight, (c) task weight, and (d) class and task weights applied.

Table 1. Class information for three tasks in the ISCX 2016 VPN/Non-VPN dataset.

Table 2. Performance for three tasks of ISCX 2016 VPN/Non-VPN classification.
Comparison Results for Task #2: Category

Table 5. Results on efficiency evaluation.