Explainable Internet Traffic Classification

Abstract: The problem analyzed in this paper deals with the classification of Internet traffic. In recent years, this problem has attracted renewed interest, as classification of Internet traffic has become essential to perform advanced network management. As a result, many different methods based on classical Machine Learning and Deep Learning have been proposed. Despite the success achieved by these techniques, existing methods fall short because their classification output gives practitioners no information about the criteria that led to a given classification, or about which information in the input data drove the decision. To overcome these limitations, in this paper we focus on an “explainable” method for traffic classification, able to provide practitioners with information about the classification output. More specifically, our proposed solution is based on a multi-objective evolutionary fuzzy classifier (MOEFC), which offers a good trade-off between accuracy and explainability of the generated classification models. The experimental results, obtained over two well-known publicly available data sets, namely, UniBS and UPC, demonstrate the effectiveness of our method.


Introduction
Network traffic classification represents one of the main challenges in network management nowadays. Indeed, Internet Service Providers (ISPs) devote most of their efforts to Internet traffic classification and management. Historically, the Internet traffic classification task was performed primarily for security reasons, as it permits the detection and identification of intrusions and malicious behavior. However, over recent years, the identification of Internet traffic type and workload has become necessary not only for security purposes, but also to perform traffic engineering and to make decisions on policing, traffic shaping, billing, dynamic Quality of Service, and so on. Most of the management techniques are built on top of classification results: as an example, consider billing and accounting, which are only possible if the traffic is first correctly classified. Moreover, attack detection techniques are usually built on top of a traffic classifier. Nonetheless, despite many years of research on the topic, an ultimate solution able to provide "good enough" performance is still under study.
In the literature, several approaches have been proposed to classify IP traffic flows according to the application that generated the traffic. Historically, the most commonly used method is to associate the observed traffic (using flow level data or a packet sniffer) with an application, on the basis of TCP or UDP port numbers [1]. However, port-based classification is inadequate [2], as mapping between ports and applications is not always well defined. As a consequence, in the last decade, research efforts have moved towards classification tools based on Machine Learning (ML) and Artificial Intelligence (AI) algorithms, which rely on statistical features [3].
Among these, Support Vector Machine (SVM) [4] and deep learning techniques [5] have emerged as powerful tools for traffic classification and other network applications, such as intrusion detection [6] and other cyber security applications [7,8]. Indeed, these techniques, and especially SVM, represent an almost de facto standard in the field. Such methods are able to provide very high accuracy values, often by observing just a few traffic statistics computed over the first packets of each flow.
Nonetheless, all of the methods based on machine learning algorithms present a common drawback, as the generated models are seen as black boxes characterized by a low "explainability" level. Indeed, the classification result does not provide practitioners with any information regarding the criteria that have led to the given classification, or what information in the input data makes the models arrive at their decisions. This is usually justified by the fact that the main goal traditionally pursued is to make the model match reality (i.e., to obtain accurate models), without actually caring about explainability.
Nowadays, several new requirements have emerged related to fairness or unbiasedness, privacy, reliability, robustness, causality, and/or trust, which call for systems that can explain their decisions where necessary [9]. Therefore, traffic classifiers, as well as other traffic analysis systems, must be optimized not only for accuracy but also for the other criteria previously listed.
In such a context, a lot of research efforts are nowadays focusing on "explainable" methods (e.g., explainable artificial intelligence), where explainability "encompasses ML/AI systems for opening black box models, for improving the understanding of what the models have learned and/or for explaining individual predictions" [9]. In this specific area, a recent work in [10] clearly indicates that exploiting the synergy between Fuzzy Rule-Based Systems (FRBSs) and Evolutionary Algorithms is one of the most straightforward ways of combining accuracy and interpretability/explainability in machine learning-based tools.
For such a reason, in this work, which significantly extends the preliminary results presented in [11], we propose a traffic classification approach based on multi-objective evolutionary fuzzy classifiers (MOEFCs) [12,13]. Specifically, MOEFCs deal with the application of Multi-Objective Evolutionary Algorithms (MOEAs) [14] for generating a collection of Fuzzy Rule-Based Classifiers (FRBCs) characterized by different trade-offs between their accuracy and their explainability level [15]. We recall that FRBCs adopt (i) a rule base composed of linguistic IF-THEN rules and (ii) a database which contains the description of the linguistic terms adopted for the fuzzy discretization of each input variable. A specific inference mechanism is adopted for taking a decision whenever a new input is presented to the system.
In this contribution, we exploit the PAES-RCS algorithm, in which the accuracy is calculated in terms of percentage of correctly classified flows of internet traffic. As regards the explainability level, it is calculated in terms of total rule length (TRL), namely, the total number of conditions taken into consideration in the whole rule base. Low values of TRL are associated with rule bases which contain a reduced number of simple rules (i.e., rules in which a low number of conditions are adopted in their antecedent). Note that PAES-RCS has been successfully exploited in a number of recent contributions on real-world applications [15,16].
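As a minimal sketch of the two quantities described above, the following Python fragment computes the total rule length of a small rule base and the accuracy of a set of predictions. The rule encoding, the feature names, and the linguistic terms are hypothetical and chosen only for illustration; they are not taken from the models discussed in this paper.

```python
# Hypothetical rule base: each rule lists only the conditions that appear
# in its antecedent (feature name -> linguistic term) plus a class label.
rule_base = [
    {"antecedent": {"mean_pkt_size": "Low", "duration": "High"}, "label": "Mail"},
    {"antecedent": {"mean_pkt_size": "Very High"}, "label": "BitTorrent"},
    {"antecedent": {"duration": "Low", "pkt_count": "Medium", "mean_iat": "Low"},
     "label": "Skype"},
]


def total_rule_length(rules):
    """TRL: total number of antecedent conditions over the whole rule base."""
    return sum(len(rule["antecedent"]) for rule in rules)


def accuracy(predicted, actual):
    """Fraction of flows whose predicted class equals the true class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


print(total_rule_length(rule_base))  # 2 + 1 + 3 conditions
print(accuracy(["Mail", "Skype", "Mail"], ["Mail", "Skype", "Skype"]))
```

A rule base with fewer and shorter rules yields a lower TRL, which is the sense in which TRL is used here as a proxy for explainability.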
To evaluate and validate the proposed approach, we have used two publicly available data sets, namely, UniBS and UPC, showing that our system can achieve nearly optimal performance, while simultaneously guaranteeing the explainability of the classification results. We also compared the results achieved by MOEFCs with the ones achieved by two classical ML-based classification algorithms, namely, SVM and Decision Trees. SVM algorithms have been chosen as they represent the de facto standard among machine learning algorithms commonly adopted for solving the Internet traffic classification problem. However, SVM models are characterized by a very low explainability level. Decision trees, in turn, represent a category of interpretable models among classical machine learning classifiers, as a set of decision rules can be extracted from them. However, their rules are not linguistic and the final models are often described by a large number of parameters, namely, the number of nodes and leaves. Thus, the explainability level of decision trees is also often compromised. By contrast, the proposed approach, based on MOEFCs, generates models characterized by good trade-offs between their accuracy and their explainability.
The remainder of the paper is organized as follows. In Section 2, we discuss some notable related works, while in Section 3.1 we describe the data sets used. Then, in Section 3.2 we introduce the proposed explainable traffic classification approach. The achieved results are shown in Section 4. Finally, Section 5 concludes the paper with some final remarks and future work.

Related Work
Research on traffic classification has been quite prolific over the years and, as a consequence, many works have been written on the topic. Therefore, the aim of this section is not to provide the reader with a comprehensive review of the related works (for which we refer the reader to the surveys on the topic), but just to point out some works that are significant for our specific proposal.
Machine Learning techniques were first applied to network traffic classification in 1994 [17], and since then many different methods have been proposed, as detailed in some recent surveys [3,18].
Among the many proposals, particular interest has been raised by classifiers based on Support Vector Machine (SVM). One of the first significant works on the application of SVM to traffic classification is [19], where the authors apply one of the approaches to solving multi-class problems with SVMs and describe a simple optimization algorithm that allows the classifier to perform correctly with as few as a few hundred training samples. Since then, many other works have proposed SVM-based methods [4,20–23] and, as a result, SVM is nowadays considered a de facto standard in the field. Nonetheless, as already discussed, all of these works propose methods based on black-box models that do not provide any information about the classification criteria.
As far as Fuzzy Rule-Based Classifiers (FRBCs) are concerned, given their ability to deal with vague and noisy data and to explain how the classification task is performed, they have been widely exploited in several contexts, such as medical diagnosis applications [24], industrial applications [25], and Internet of Things [26]. Over the years, several techniques to generate and optimize the structure of FRBCs have been proposed, often maximizing accuracy without taking into consideration how this maximization affects the FRBC explainability. Only in the last decade have researchers also focused their attention on the explainability aspects of FRBCs [27]. As accuracy and explainability are conflicting objectives, the generation of the FRBS structure has been modeled as a multi-objective optimization problem. Multi-objective evolutionary algorithms (MOEAs) have been successfully employed to tackle this optimization problem, and the term multi-objective evolutionary fuzzy systems (MOEFSs) has been coined [12,28] to identify FRBSs generated by MOEAs. Since then, many papers have proposed the use of MOEFSs in classification problems [16,29–33].
In the specific context of traffic classification, there are some works [34,35] that propose the use of fuzzy models. The work in [34] discusses the application of hybrid models in which fuzzy theory elements are included in a neural network architecture. As regards the contribution discussed in [35], the authors propose an approach which combines decision trees and fuzzy membership functions for dealing with noisy and vague data. Note that both works include in their experimental analysis a comparison with traffic classification methods based on SVM. Nonetheless, to the best of our knowledge, our work is the first to propose and systematically evaluate the application of MOEFCs to generate explainable models for network traffic classification.

Materials and Methods
In this section, we first describe the data sets used to evaluate and validate our study, and then we detail the proposed Traffic Classification System.

Data Sets
We have used two distinct well-known and publicly available data sets: UniBS and UPC.

UniBS Data Set
The UniBS data set [36] is made of traffic collected in the University of Brescia campus network during three consecutive days (from 30 September 2009 to 1 October 2009), anonymized with the Crypto-PAn tool [37]. The dataset has been employed recently in the contributions discussed in [38,39].
The data set is coupled with a log file containing, for each flow, the following information:

<timestamp> : <IP src> : <IP dst> : <transport port src> : <transport port dst> : <DPI verdict(s)> : <application name> : <transport protocol>

In this work, we have considered the classes corresponding to the following applications: Mail, Skype, Firefox, Safari, BitTorrent, and Amule. Table 1 reports the number of instances per class, considering flows made of at least three, five, and ten packets.

UPC Data Set
The UPC data set [40] is a subset (about 5.23 GB) of the full-payload traffic traces used in [41], collected at the Universitat Politècnica de Catalunya over 66 days (from 25 February 2013 to 1 May 2013). These data have also been used recently in the experiments on Internet traffic classification carried out in [42,43].
As for the UniBS data set, a log file accompanying the data set contains per-flow information, including a process_name field that corresponds to the application that generated the flow. Table 2 reports the number of instances per class, considering flows made of at least three, five, and ten packets.

The Proposed Traffic Classification System
In the following, we detail the proposed approach for generating explainable traffic classification models. The diagram depicted in Figure 1 represents the schema of the proposed Internet traffic classification system. The data (both the Training Internet Flow (T_IF) and the Real-Time Internet Flow (RT_IF)) are preprocessed through a Feature Extraction strategy (discussed in Section 3.2.2), which generates a representation of the data by means of the chosen features. Note that while T_IF is composed of historical data collected for training the classification model, RT_IF, in a real-world application, is continuously extracted from a network. The representation of the training data (T_IF representation) is used by the PAES-RCS algorithm to build a collection of FRBCs, namely, a collection of XAI classification models. Each model is characterized by a specific trade-off between accuracy and explainability; therefore, the final user can select the one that best satisfies her/his requirements. This model (Selected XAI Model in the figure) is then applied to the representation of the Real-Time Internet Flow (RT_IF representation) to classify it. In the following, we first describe the adopted multi-objective evolutionary learning scheme for generating FRBCs. Then, we describe the two feature extraction strategies that we have tested as the preprocessing stage of the overall traffic classification task.

PAES-RCS Method
Evolutionary fuzzy systems, which consist of evolutionary algorithms applied to the design of fuzzy systems, are one of the greatest advances within the area of computational intelligence.
Among these, multi-objective evolutionary fuzzy classifiers (MOEFCs) are characterized by a good trade-off between accuracy and explainability level [12,16]. Therefore, these models have been widely used for approaching classification problems. Indeed, MOEFCs deal with the design of fuzzy rule-based classifiers (FRBCs) by means of multi-objective evolutionary algorithms: during the evolutionary design process, both the accuracy and the explainability level of the models are concurrently optimized. At the end of the design process, a set of classification models, characterized by different trade-offs between accuracy and interpretability (Pareto front approximation), is available to the final user, who can select the most suitable solution for the problem domain at hand. The final models are usually characterized by compact fuzzy rules, namely, linguistic IF-THEN rules, which can describe the classification process in an explainable way.
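To make the notion of a Pareto front approximation concrete, the sketch below filters a list of candidate models down to the non-dominated ones, assuming each model is summarized by an (accuracy, TRL) pair in which accuracy is maximized and TRL is minimized. The candidate values are invented for illustration, and the filter is a plain dominance check, not the archive management used by the actual evolutionary algorithm.

```python
def dominates(a, b):
    """True if a is no worse than b on both objectives and strictly better on one.

    Each solution is an (accuracy, trl) pair: higher accuracy is better,
    lower total rule length is better.
    """
    acc_a, trl_a = a
    acc_b, trl_b = b
    no_worse = acc_a >= acc_b and trl_a <= trl_b
    strictly_better = acc_a > acc_b or trl_a < trl_b
    return no_worse and strictly_better


def pareto_front(solutions):
    """Keep only the solutions that no other solution dominates."""
    return [s for s in solutions if not any(dominates(o, s) for o in solutions)]


# Illustrative (accuracy, TRL) pairs, not results from this paper.
candidates = [(0.88, 40), (0.86, 25), (0.85, 30), (0.80, 10), (0.79, 12)]
front = pareto_front(candidates)
print(front)
```

In this toy example, (0.85, 30) is discarded because (0.86, 25) is both more accurate and simpler, which is exactly the kind of trade-off the final user inspects when choosing a model.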
An FRBC basically includes a rule base (RB), a database (DB) containing the definition of the fuzzy sets used in the RB, and an inference engine. RB and DB comprise the knowledge base of the rule-based system. Let $X = \{X_1, \dots, X_F\}$ be the set of input variables and $X_{F+1}$ be the output variable of the classifier. Let $U_f$, with $f = 1, \dots, F$, be the universe of the $f$th input variable $X_f$.
With the aim of determining the class of a given input vector, we adopt an RB composed of $M$ rules expressed as

$$R_m: \text{IF } X_1 \text{ is } A_{1,j_{m,1}} \text{ AND } \dots \text{ AND } X_F \text{ is } A_{F,j_{m,F}} \text{ THEN } X_{F+1} \text{ is } C_{j_m} \text{ with } RW_m$$

where $A_{f,j_{m,f}}$ is the fuzzy set used for variable $X_f$ in the $m$th rule, $C_{j_m}$ is the class label associated with the $m$th rule, and $RW_m$ is the rule weight, i.e., a certainty degree of the classification in the class $C_{j_m}$ for a pattern belonging to the subspace delimited by the antecedent of rule $R_m$. Usually, a purposely defined fuzzy set $A_{f,0}$ ($f = 1, \dots, F$) is considered for all the $F$ input variables. This fuzzy set, which represents the "do not care" condition, is defined by a membership function equal to 1 on the overall universe. The term $A_{f,0}$ allows generating rules that contain only a subset of the input variables.
A specific reasoning method employs the information it receives from the RB to determine the class label for a given input pattern. We adopt the maximum matching as reasoning method (see [16] for details).
Concerning the DB, we adopted triangular fuzzy sets: each fuzzy set $A_{f,j}$ is identified by the tuple $(a_{f,j}, b_{f,j}, c_{f,j})$, where $a_{f,j}$ and $c_{f,j}$ correspond to the left and right extremes of the support, and $b_{f,j}$ to the core. In particular, in the experiments, we use strong fuzzy partitions.
In order to concurrently design the RB and tune the parameters of the fuzzy sets, we adopt the PAES-RCS algorithm introduced in [44]. The multi-objective evolutionary learning scheme is based on the (2 + 2)M-PAES, an MOEA that has been successfully employed in the context of MOEFSs in recent years. We concurrently optimize two objectives: the first objective considers the interpretability of the RB, calculated as the total rule length (TRL), that is, the number of propositions used in the antecedents of the rules contained in the RB; the second objective takes into account the accuracy, assessed in terms of classification rate.
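The knowledge base and the reasoning method described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: inputs are normalized to [0, 1], the strong partition is uniform, the matching degree of a rule is the product of the memberships (the t-norm is an assumption here, see [16] for the actual reasoning method), and the two rules and their weights are hypothetical.

```python
def membership(x, j, num_sets):
    """Membership of x in [0, 1] to the j-th triangular set of a uniform strong partition.

    Sets are numbered 1..num_sets with evenly spaced cores; j = 0 is the
    "do not care" set, whose membership is 1 on the whole universe.
    """
    if j == 0:
        return 1.0
    step = 1.0 / (num_sets - 1)   # distance between adjacent cores
    core = (j - 1) * step
    return max(0.0, 1.0 - abs(x - core) / step)


def classify(x, rules, num_sets):
    """Maximum matching: return the label of the rule with the highest
    association degree (matching degree times rule weight), or None if no rule fires."""
    best_label, best_degree = None, 0.0
    for antecedent, label, weight in rules:
        matching = 1.0
        for value, j in zip(x, antecedent):
            matching *= membership(value, j, num_sets)  # product t-norm (assumed)
        degree = matching * weight
        if degree > best_degree:
            best_label, best_degree = label, degree
    return best_label


# Hypothetical rules over three features and five fuzzy sets per feature:
# (fuzzy set index per feature, class label, rule weight).
rules = [
    ((1, 0, 5), "Mail", 0.9),    # IF X1 is set 1 AND X3 is set 5 THEN Mail
    ((4, 2, 0), "Skype", 0.8),   # IF X1 is set 4 AND X2 is set 2 THEN Skype
]

print(classify((0.1, 0.6, 0.9), rules, 5))
print(classify((0.7, 0.3, 0.5), rules, 5))
```

Because the partition is strong, the memberships of any input to the sets of one variable sum to 1, so at most two adjacent sets are active at a time; this keeps the number of firing rules small and the decision easy to trace back to a single rule.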
In the learning scheme, we first generate an initial RB and then select, during the evolutionary process, the most relevant rules and their conditions. Moreover, we concurrently tune the parameters of the fuzzy sets by using a mapping strategy based on a piecewise linear transformation [44]. Once an initial strong fuzzy partition has been defined for each input variable, we extract the initial set of candidate fuzzy rules from a decision tree: in particular, in this work, we use a recent algorithm, discussed in [45], for generating multi-way fuzzy decision trees. One rule is then created for each path from the root to a leaf node.
In PAES-RCS, each solution is codified by a chromosome $C$ composed of two parts $(C_R, C_T)$, which define, respectively, the RB and the positions of the representatives of the fuzzy sets, namely, the cores, in the transformed space.
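The step "one rule for each path from the root to a leaf" can be illustrated with a simple recursive traversal. The sketch below uses a plain multi-way tree stored as nested dictionaries as a stand-in; it is not the fuzzy decision tree algorithm of [45], and the tree, the feature indices, and the class labels are hypothetical. Features that are not tested along a path receive index 0, i.e., the "do not care" condition.

```python
def extract_rules(node, num_features, path=None):
    """Return one (antecedent, label) pair per root-to-leaf path.

    Internal nodes: {"feature": index, "children": {fuzzy_set_index: subtree}}.
    Leaves: {"label": class_label}.
    """
    path = dict(path or {})
    if "label" in node:
        # Untested features keep index 0 ("do not care").
        antecedent = tuple(path.get(f, 0) for f in range(num_features))
        return [(antecedent, node["label"])]
    rules = []
    for fuzzy_set, child in node["children"].items():
        branch = dict(path)
        branch[node["feature"]] = fuzzy_set
        rules.extend(extract_rules(child, num_features, branch))
    return rules


# Hypothetical tree over three features.
tree = {"feature": 0, "children": {
    1: {"label": "Mail"},
    2: {"feature": 2, "children": {1: {"label": "Skype"},
                                   3: {"label": "BitTorrent"}}},
    3: {"label": "BitTorrent"},
}}

rules = extract_rules(tree, 3)
for antecedent, label in rules:
    print(antecedent, "->", label)
```

The resulting candidate rules are then the pool from which the evolutionary process selects rules and conditions.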
Let $J_{DT}$ and $M_{DT}$ be the initial set of candidate rules generated by the decision tree and the number of rules of this RB, respectively. In order to generate compact and interpretable RBs, we allow the RB of a solution to contain at most $M_{max}$ rules.
In order to generate the offspring populations, we exploit both crossover and mutation. We separately apply the one-point crossover to $C_R$ and the BLX-α crossover, with α = 0.5, to $C_T$. As regards the mutation, we apply two distinct operators for $C_R$ and one operator for $C_T$. More details regarding the mating operators and the steps of PAES-RCS can be found in [16,44].
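For readers unfamiliar with the two crossover operators named above, the following sketch shows their textbook forms in Python. The chromosome encodings used here (a flat list of integers and a flat list of real-valued genes) are simplified, hypothetical stand-ins for the actual $C_R$ and $C_T$ encodings described in [16,44], and no repair or clipping of the offspring is performed.

```python
import random


def one_point_crossover(parent1, parent2, rng):
    """Swap the tails of two equal-length chromosomes at a random cut point."""
    point = rng.randint(1, len(parent1) - 1)
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2


def blx_alpha_crossover(parent1, parent2, rng, alpha=0.5):
    """BLX-alpha: sample each gene uniformly from the parents' interval,
    extended on both sides by alpha times the interval width."""
    child = []
    for g1, g2 in zip(parent1, parent2):
        low, high = min(g1, g2), max(g1, g2)
        spread = alpha * (high - low)
        child.append(rng.uniform(low - spread, high + spread))
    return child


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
print(one_point_crossover([1, 0, 1, 1], [0, 0, 0, 0], rng))
print(blx_alpha_crossover([0.2, 0.5], [0.4, 0.5], rng))
```

With α = 0.5, a gene whose parents take the values 0.2 and 0.4 is sampled from the interval [0.1, 0.5], so offspring can explore slightly beyond the region spanned by the parents.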

Feature Extraction
The feature extraction phase has been designed and implemented so as to process real-time traffic captured by means of the pcap libraries. First of all, the traffic is reconstructed to identify the flows, defined by the 5-tuple: source and destination IP addresses, source and destination ports, and protocol (note that, in this work, we consider bidirectional flows). Then, each 5-tuple is transformed into a vector of features to be used as input of the FRBC, which is in charge of estimating the type of traffic.
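The grouping of packets into bidirectional flows can be sketched as follows. The packet records are hypothetical dictionaries used only for illustration (they are not the output format of any pcap library), and sorting the two endpoints is one simple way, assumed here, to make the flow key independent of the packet direction.

```python
from collections import defaultdict


def flow_key(src_ip, src_port, dst_ip, dst_port, protocol):
    """Direction-independent key: both directions of a conversation
    map to the same (endpoint, endpoint, protocol) tuple."""
    endpoint_a, endpoint_b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return (endpoint_a, endpoint_b, protocol)


def group_packets(packets):
    """Group packet records into bidirectional flows, preserving arrival order."""
    flows = defaultdict(list)
    for pkt in packets:
        key = flow_key(pkt["src_ip"], pkt["src_port"],
                       pkt["dst_ip"], pkt["dst_port"], pkt["proto"])
        flows[key].append(pkt)
    return flows


# Hypothetical packet records.
packets = [
    {"src_ip": "10.0.0.1", "src_port": 5000, "dst_ip": "10.0.0.2", "dst_port": 80, "proto": "TCP", "size": 60},
    {"src_ip": "10.0.0.2", "src_port": 80, "dst_ip": "10.0.0.1", "dst_port": 5000, "proto": "TCP", "size": 1500},
    {"src_ip": "10.0.0.3", "src_port": 4000, "dst_ip": "10.0.0.2", "dst_port": 53, "proto": "UDP", "size": 80},
]

flows = group_packets(packets)
for key, pkts in flows.items():
    print(key, len(pkts))
```

Each resulting list of packets is what the feature extraction stage would then turn into a feature vector.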
In this work, we have experimented with two distinct types of traffic features:
• Statistical features: the flow is described by a set of 21 statistical values, reported in Table 3. It is important to highlight that, in this work, such features have only been computed for flows made of five or more packets.
• Composite features: the flow is described by an array $x \in \mathbb{R}^{3H-1}$, where $H$ is the number of analyzed flow packets, of higher granularity (i.e., packet level) features [34].

Experimental Results
In this section, we present the results of the experimental tests carried out to validate and evaluate our proposal. Performance has been measured in terms of overall accuracy (ACC) and of the per-class true positive rate (TPR) and false positive rate (FPR). Note that in all the tests, we have adopted a k-fold cross-validation approach, with k = 5.
In the following, to allow a proper comparison of our system against state-of-the-art classifiers, we present, at first, the performance achieved by SVM and C4.5 decision tree, used as benchmarks, and then the results obtained by our system.
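As a reference for how the figures reported in the following tables can be read, the sketch below computes the overall accuracy and the per-class true positive rate and false positive rate from a confusion matrix, using their usual one-vs-rest definitions. The matrix is invented for illustration and does not correspond to any table in this paper.

```python
def accuracy(confusion):
    """Overall accuracy: correctly classified flows over all flows."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def per_class_rates(confusion):
    """Per-class (TPR, FPR), with rows as true classes and columns as predictions."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    rates = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # class c flows sent elsewhere
        fp = sum(confusion[r][c] for r in range(n)) - tp   # other flows labeled as c
        tn = total - tp - fn - fp
        rates.append((tp / (tp + fn), fp / (fp + tn)))
    return rates


# Illustrative 3-class confusion matrix.
confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 1, 9],
]

print(accuracy(confusion))
for tpr, fpr in per_class_rates(confusion):
    print(round(tpr, 3), round(fpr, 3))
```

Note that the sketch assumes every class has at least one sample and that no class covers the whole data set; otherwise the corresponding ratio is undefined.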

SVM Classifier
As far as the SVM classifier is concerned, we have used the implementation available in the WEKA Toolkit (https://www.cs.waikato.ac.nz/ml/weka/ accessed on 20 May 2021), based on the Sequential Minimal Optimization training algorithm [46]. The parameters of the algorithm have been set as [...]
In Table 4, we show the achieved accuracy on both data sets and for each feature extraction method. Moreover, Tables 5 and 6 show the results in terms of TPR and FPR for each class of the UniBS and UPC data sets, respectively. Regarding the accuracy, the best performance is achieved, for both data sets, adopting composite features and H = 10 (ACC = 0.874 over the UPC data set and ACC = 0.896 over the UniBS data set). In both cases, the SVM classifier achieves better performance with statistical features than with composite features and H = 3, and, in the case of the UniBS data set, even than with H = 5.
Finally, for a deeper analysis, Table 7 reports the confusion matrix for the UniBS case with composite features and H = 10 (note that, for the sake of brevity, we do not show the confusion matrices for all the cases, as they would not add any significant insight). Note that, for the considered case, the worst results are obtained for Skype, which is often classified as Amule. Such a result can be explained by the fact that the two applications have a similar architecture.

C4.5 Decision Tree
The C4.5 decision tree has been taken into consideration because of the partial explainability of its results. Indeed, depending on the size of the tree and on the number of leaves, the classification results can be accompanied by an analysis of the criteria that lead to a given decision. In this work, we have used the J48 classifier available in the WEKA toolkit. As regards the parameters of the decision tree, we used the default parameters suggested in WEKA. In particular, the pruning of the decision tree is activated with a confidence parameter value of 0.25. In addition, the minimum number of instances per leaf is set equal to 2.
Similarly to the SVM case, Table 8 shows the achieved overall performance, while Tables 9 and 10 report the results in terms of TPR and FPR for each class of the UniBS and UPC data sets, respectively. Note that in this case, the best accuracy is obtained on both data sets with the statistical features (ACC = 0.981 over the UPC data set and ACC = 0.969 over the UniBS data set). Furthermore, differently from the SVM case, we can notice that C4.5 does not present any critical result in terms of classes that are almost always unrecognized.
For allowing a deeper analysis of the achieved results, in Table 11 we report the confusion matrix for the UPC data set and statistical features. Note that the C4.5 offers almost optimal results over all the classes in this case. Indeed, the only misclassifications occur with the most similar classes, that is when considering Chrome and Firefox.

PAES-RCS
As for the previous cases, for the proposed method we have also run a 5-fold cross-validation, and for each fold we have run three trials (each with a different seed of the random number generator). The algorithm has been run with the parameters indicated in Table 12, and for each fold and each trial we have generated an approximation of the optimal Pareto front. In the following, we report the average results of three representative solutions ordered according to decreasing accuracy. Specifically, as discussed in [44], we sorted the FRBCs in each Pareto front approximation in descending order of accuracy. Then, we extracted the First (the most accurate and the least explainable), the Median, and the Last solution (the least accurate and the most explainable).

Table 12. Parameter values used for PAES-RCS.
• Total number of fitness evaluations: 50,000
• $P_{C_R}$, probability of applying the crossover operator to $C_R$: 0.1
• $P_{C_T}$, probability of applying the crossover operator to $C_T$: 0.5
• $P_{M_{RB1}}$, probability of applying the first mutation operator to $C_R$: 0.1
• $P_{M_{RB2}}$, probability of applying the second mutation operator to $C_R$: 0.7
• $P_{M_T}$, probability of applying the mutation operator to $C_T$

Similarly to what has been done so far, in Tables 13 and 14 we present the overall performance over the UniBS and the UPC data sets, respectively, in terms of accuracy, number of rules (Rules), and total rule length (TRL). From the tables we can see that our system is able to achieve nearly optimal results, with an accuracy close to 0.9 in both cases. Then, in Tables 15 and 16, we present the results in terms of TPR and FPR for each class of the UniBS and UPC data sets, respectively. Note that, except with composite features and H = 3, there is no class that is mostly unrecognized (as happens for the SVM classifier). Moreover, it is also interesting to see that, differently from the C4.5 classifier, the proposed method is able to correctly classify Chrome, while it presents some issues in the classification of Firefox.
For a deeper analysis, Tables 17 and 18 report the confusion matrices for the UniBS and UPC cases with statistical features, respectively. As expected, these results highlight that, in the UPC case, the most critical class is Firefox, which is often classified as Chrome.
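The extraction of the three representative solutions from each Pareto front approximation, described in this section, can be sketched as follows. The front used here is invented for illustration, and the choice of the middle element for fronts with an even number of solutions is an assumption of this sketch.

```python
def representative_solutions(front):
    """Sort a Pareto front approximation by decreasing accuracy and pick
    the First (most accurate), Median, and Last (most explainable) solutions."""
    ordered = sorted(front, key=lambda s: s["accuracy"], reverse=True)
    return {
        "First": ordered[0],
        "Median": ordered[len(ordered) // 2],
        "Last": ordered[-1],
    }


# Illustrative front: higher accuracy comes with a higher total rule length.
front = [
    {"accuracy": 0.80, "trl": 10},
    {"accuracy": 0.88, "trl": 40},
    {"accuracy": 0.86, "trl": 25},
    {"accuracy": 0.84, "trl": 18},
    {"accuracy": 0.82, "trl": 14},
]

picked = representative_solutions(front)
for name, solution in picked.items():
    print(name, solution)
```

Averaging each of the three picks over folds and trials then gives one summary row per representative solution.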

Comparison among the Different Classification Models
To easily compare the achieved results, Table 19 reports the best results, in terms of accuracy, for each classifier on the two considered data sets, for both the training set and the test set.
Starting by comparing the performance of our method with that of SVM on the test set, it is easy to see that our method achieves more or less the same accuracy as SVM, with a maximum accuracy of 0.875 (against 0.874) on the UniBS data set, and 0.886 (against 0.896) over the UPC data set. On the contrary, considering again the test set, our method is outperformed, in terms of accuracy, by C4.5 over both data sets.
Similar results are obtained on the training set. Nevertheless, in this case, note that the overfitting is very high for the SVM algorithm. The decision tree and the proposed PAES-RCS algorithm also suffer from this problem, but for them the phenomenon is less evident.
Nonetheless, as already discussed, our method is characterized by a high level of explainability. To quantify such an aspect, in Table 20 we report the complexity of our method (in terms of number of rules and TRL) and of the C4.5 algorithm (in terms of number of leaves and tree dimension). Note that we do not take SVM into consideration in this analysis, as it is well known that SVM must be considered as a "black box".
As can be observed from the table, the higher accuracy of C4.5 is paid for with a much higher complexity, which directly results in a lower explainability. Note that, as far as complexity is concerned, for our proposed method we have considered the "First" case, which has a much higher complexity, but only a slightly better accuracy, than the "Median" case. Therefore, our method is even more convenient when considering the "Median" case.
To further clarify the level of explainability of the proposed method, we finally analyze some examples of classification rules (created for the UniBS data set). In Figure 3, we show a generic strong fuzzy partition that has been used for each variable in the experiment. The fuzzy partition consists of seven fuzzy sets, labeled with linguistic values ranging from Very Low (VL) to Very High (VH).
Given such fuzzy sets, the resulting classification rules are expressed in linguistic terms. It is clear that such rules, being linguistic rules, can be easily read and understood by an operator.
For the sake of completeness, we highlight that both UniBS and UPC are data sets that exhibit some level of imbalance. Therefore, we have applied a set of re-balancing techniques, but the obtained results did not show appreciable improvements. This is probably due to the fact that the level of imbalance is not very high. Indeed, as can be seen from the tables and the confusion matrices discussed above, we have verified that poor results on specific classes are not due to the imbalance level but rather to the adopted feature extraction procedure and/or to the classification model selected. Due to space reasons and to their scarce relevance, we have not reported the results achieved adopting a re-balancing step of the training set.

Conclusions and Future Work
The development of "explainable" classification methods is attracting a lot of research efforts in several fields, such as network monitoring. This is highly justified by the newly emerged requirements in terms of fairness or unbiasedness, privacy, reliability, robustness, causality, and/or trust, which make the standard methods inadequate. For this reason, in this paper, we have proposed a traffic classification tool based on multi-objective evolutionary fuzzy classifiers.
Our proposal has been validated and evaluated over two well-known publicly available traffic data sets (namely, UniBS and UPC) and has demonstrated good performance in terms of both accuracy and explainability. Indeed, the achieved results show that our method achieves an accuracy comparable to that of the de facto standard method (i.e., SVM), while offering a much higher level of explainability. Moreover, the proposed method is also able to offer a better accuracy-explainability trade-off than the C4.5 classifier, whose very high accuracy is paid for with a very low level of explainability.
The main limitation of the proposed approach, based on XAI models for internet traffic classification, regards the fact that it may suffer from the "concept drift issue". Indeed, if a new set of internet applications appears in the monitored network, the system will not be able to identify it. This is due to the fact that the traffic flows associated with the new applications have never been seen by the XAI models during the training stage. This means that the models should be retrained or an incremental learning algorithm should be adopted for updating in real time the parameters of the models (i.e., the rules). This issue, not trivial at all, represents a hot research topic that will be considered in future works.
Author Contributions: Authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.