The aim of this section is to present the results of the approach for the four different attacks (DoS, Gear, Fuzzy, RPM) present in the data set, based on the different hyperparameters. Most of the analysis is conducted for the normal traffic estimate and the training phase, with the objective of identifying the optimal hyperparameters (i.e., the hyperparameter values and the feature id) over the entire data set. This objective is the focus of the first three subsections of this section (Section 4.1–Section 4.3). In detail, Section 4.1 evaluates the classification performance of each entropy measure for all four attacks. Section 4.2 provides the precision and recall of each entropy measure for a specific value of the threshold, while changing the window size and the ratio. An example of the obtained numbers of False Positives and False Negatives is also provided. Section 4.3 analyzes the impact of the ratio and the window size on the classification performance of the Dispersion Entropy and of the Approximate Entropy, the latter being identified in the previous subsections as an optimal feature from the classification performance point of view. Then, Section 4.4 provides the results for the detection phase, where the entire data set for each of the four attacks is split into three portions, one per phase, on the basis of the ratio value: (1) the first portion, whose size is the fraction of the entire data set given by the ratio, is used for the normal traffic estimate; (2) half of the remaining portion (one minus the ratio) of the entire data set is used for training; and (3) the other half of this remaining portion is used for the detection phase. Finally, Section 4.5 describes the computing resources used in the analysis and provides the computing time for each of the three phases.
4.1. Evaluation of the Accuracy for Each Entropy Measure
In an initial step, we evaluated how the entropy measures change when an attack is executed. Some entropy measures were found to provide more discriminating power than others. In the proposed approach, this means that the range of values produced by a specific entropy measure is significantly different for legitimate traffic than for malicious traffic. An example is shown in Figure 6a,b, respectively for the Dispersion Entropy and the Approximate Entropy, in the presence of the Gear attack for a specific setting of the hyperparameters. The figures show only a small segment of the overall in-vehicle traffic that was evaluated. The pink (light gray in a black-and-white rendering of this paper) bars mark the set of messages during which the attack is implemented (i.e., malicious traffic), while the curve shows the calculated entropy measure. These two entropy measures have significant discriminating power because their range of values differs markedly between legitimate and malicious traffic: the mean of the entropy measure in the presence of legitimate traffic is quite different from its mean in the presence of malicious traffic. Hence, high detection accuracy is possible even for relatively small values of the threshold.
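To make the role of the threshold concrete, the following Python sketch shows one plausible form of the decision rule. It assumes, consistently with the later remark that the threshold is related to the standard deviation of legitimate traffic, that a window is flagged when its entropy lies more than the threshold times that standard deviation away from the mean entropy of legitimate traffic. The names `window_entropies`, `fit_normal_profile`, `flag_windows`, `entropy_fn` and `k` are hypothetical, and the use of non-overlapping windows is an assumption.

```python
import numpy as np

def window_entropies(series, window, entropy_fn):
    """Split `series` into non-overlapping windows of `window` samples
    (an assumption) and apply the hypothetical `entropy_fn` to each one."""
    n_windows = len(series) // window  # an incomplete trailing window is dropped
    return np.array([entropy_fn(series[i * window:(i + 1) * window])
                     for i in range(n_windows)])

def fit_normal_profile(normal_entropies):
    """Mean and standard deviation of the entropy on legitimate traffic."""
    values = np.asarray(normal_entropies, dtype=float)
    return float(values.mean()), float(values.std())

def flag_windows(entropies, mu, sigma, k):
    """Assumed rule: flag a window when its entropy falls outside mu +/- k*sigma."""
    return np.abs(np.asarray(entropies, dtype=float) - mu) > k * sigma

# Toy usage with illustrative values (not data from this study).
mu, sigma = fit_normal_profile([1.0, 1.1, 0.9, 1.0])
print(flag_windows([1.0, 1.5, 0.95], mu, sigma, k=2.0))  # [False  True False]
```

Any of the entropy measures discussed in this section could be passed as `entropy_fn`; a tighter spread of the legitimate values (smaller `sigma`) lets larger `k` values still separate attack windows.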
Not all entropy measures provide such a clear visual distinction between legitimate and malicious traffic. Hence, an extensive analysis of all 34 entropy measures for the four attacks was performed.
Figure 7a,b show the accuracy of the proposed approach as the threshold varies between 0 and 4 in steps of 0.1, for the DoS attack with a fixed window size and ratio. Because of the large number of features, two figures were created: Figure 7a for the entropy measures with feature id from 1 to 19 and Figure 7b for those with feature id from 20 to 34. The results are consistent with the literature, where a low threshold value leads to limited detection performance. As the threshold approaches 4, the detection accuracy becomes very high, eventually reaching almost 100%. This is also consistent with the literature, because DoS attacks significantly affect the entropy values calculated on the in-vehicle traffic. The figures also show that most of the entropy measures exhibit similar detection performance, with the notable exception of the Dispersion Entropy with c = 4, both with m = 2 (Feature Id = 6 in Figure 7a) and m = 3 (Feature Id = 21 in Figure 7b). A potential explanation of this deviation is that with c = 4 the calculation of the Dispersion Entropy is noisier than with c = 3, thus leading to a divergent behavior.
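The threshold sweep behind Figure 7a,b can be sketched as follows. This is a minimal illustration assuming the same mean plus/minus threshold times standard deviation rule sketched above and window-level labels (True when a window contains attack messages); `accuracy_vs_threshold` and `THRESHOLDS` are hypothetical names.

```python
import numpy as np

def accuracy_vs_threshold(entropies, labels, mu, sigma, thresholds):
    """Accuracy of the assumed mu +/- k*sigma rule for each threshold k.

    entropies: one entropy value per window
    labels:    True for windows that contain attack messages
    """
    entropies = np.asarray(entropies, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    accuracies = []
    for k in thresholds:
        predicted = np.abs(entropies - mu) > k * sigma
        accuracies.append(float(np.mean(predicted == labels)))
    return np.array(accuracies)

# Threshold grid as described for Figure 7: 0 to 4 in steps of 0.1 (41 values).
# Rounding removes the floating-point drift introduced by np.arange.
THRESHOLDS = np.round(np.arange(0.0, 4.0 + 1e-9, 0.1), 1)
```

Running the sweep once per entropy measure produces one accuracy curve per feature id, which is the kind of family of curves plotted in Figure 7a,b.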
Because the accuracy values reported in Figure 7a,b are quite similar, Figure 8 provides a more detailed view of the accuracy obtained for each feature id for three selected threshold values (for this type of attack, the accuracy values are nevertheless very close to each other).
Results similar to those obtained for the DoS attack are also obtained for the Fuzzy attack, as shown in Figure 9a,b, respectively for the entropy measures with feature id from 1 to 19 and from 20 to 34. As for the DoS attack, high threshold values yield a very high detection accuracy, near 100%, and most of the entropy measures perform similarly as the threshold increases. Also as with the DoS attack, the Dispersion Entropy with c = 4 behaves differently from the other entropy measures: it obtains a high detection accuracy for threshold values that are relatively limited (in comparison to the other entropy measures), but eventually converges to the other entropy measures for very high values (i.e., as the threshold approaches 4). For high threshold values, both DoS and Fuzzy attacks can be detected with almost 100% accuracy, which is similar to the results obtained on the same data set with more sophisticated techniques such as Deep Learning [27]. The reason is that both attacks generate traffic that is significantly different from the normal in-vehicle CAN-bus traffic, and entropy measures are able to detect such anomalous behavior if the threshold is large enough.
As for the DoS attack, the accuracy values for each feature id are reported for three selected threshold values in Figure 10 for the Fuzzy attack.
As the following results show, the detection of spoofing attacks is more challenging, because the injected malicious messages are quite similar to legitimate operations (e.g., those related to the functioning of the gear).
This assumption is supported by the results presented in Figure 11a,b for the Gear attack. A first observation is that increasing the threshold up to the limit of 4 does not always lead to the optimal detection accuracy, as some entropy measures reach their optimal value well below 4. These results are significant because they show that the threshold must be chosen appropriately. The second, and more important, observation is that the detection performance differs significantly among the entropy measures. In particular, Approximate Entropy with m = 2 and r = 0.02 or r = 0.03 (Id = 9 and Id = 10, respectively) and Approximate Entropy with m = 3 and r = 0.02 or r = 0.03 (Id = 24 and Id = 25, respectively) reach almost 100% detection accuracy at their optimal threshold value, while the approach based on the other entropy measures is not able to reach a high detection accuracy. The Dispersion Entropy measures reach a higher classification accuracy than the other entropy measures for low threshold values, but then reach a plateau around 90% detection accuracy even when the threshold is increased to the maximum value of 4. Classification based on Sample Entropy provides the worst results for this type of attack, in particular for m = 2. The Shannon Entropy and Renyi Entropy used in the literature [9] are in the middle of the ranking of the entropy measures, and they exhibit an optimal detection accuracy for a threshold value slightly above 2. Note that the Dispersion Entropy has the best accuracy for relatively low threshold values, but then, like many other features, it reaches a peak and its accuracy decreases as the threshold increases. The accuracy reaches a maximum and then decreases for higher threshold values because a larger threshold forces the algorithm to include samples containing CAN-bus messages of the RPM and Gear attacks. Because the threshold is related to the standard deviation of the legitimate in-vehicle CAN-bus traffic, this behavior can be explained by looking again at the examples of Dispersion Entropy and Approximate Entropy shown in Figure 6. In particular, Figure 6a shows the large variation of the Dispersion Entropy values in normal traffic. Hence, larger threshold values may include samples related to attacks, causing the algorithm to lose accuracy as the number of FP may increase. This may explain why the accuracy plot in Figure 11 reaches a maximum and then slowly degrades. On the other hand, Figure 6b shows that the Approximate Entropy values lie in a tight range (i.e., a small standard deviation). Hence, even when the threshold approaches 4, the algorithm can discriminate with high accuracy legitimate samples from samples containing malicious CAN-bus messages.
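For reference, a compact implementation of Approximate Entropy in its standard formulation (Pincus, 1991) is sketched below: Chebyshev distance between templates of length m, self-matches counted, and ApEn = phi(m) − phi(m + 1). It is not necessarily the implementation used in this study. Whether the tolerances r = 0.02 and r = 0.03 quoted above are absolute values or fractions of the standard deviation is not stated here, so this sketch treats `r` as an absolute tolerance (an assumption).

```python
import numpy as np

def approximate_entropy(x, m=2, r=0.2):
    """Approximate Entropy (Pincus, 1991).

    Uses the Chebyshev distance and counts self-matches, so every log
    argument is strictly positive. `r` is an absolute tolerance here.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)

    def phi(mm):
        # All overlapping templates of length mm: shape (n - mm + 1, mm).
        templates = np.array([x[i:i + mm] for i in range(n - mm + 1)])
        # Pairwise Chebyshev distance between templates.
        dist = np.max(np.abs(templates[:, None, :] - templates[None, :, :]), axis=2)
        # Fraction of templates within tolerance r of each template.
        c = np.mean(dist <= r, axis=1)
        return np.mean(np.log(c))

    return phi(m) - phi(m + 1)
```

A regular signal gives values close to zero and an irregular one gives clearly positive values. This is consistent with the observation that Approximate Entropy stays in a tight range on legitimate traffic, although the sketch itself does not demonstrate that.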
The detailed accuracy values for threshold values of 3.0, 3.5 and 4.0 are shown for each feature in Figure 12.
Similar results are obtained for the other spoofing attack, the RPM attack, as shown in Figure 13a,b. The choice of the entropy measure significantly affects the accuracy, with the Fuzzy Entropy measures performing worse than the other entropy measures and the Approximate Entropy providing the best accuracy.
To summarize, the Gear and RPM attacks are more difficult to identify than the DoS and Fuzzy attacks. In addition, the Gear and RPM attacks require tuning, with a careful choice of the entropy measure and of the threshold value, because some entropy measures never reach very high accuracy (e.g., 99%) even for high thresholds, and the highest threshold value does not necessarily provide the optimal detection accuracy. To highlight these results, Table 3 presents the optimal accuracy for each entropy measure and the corresponding threshold value at which it is obtained. Note that these results were obtained with one fixed combination of window size and ratio. Similar results were obtained with the other values of the window size and the ratio, but they are not provided here for lack of space. The impact of the ratio and the window size is investigated in the next subsections.
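The selection summarized in Table 3 (the best accuracy per entropy measure and the threshold at which it occurs) amounts to an argmax over an accuracy grid. The sketch below assumes an accuracy matrix with one row per feature id and one column per threshold value. The function names are hypothetical, and the toy numbers are illustrative only, not values from Table 3.

```python
import numpy as np

def optimal_thresholds(accuracy, thresholds):
    """accuracy: array of shape (n_features, n_thresholds).

    Returns, for each feature, the best accuracy and the threshold that
    achieves it (np.argmax picks the first maximum on ties)."""
    accuracy = np.asarray(accuracy, dtype=float)
    best_idx = np.argmax(accuracy, axis=1)
    best_acc = accuracy[np.arange(accuracy.shape[0]), best_idx]
    best_k = np.asarray(thresholds, dtype=float)[best_idx]
    return best_acc, best_k

def best_feature(accuracy, thresholds, feature_ids):
    """Feature id, threshold and accuracy of the overall best combination."""
    best_acc, best_k = optimal_thresholds(accuracy, thresholds)
    i = int(np.argmax(best_acc))
    return feature_ids[i], float(best_k[i]), float(best_acc[i])

# Toy grid: feature 5 peaks below the largest threshold, feature 10 at it.
acc = [[0.50, 0.90, 0.80],
       [0.60, 0.70, 0.95]]
print(optimal_thresholds(acc, [2.0, 3.0, 4.0]))
print(best_feature(acc, [2.0, 3.0, 4.0], [5, 10]))  # (10, 4.0, 0.95)
```

The first feature in the toy grid illustrates the point made above: its best threshold is not the largest one in the grid.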
As for the previous attacks, Figure 14 shows the detailed accuracy values for threshold values of 3.0, 3.5 and 4.0 for the RPM attack.
4.2. Recall and Precision for the Gear Attack
The aim of this subsection is to analyze how the recall and precision change with the ratio and the window size, for each of the entropy measures, at a specific threshold value.
The bar charts in Figure 15a,b provide, respectively, the recall and precision for the Gear attack for each of the entropy measures, for a fixed threshold and window size, as the ratio varies. Please refer to Table 2 for the description of the entropy measure associated with each Id on the X axis of the figures. Both figures show a separate bar for each value of the ratio. The results confirm the previous accuracy figures (e.g., Figure 11), showing that precision and recall can vary greatly among the entropy measures and that the specific entropy measure must be carefully selected. Figure 15a,b also show that the balance between the size of the training set and the test set affects both metrics, but in particular the precision. Figure 15b shows that a larger training set (i.e., a larger value of the ratio) consistently provides higher precision across all the entropy measures. This result points to an important design decision, as a larger ratio may provide more stable estimates of the mean and standard deviation of the legitimate traffic, supporting a more stable choice of the hyperparameters and improved detection accuracy. In particular, as indicated before, the precision is probably more relevant than the recall in this detection problem, as the goal is to minimize the FP, where an intrusion is wrongly detected as legitimate traffic, thus allowing the attacker to implement the cybersecurity threat. On the other hand, Figure 15a shows that this trend is not the same across all the entropy measures for the recall. For example, the recall obtained with Feature Id = 10 (ApEn, m = 2 and r = 0.03) does not change significantly. In addition, as shown in more detail in Section 4.3, the improvement in classification performance due to the ratio depends both on the entropy measure and on the threshold value. All these factors should therefore be taken into consideration.
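Because precision and recall respond differently to the ratio, it helps to recall how both are derived from the same confusion counts. The minimal sketch below treats windows containing attack messages as the positive class. That convention is an assumption of the sketch, since the positive-class convention used for the figures is not restated in this section; if legitimate traffic is the positive class, the roles of FP and FN swap. The name `detection_metrics` is hypothetical.

```python
def detection_metrics(predicted, actual):
    """Confusion counts, precision, recall and accuracy.

    predicted, actual: sequences of booleans, True = attack window
    (assumed positive class)."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)
    fp = sum(p and not a for p, a in pairs)
    fn = sum((not p) and a for p, a in pairs)
    tn = sum((not p) and (not a) for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # penalized by FP
    recall = tp / (tp + fn) if tp + fn else 0.0     # penalized by FN
    accuracy = (tp + tn) / len(pairs)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "accuracy": accuracy}

print(detection_metrics([True, True, False, False, False],
                        [True, False, True, True, False]))
```

A hyperparameter change that only removes FP raises precision while leaving recall unchanged, which is one way the two metrics can follow different trends across entropy measures.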
In another phase of the study presented in this paper, the impact of the window size was evaluated. As in the previous case, only one specific attack is presented for space reasons. The bar charts in Figure 16a,b provide, respectively, the recall and precision for the Gear attack for each of the entropy measures, for a fixed threshold and ratio, while changing the window size. The window size is another important hyperparameter. A small window size may require more training time, because the data set is segmented into a greater number of windows on which the entropy measure must be calculated. However, it may provide higher detection accuracy, because the CAN-bus messages related to a cybersecurity attack carry a larger relative weight in each window. The latter aspect is confirmed by Figure 16a,b, because the recall is significantly higher for the smaller window size than for the larger ones. On the other hand, the precision is slightly better with larger window sizes.
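The trade-off described above can be checked with simple counting. The sketch assumes non-overlapping windows and a single contiguous burst of injected messages. Both are assumptions made for illustration, and `windows_and_attack_share` is a hypothetical helper. It reports how many windows must be processed and the largest share of attack messages in any window.

```python
def windows_and_attack_share(labels, window):
    """labels: per-message flags, True = injected (attack) message.

    Returns the number of complete non-overlapping windows and the
    largest fraction of attack messages found in a single window."""
    n_windows = len(labels) // window
    shares = [sum(labels[i * window:(i + 1) * window]) / window
              for i in range(n_windows)]
    return n_windows, (max(shares) if shares else 0.0)

# Hypothetical trace: 200 messages with a burst of 10 injected messages.
labels = [False] * 100 + [True] * 10 + [False] * 90
for w in (10, 20, 50):
    print(w, windows_and_attack_share(labels, w))
```

In this toy trace, halving the window size doubles the number of entropy computations, while the injected burst occupies a larger share of the window that contains it. This is the mechanism invoked above to explain the higher recall of smaller windows.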
To complement Figure 16a,b and to provide an independent evaluation of FN and FP, Figure 17a,b provide, respectively, the number of False Positives (FP) and False Negatives (FN) over all the samples.
4.3. Evaluation of Accuracy in Relation to the Ratio and the Window Size at the Variation of the Threshold
This subsection shows the impact of the threshold value for different values of both the ratio and the window size. Two specific entropy measures (Id = 5 and Id = 10) are selected for the Gear attack.
Figure 18a,b provide the plots for Feature Id = 5 (Dispersion Entropy) and Feature Id = 10 (Approximate Entropy), respectively, for different values of the ratio and a window size of 72. Two main observations can be derived from Figure 18a,b. The first is that the optimal threshold value changes considerably with the ratio for both entropy measures (similar results are obtained for the other entropy measures, but they are not displayed here for lack of space). Hence, the combination of threshold and ratio must be carefully identified. The second observation confirms the previous results: the optimal detection accuracy is obtained with high values of the ratio. The larger the portion of the data set used to calculate the mean and standard deviation, the more accurate the detection.
Figure 19a,b provide the graphs for Feature Id = 5 (Dispersion Entropy) and Feature Id = 10 (Approximate Entropy), respectively, for different window sizes and a fixed ratio. In this case, the results show that the impact of the window size can differ across entropy measures, and the optimal values are obtained through a proper combination of window size and entropy measure. In fact, in Figure 19a a smaller window size provides lower detection accuracy than larger windows for all threshold values. In Figure 19b, instead, a small window size provides the best accuracy for most threshold values, apart from values near 4, where larger window sizes are more effective. A potential explanation for this behavior lies in the characteristics of each entropy measure. In particular, Dispersion Entropy requires longer time series to satisfy its condition on the time-series length and provide correct results, while Approximate Entropy can estimate entropy correctly from shorter time series.
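To illustrate why Dispersion Entropy needs longer windows, the sketch below follows the standard definition of Rostaghi and Azami (2016). It maps the samples through the normal cumulative distribution function, quantizes them into c classes, counts the embedding patterns of dimension m, and normalizes the Shannon entropy of the pattern distribution by log(c^m). A commonly quoted recommendation is that the number of possible patterns, c^m, be smaller than the time-series length. That this is the condition referred to in the text is our assumption. The helper names are hypothetical.

```python
import math
from collections import Counter

import numpy as np

def dispersion_entropy(x, m=2, c=3, delay=1, normalize=True):
    """Dispersion Entropy (Rostaghi and Azami, 2016), normal-CDF mapping."""
    x = np.asarray(x, dtype=float)
    n_vectors = len(x) - (m - 1) * delay
    if n_vectors <= 0:
        raise ValueError("time series too short for the embedding")
    mu, sigma = x.mean(), x.std()
    if sigma == 0:
        return 0.0  # constant series: a single dispersion pattern
    # 1) Map each sample to (0, 1) with the normal CDF.
    y = np.array([0.5 * (1.0 + math.erf((v - mu) / (sigma * math.sqrt(2.0))))
                  for v in x])
    # 2) Quantize into classes 1..c.
    z = np.clip(np.round(c * y + 0.5), 1, c).astype(int)
    # 3) Count dispersion patterns of dimension m.
    counts = Counter(tuple(z[i:i + m * delay:delay]) for i in range(n_vectors))
    probs = np.array(list(counts.values()), dtype=float) / n_vectors
    h = float(-np.sum(probs * np.log(probs)))
    return h / math.log(c ** m) if normalize else h

def satisfies_length_condition(window, m, c):
    """Rule of thumb: fewer possible patterns (c**m) than samples."""
    return c ** m < window
```

For example, c = 3 and m = 2 give 9 possible patterns, whereas c = 4 and m = 3 give 64. With only a few dozen samples per window, most of those 64 patterns can occur at most a handful of times, so the estimated distribution is coarse.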
4.4. Detection
This section provides the results for the detection phase. While the training analysis in the previous subsections was conducted on the entire data set, the detection phase is evaluated by splitting in half the remaining part of the data set (a fraction equal to one minus the ratio), i.e., the part not used for the normal traffic estimate (a fraction equal to the ratio). For example, if a ratio of 0.5 is used, the first half of the data set is used for the normal traffic estimate, one quarter for training and one quarter for detection. The calculation was performed for all the different attacks (i.e., DoS, Fuzzy, Gear, RPM), for all the different window sizes and for different values of the ratio.
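The three-way split can be written in a few lines. This sketch assumes that the data set is in chronological order and that the three portions are contiguous, with training before detection. The text suggests this ordering ("the first portion", "the remaining half") but does not state it explicitly. The name `split_phases` is hypothetical.

```python
def split_phases(messages, ratio):
    """Split a chronologically ordered data set into three phases:
    normal-traffic estimate (fraction `ratio`), then training and
    detection, each taking half of the remaining (1 - ratio) portion."""
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be in (0, 1)")
    n = len(messages)
    n_normal = int(round(ratio * n))
    n_train = (n - n_normal) // 2  # any odd leftover goes to detection
    normal = messages[:n_normal]
    train = messages[n_normal:n_normal + n_train]
    detect = messages[n_normal + n_train:]
    return normal, train, detect

normal, train, detect = split_phases(list(range(100)), 0.5)
print(len(normal), len(train), len(detect))  # 50 25 25
```

With a ratio of 0.5 this reproduces the half, quarter and quarter split given in the example above.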
The results of the analysis are provided in Figure 20, while the reported accuracy values for all the attacks are shown in Table 4, together with the optimal feature id and the optimal threshold from the training phase. In particular, Figure 20a–d show the accuracy for the DoS, Fuzzy, Gear and RPM attacks, respectively, as the ratio varies. The results are consistent with those presented in the previous subsections of this section, where lower ratio values can provide relatively low accuracy for the detection of in-vehicle attacks. When the amount of data used for the normal traffic estimate is larger (e.g., ratio values higher than 0.5), the accuracy increases significantly. This trend is similar for all the attacks. The accuracy increases sharply in particular for the Gear and RPM attacks (Figure 20c,d), which are more difficult to detect. Although this is consistent with the other results, it may also be explained by the fact that, for such high ratio values, the driving circumstances were very similar in the training and detection phases. It is then easier for the algorithm to detect attacks, because the optimal features and thresholds used for detection were calculated in similar driving circumstances, thus explaining the very high accuracy. When the training and detection phases are based on the analysis of a relatively large set of data (lower ratio values), the driving circumstances may differ, thus lowering the detection accuracy. Future developments of the research presented in this paper could evaluate statistical analysis methods that identify different optimal features and thresholds for different driving circumstances. Such an analysis could be quite complex, because it must take into consideration the range of different driving circumstances, identify the driving circumstances in which detection is currently executed, and choose the appropriate optimal features and thresholds. This complex analysis is out of the scope of this paper and is reserved for future developments (see Conclusions, Section 5).
Table 4 complements Figure 20 by identifying the optimal feature ids and threshold values from the training phases, which are then used in the detection phase. The results are consistent with the previous subsections, where Approximate Entropy (Feature Id = 24) and Dispersion Entropy (Feature Id = 21) provide optimal results. Shannon Entropy is also the optimal feature for the DoS and Fuzzy attacks. The optimal threshold values are generally high (above 2.9), which is also consistent with the previous results shown in Section 4.1.
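Since Shannon Entropy (and, in the literature, Renyi Entropy) appears among the optimal features, a minimal sketch of both on a window of discrete symbols is given below. The sketch assumes the symbols are the CAN identifiers observed in the window, but this choice of input sequence is an assumption. The identifiers in the usage lines are hypothetical placeholders. Both functions use base-2 logarithms by default.

```python
import math
from collections import Counter

def shannon_entropy(symbols, base=2.0):
    """Shannon entropy of the empirical symbol distribution in a window."""
    n = len(symbols)
    return -sum((c / n) * math.log(c / n, base) for c in Counter(symbols).values())

def renyi_entropy(symbols, alpha=2.0, base=2.0):
    """Renyi entropy of order alpha; reduces to Shannon entropy for alpha = 1."""
    if alpha == 1.0:
        return shannon_entropy(symbols, base)
    n = len(symbols)
    total = sum((c / n) ** alpha for c in Counter(symbols).values())
    return math.log(total, base) / (1.0 - alpha)

# Hypothetical window of CAN identifiers.
window = ["0x1A0", "0x2B0", "0x1A0", "0x2B0"]
print(shannon_entropy(window), renyi_entropy(window, alpha=2.0))
```

A flood of one repeated identifier, as in a DoS attack, concentrates the distribution and lowers both values. Random identifiers, as in a Fuzzy attack, spread the distribution and raise them. This may help explain why these simple measures already perform well on those two attacks.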