Intrusion Detection for in-Vehicle Communication Networks: An Unsupervised Kohonen SOM Approach

Abstract: The diffusion of embedded and portable communication devices on modern vehicles entails new security risks since in-vehicle communication protocols are still insecure and vulnerable to attacks. Increasing interest is being given to the implementation of automotive cybersecurity systems. In this work we propose an efficient and high-performing intrusion detection system based on an unsupervised Kohonen Self-Organizing Map (SOM) network, to identify attack messages sent on a Controller Area Network (CAN) bus. The SOM network found a wide range of applications in intrusion detection because of its features of high detection rate, short training time, and high versatility. We propose to extend the SOM network to intrusion detection on in-vehicle CAN buses. Many hybrid approaches were proposed to combine the SOM network with other clustering methods, such as the k-means algorithm, in order to improve the accuracy of the model. We introduced a novel distance-based procedure to integrate the SOM network with the k-means algorithm and compared it with the traditional procedure. The models were tested on a car hacking dataset concerning traffic data messages sent on a CAN bus, characterized by a large volume of traffic with a low number of features and highly imbalanced data distribution. The experimentation showed that the proposed method greatly improved detection accuracy over the traditional approach.


Introduction
The automotive sector has been undergoing a radical transformation in recent years. Vehicles' cyber-physical systems are partially or totally controlled by software run by electronic devices increasingly interconnected with the outside world through networks of various types [1]. A number of initiatives concerning smart mobility [2] and autonomous driving are, indeed, experiencing increasing development in urban areas and smart city contexts [3,4].
Automatic systems to maintain the lane, cruising speed, and movement in the queue; to park automatically; and to check the state of attention and sobriety of the driver [5], are just some of the advantages. At the same time, these advantages have significantly increased the attack surface. There are several access points [6] that an attacker can use to try to compromise the security of the vehicle: connections to smartphones; USB inputs; the mobile network to receive information, transmit data, and make calls to external services [7,8]; and Wi-Fi connections that can be used to connect other mobile devices on board the vehicle. Those are just some of the known channels [9], and in this scenario continuous software improvements [10,11] become necessary in order to detect and respond in time to possible attacks. Proper methods and tools capable of managing the complexity [12] can guarantee not only the protection of the vehicle but also, and above all, that of the people.
In this research, we take into consideration the vehicle's internal network, the Controller Area Network (CAN), as it allows the safety-critical electronic control units (ECUs) attached to it to broadcast information in the form of CAN packets among themselves and to other connected buses through several gateways [13].
The CAN bus system presents several critical vulnerabilities [14]. For example, receiving nodes are unable to verify whether a received packet is legitimate, since the origin of the packets is not provided. Consequently, ECUs can be used by attackers to falsify and send fake CAN packets. This makes a CAN bus system insecure and poorly equipped to identify which nodes launched the attacks.
In view of these considerations, security systems to protect the CAN bus became an urgent need [15]. Intrusion detection techniques traditionally used in network security cannot be implemented in the automotive domain. Since network-based attacks are relatively new in the automotive field, new challenges arise in the development of efficient and adaptable systems for securing automotive networks [16]. Since the CAN protocol can be frequently modified, a Machine Learning approach could be the proper way to implement a detection method which, learning by examples, is able to adapt to any change in the protocol. Most of the intrusion detection systems (IDSs) based on Machine Learning proposed in the literature are deployed in a supervised manner. This requires data to be completely labelled, which can be unfeasible considering the high volume of data generated by a real-time CAN in milliseconds [13]. Thus, an anomaly-based detection system implemented using an unsupervised Machine Learning approach is more desirable and convenient.
In the present work, we introduce a distance-based intrusion detection system aimed at identifying attacks sent on the CAN bus. The proposed system is based on an unsupervised Kohonen Self-Organizing Map (SOM) network, an Artificial Neural Network that can be trained through both supervised and unsupervised learning. The algorithm maps a high-dimensional data space to a low-dimensional one, preserving the topological properties of input data. It is a powerful classifier able to separate normal from anomalous data while preserving the topological relationship between the features with no need for labels. Due to its powerful capabilities in clustering and visualization of complex, high-dimensional data, the SOM network evolved extensively, thereby finding many applications in intrusion detection [17]. The algorithm showed high performance as an anomaly detector in real-time systems and, compared to other intrusion detection techniques, revealed better performance by showing a shorter training time and higher detection efficiency [18][19][20]. The SOM network is finding applications in new areas of security, never approached in a similar way before, such as anomaly-based detection [21]. In the light of these, we propose to test the effectiveness and the efficiency of an unsupervised Kohonen SOM network in the automotive domain, since, to the best of our knowledge, it was never tested in said area before.
Many hybrid methods based on the integration of the Kohonen SOM network with other clustering methods were proposed in order to improve detection accuracy and reduce false alarm rates. One of the most common methods is based on the combination of the Kohonen SOM network with the k-means algorithm [22,23]. Intrusion detection networks were usually tested on the well-known KDD99 dataset and on its refined version, the NSL-KDD dataset [24]. These datasets have large numbers of features and experiments. References [25][26][27][28] showed that detection accuracy achieved high performance when all the features were included in the analysis. The richer the feature space is, the higher the detection rate achieved [17].
The goal of the present research is to evaluate the performance of an anomaly-based intrusion detection system using an unsupervised Kohonen SOM neural network for the identification of attack messages sent on the CAN bus. We propose a novel distance-based procedure to integrate the Kohonen SOM network and the k-means algorithm, which greatly improved accuracy in detecting attack messages compared to the traditional procedure.
The proposed method and the traditional procedure based on the combination of the Kohonen SOM network and the k-means algorithm were tested on open source data concerning traffic messages sent on a CAN bus 2.0B with a very complex structure, characterized by a large volume of traffic with a low number of features and a highly imbalanced data distribution. The dataset contains more than 2000 different kinds of messages sent totally at random on the CAN bus, all of which were included in the analysis. Despite the complex structure of the dataset, the proposed method showed high detection accuracy with a low false negative rate.
The main contributions of the paper are:

• An anomaly-based IDS implemented using unsupervised learning to identify intrusions on an in-vehicle communication network, in particular, a CAN bus. To the best of our knowledge, this is the first work which tests an unsupervised Kohonen SOM network as an anomaly detector in the automotive domain.

• A novel distance-based procedure to integrate the Kohonen SOM network and the k-means algorithm, which greatly improves accuracy in detecting attack messages compared to the traditional procedure. Moreover, the proposed method significantly reduces the false negative rate, which is of great importance in attack detection for in-vehicle CAN buses: its value should be very low to ensure the safety of the vehicle.

• The proposed method was tested on real car-hacking data, including DoS, spoofing the drive gear, spoofing the RPM gauge, and fuzzy attacks. Data were obtained by logging CAN traffic via the OBD-II port from a real vehicle while message injection attacks were being performed. The performance of the proposed method was shown by computing evaluation metrics.

• The proposed method achieved remarkable results both on single datasets containing only one type of attack and on a unique dataset merging all types of attack.

• Most of the studies in the literature propose IDSs only able to detect periodic attacks but not aperiodic violations. The proposed method reveals remarkable performance in detecting both periodic and aperiodic intrusions.
The paper is organized as follows: Section 2 illustrates related works; Section 3 describes the CAN message structure; Section 4 explains the theoretical background used in the work; Section 5 shows the experimental process; Section 6 points out results and discussions; Section 7 sets out conclusions.

Related Works
Recent works showed that data generated by connected vehicles can be a great resource for the development of next generation cybersecurity solutions [29]. In [30,31] the authors highlight the trend of increasing research interest in applications of Machine Learning (ML) and Deep Learning (DL) in cybersecurity for the automotive industry.
The learning types and their applicability in the automotive industry are as follows: supervised ML models, which use labelled automotive data to train a classification model; unsupervised/self-supervised ML models, which create clusters from various vehicle data streams with no need for labelled data, and whose clusters can be further analyzed to detect abnormal behavior; and reinforcement learning models, which, although less mature than the first two, provide a means to develop autonomous cybersecurity solutions that can take human-defined meta-goals as input and make decisions to achieve those goals [29].
A cybersecurity solution need not be limited to a single architecture [32] or a single model [33]. Accordingly, our research investigates a distance-based intrusion detection system based on an unsupervised Kohonen SOM network. Our goal is to test the SOM network in identifying attacks within the CAN bus.
In the literature there are different ML models applied to automotive cybersecurity solutions (Table 1), such as the Bayesian network, to determine whether the vehicle is under attack, but also whether the attack has originated from the cyber or the physical domain [34]; the Deep Neural Network (DNN) to train in-vehicle network packets exchanged between ECUs to extract low-dimensional features, and it is used for discriminating normal and hacking packets [35]; the long short-term memory neural network to detect CAN bus attacks [36]; Convolutional Neural Networks (CNNs) to classify malware samples [37]; and Generative Adversarial Networks to generate the adversarial attacks, which can deceive and evade the intrusion detection system [38].
Each of them provides a solution for a specific class within the vehicle (network security, VANET situational awareness, vehicle intelligence, and others [29]), but there is no trace in the literature of a SOM network application in the automotive context. The Kohonen SOM network is a popular non-linear unsupervised neural network model for solving dimensionality reduction problems [39], and it has mostly found applications concerning security issues [17] because of its high detection rate, short training time, and high versatility [22].
Starting from the results obtained in the application of the Kohonen SOM network, we extended this model to the automotive domain. We propose an intrusion detection system to identify attack messages sent on the CAN bus based on an unsupervised Kohonen SOM network. In many studies, the SOM network was integrated with other clustering methods in order to improve the efficiency of the model [17]. Wang et al. in [23] combined the unsupervised SOM network with the k-means algorithm using two different approaches and tested both methods on the KDD CUP-99 dataset, commonly used to test intrusion detection systems. Their methods showed good stability of efficiency and clustering accuracy. Tan et al. in [22] also proposed an intrusion detection method based on the integration of the unsupervised SOM network with the k-means algorithm and tested their model on the NSL-KDD dataset, a refined version of the KDD CUP-99. Both datasets are characterized by a high number of features, equal to 41, for each connection record. Their method relatively improved the accuracy of network intrusion detection and significantly reduced the number of clustering iterations compared with the SOM network alone.
We applied and evaluated the performance of the unsupervised Kohonen SOM network as an intrusion detection system on an in-vehicle communication network, in particular, on a CAN bus. We present an intrusion detection system based on a novel distance-based procedure for the integration of the Kohonen SOM network with the k-means algorithm and compare it with the traditional procedure. Performance of classification was statistically tested for the two methods using open source car hacking data concerning the traffic of messages sent on CAN bus 2.0B and consisting of four datasets, each containing a different type of attack. The same data were used to test the intrusion detection systems proposed by [40,41]. In their analysis, they individually considered each dataset containing a unique type of attack, whereas we tested the models first on each single dataset, and then by merging the four datasets into one containing all four different types of attack. The structure of the data was very complex, containing a large volume of traffic with a low number of features, equal to 4, and a highly imbalanced data distribution.
The experimentation showed a great improvement in the accuracy of attack message detection and a significantly reduced false negative rate compared to the traditional procedure.

Controller Area Network (CAN)
A Controller Area Network, or CAN, is the most commonly used network for control in automotive and manufacturing applications [42]. The CAN interconnects a network of nodes and it is a serial, multimaster, multicast protocol, so when the bus is free any node can send a message, and all nodes may receive and act on the message. When a node begins to transmit, its messages are prioritized. This allows the node to transmit until the bus becomes idle or until it is superseded by a node with a higher-priority message.
There are four types of CAN messages: the data frame (CAN 2.0A and CAN 2.0B), which is the standard CAN message broadcasting data from the transmitter to the other nodes on the bus; the remote frame, a message that is broadcast by a transmitter to request data from a specific node; the error frame, which may be transmitted by any node that detects a bus error; and the overload frame, which is used to introduce additional delay between data or remote frames. In this research, the CAN 2.0B data frame [40] was taken into consideration (Table 2). The difference between a CAN 2.0A and a CAN 2.0B message is that CAN 2.0B supports both 11-bit (standard) and 29-bit (extended) identifiers.
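The identifier widths and the priority rule described above can be illustrated with a minimal sketch. It assumes the usual CAN arbitration rule, in which a dominant bit (0) overrides a recessive bit (1), so that the numerically lowest identifier wins the bus, and it assumes that all contending frames use the same identifier format. The function names `is_valid_can_id` and `arbitrate` are hypothetical and serve only this illustration.

```python
from typing import Sequence

STANDARD_ID_BITS = 11  # standard identifier (CAN 2.0A and CAN 2.0B)
EXTENDED_ID_BITS = 29  # extended identifier (CAN 2.0B only)


def is_valid_can_id(can_id: int, extended: bool = False) -> bool:
    """Check that an identifier fits in the 11-bit or 29-bit field."""
    bits = EXTENDED_ID_BITS if extended else STANDARD_ID_BITS
    return 0 <= can_id < (1 << bits)


def arbitrate(pending_ids: Sequence[int]) -> int:
    """Return the identifier that wins arbitration among pending frames.

    Assumption: all frames share one identifier format, so the
    numerically lowest identifier has the highest priority.
    """
    if not pending_ids:
        raise ValueError("no node is trying to transmit")
    return min(pending_ids)


if __name__ == "__main__":
    # Illustrative identifiers only; 0x000 outranks every other identifier.
    print(hex(arbitrate([0x316, 0x43F, 0x000])))
```

Under this rule, a node that keeps sending frames with the lowest possible identifier would win every arbitration round, which is one way to read the "dominant CAN IDs" used in the DoS attack described later.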

Unsupervised SOM Neural Network
The Kohonen Self-Organizing Map (SOM) is a type of Artificial Neural Network (ANN) which allows the visualization of high-dimensional data on a two-dimensional map. The Kohonen SOM is a nonlinear mapping network aimed at computing similarities among data in the input layer and representing them in an output layer of interconnected neurons according to spatial constraints [44].
Most neural network techniques use supervised learning based on back propagation methods for updating weights and error correction learning. Training of supervised networks requires a target variable. The Kohonen SOM network differs from other Artificial Neural Networks since it can be trained by unsupervised learning. The SOM network uses competitive learning in order to find similarities among data, clustering them into different classes [45], and it is characterized by a feed-forward structure with a single computational layer [46].
The Kohonen output network consists of a competitive layer where an n-dimensional codebook vector is assigned to each neuron in the map and the vector elements represent weights [47]. SOM networks try to reproduce the topological order of input data through clusters of neurons and neighbors whose number is defined by the size of the map [48]. Spatial constraints entail that neighboring neurons have similar codebook vectors. Input data vectors are assigned to neurons according to defined measures of distance between them [44]. If two different input data vectors are similar, then they will be mapped to neighboring neurons on the network grid. Hence, input data vectors mapped to the same neuron or to neighboring ones are similar.
The SOM algorithm computes similarities between each input data vector and the neurons' codebook vectors in order to find the most similar. The winning neuron, called the Best Matching Unit (BMU), adjusts its codebook vector based on a weighted average in order to move closer to the input vector. The weight of the attraction between the BMU and the input data vector is one of the training parameters of the model, also called the learning rate α. This parameter changes at each iteration, decreasing during the training process and ensuring the convergence of the model [48]. Additionally, the neighboring neurons adjust their codebook vectors in order to better match the input vector, thereby enforcing the spatial constraints that preserve the topology of the map.
Figure 1 shows a simple illustration of the Kohonen SOM algorithm. After determining the number of neurons in the Kohonen map, a codebook vector is randomly initialized for each neuron. Then, the algorithm computes the distance between a randomly selected CAN message vector and all the neurons' codebook vectors in the map. The neuron with the smallest distance is the winner, also called the Best Matching Unit. The algorithm also identifies neurons with similar codebook vectors as neighboring neurons. Both the BMU and the neighboring neurons are updated in order to move closer to the CAN message vector. The same procedure is repeated for all CAN messages and for a given number of iterations.
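The steps just described can be sketched in a few lines of dependency-free Python. This is an illustration under stated assumptions, not the implementation used in the experiments: the distance is Euclidean, the neighborhood is a Gaussian kernel over grid positions, the learning rate and neighborhood width decay linearly, and the names `find_bmu` and `train_som_online` as well as all default parameter values are hypothetical.

```python
import math
import random


def squared_distance(x, w):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, w))


def find_bmu(codebook, x):
    """Index of the neuron whose codebook vector is closest to x."""
    return min(range(len(codebook)), key=lambda j: squared_distance(x, codebook[j]))


def train_som_online(data, rows, cols, epochs=10, alpha0=0.5, sigma0=1.0, seed=0):
    """Online SOM training: update the map after each presented vector."""
    rng = random.Random(seed)
    dim = len(data[0])
    positions = [(r, c) for r in range(rows) for c in range(cols)]
    # Random initialization of one codebook vector per neuron.
    codebook = [[rng.random() for _ in range(dim)] for _ in positions]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)  # present the inputs in random order
        for i in order:
            x = data[i]
            decay = 1.0 - step / total_steps      # assumed linear schedule
            alpha = alpha0 * decay                # decreasing learning rate
            sigma = max(sigma0 * decay, 1e-3)     # shrinking neighborhood
            c = find_bmu(codebook, x)
            for j, w in enumerate(codebook):
                grid_d2 = squared_distance(positions[j], positions[c])
                h = alpha * math.exp(-grid_d2 / (2.0 * sigma ** 2))
                for k in range(dim):
                    w[k] += h * (x[k] - w[k])     # move toward the input
            step += 1
    return codebook, positions
```

Because each update is a convex combination of the old codebook vector and the input, codebook vectors initialized in the unit interval stay there when the inputs are scaled to the same interval.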
The Kohonen SOM network can be trained using the online and the batch algorithms [49]. In the online algorithm, the BMU and the neighboring neurons are adjusted immediately after an input vector is presented to the network. In the batch algorithm, the BMU and the neighboring neurons are updated after all the input data vectors are presented to the network [50].
Formally, we define the input data vectors $X_i$, with $i = 1, \ldots, n$, the Kohonen neurons $R_j$, with $j = 1, \ldots, m$, and the codebook vectors $W_j$, with $j = 1, \ldots, m$, associated with each neuron. The number of elements of the codebook vector equals the number of variables in the input data vector [51]. We also set the number of iterations $t$, with $t = 1, \ldots, s$. The number of iterations to complete the learning process is expressed in epochs. One epoch indicates the steps of the learning algorithm that allow a complete presentation of the input dataset to the network in order to be learned.
Many different measures of similarity can be used to measure the distance between input data vectors $X_i$ and neurons' codebook vectors $W_j$ [48], such as Manhattan, Tanimoto, Bray Curtis, Canberra, and Chebyshev distances. However, Euclidean distance normally gives slightly better classification results, and at epoch $t$ it can be defined as follows:

$$d_j(t) = \lVert X_i(t) - W_j(t) \rVert$$

where the Euclidean distance is computed between the CAN message vector $X_i$, randomly selected, and all the codebook vectors $W_j$, with $j = 1, \ldots, m$, on the Kohonen map. Subsequently, the neuron associated with the codebook vector $W_j$ with the minimum distance to $X_i$ is the winning neuron, the BMU. The distance to the BMU at epoch $t$ is here denoted with the subscript $c$:

$$d_c(t) = \min_{j = 1, \ldots, m} d_j(t)$$

Future Internet 2020, 12, x FOR PEER REVIEW 8 of 25

Once the BMU is found, the neuron and its spatial neighbors are updated by the following:

$$W_j(t+1) = W_j(t) + h_{jc}(t)\,\left[X_i(t) - W_j(t)\right]$$

where $h_{jc}(t)$ is the neighborhood function. The rate of change at different neurons around the BMU depends on the mathematical form of the neighborhood function. This function has a very central role in SOM networks since it preserves the topological properties of the input data. A variety of neighborhood functions can be used, but the most applied in SOM neural networks is the Gaussian neighborhood function:

$$h_{jc}(t) = \alpha(t)\,\exp\left(-\frac{\lVert r_j - r_c \rVert^2}{2\sigma^2(t)}\right)$$

where $\alpha(t)$ is the learning rate function, which decreases monotonically at each iteration $t$, $r_j$ is the position of neuron $j$, $r_c$ is the position of the BMU, and $\sigma(t)$ is the width of the neighborhood.
The online training algorithm of the SOM can be implemented using a stepwise recursive procedure. The pseudo-code is detailed in Algorithm 1.
The batch algorithm is a variant of the traditional online SOM algorithm. Neurons' codebook vectors are adjusted only after all the input data vectors $X_i$ in the input layer are assigned to their BMUs in the Kohonen network. Codebook vectors of the BMU and neighboring neurons are updated as follows:

$$W_j(t+1) = \frac{\sum_{i=1}^{n} h_{jc(i)}(t)\,X_i}{\sum_{i=1}^{n} h_{jc(i)}(t)}$$

where $c(i)$ is the index of the BMU for the input data vector $X_i$ and $n$ is the number of input data vectors, at iteration $t$.
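A minimal sketch of one batch iteration is given here to make the weighted-average update concrete. It assumes a Gaussian neighborhood over grid positions and omits the learning rate, since a common factor cancels between the numerator and the denominator of the weighted average. The function name `batch_update` and the choice to keep the old codebook vector when a neuron receives zero total weight are assumptions made for this illustration.

```python
import math


def squared_distance(x, w):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, w))


def find_bmu(codebook, x):
    """Index c(i) of the neuron whose codebook vector is closest to x."""
    return min(range(len(codebook)), key=lambda j: squared_distance(x, codebook[j]))


def batch_update(data, codebook, positions, sigma):
    """One batch iteration: neighborhood-weighted mean of all inputs."""
    dim = len(codebook[0])
    # Assign every input vector to its BMU before changing any codebook vector.
    bmus = [find_bmu(codebook, x) for x in data]
    new_codebook = []
    for j, w_old in enumerate(codebook):
        numerator = [0.0] * dim
        denominator = 0.0
        for x, c in zip(data, bmus):
            grid_d2 = squared_distance(positions[j], positions[c])
            h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
            denominator += h
            for k in range(dim):
                numerator[k] += h * x[k]
        if denominator > 0.0:
            new_codebook.append([v / denominator for v in numerator])
        else:
            new_codebook.append(list(w_old))  # assumption: leave neuron unchanged
    return new_codebook
```

With a very narrow neighborhood the kernel vanishes for every neuron except the BMU, so each codebook vector becomes the plain mean of the inputs assigned to it, which shows how the batch SOM update relates to a k-means centroid update.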

K-Means Clustering Algorithm
The k-means is one of the most used clustering algorithms for large datasets. It is an unsupervised Machine Learning algorithm which allows one to partition data into K groups, minimizing the variance within the clusters. After determining the number of clusters K, K input data vectors are randomly selected as initial cluster centroids. The Euclidean distance is computed to assign input data vectors to the closest centroids. The K centroids are updated and the input data vectors are reassigned at each iteration. These steps are iteratively repeated until input data assignments stop changing and convergence is achieved.
Formally, we define the input data vectors $X_i$, with $i = 1, \ldots, n$, the number of clusters $K$, and the centroids $\Phi_k$, with $k = 1, \ldots, K$. The pseudo-code is detailed in Algorithm 2.
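The k-means steps described above can be sketched as follows. This is a plain, dependency-free illustration and not the code used in the experiments; the function name `kmeans`, the `seed` and `max_iter` parameters, and the choice to keep a centroid unchanged when its cluster is empty are assumptions.

```python
import random


def squared_distance(x, w):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, w))


def kmeans(data, k, seed=0, max_iter=100):
    """Partition data into k clusters; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # K input vectors are randomly selected as the initial centroids.
    centroids = [list(x) for x in rng.sample(data, k)]
    assignments = None
    for _ in range(max_iter):
        # Assign every input vector to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(x, centroids[c]))
            for x in data
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing: convergence
        assignments = new_assignments
        # Recompute each centroid as the mean of its assigned vectors.
        for c in range(k):
            members = [x for x, a in zip(data, assignments) if a == c]
            if members:  # assumption: an empty cluster keeps its centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


if __name__ == "__main__":
    toy = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
    print(kmeans(toy, 2))
```

On the toy data in the usage example, the two well-separated pairs end up in different clusters regardless of which input vectors are drawn as initial centroids.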

Materials and Methods
We tested a distance-based intrusion detection system to identify attack and anomaly messages injected on a CAN bus. The intrusion detection system was based on a hybrid unsupervised Kohonen SOM neural network in order to improve the efficiency of the model in detecting attack messages. We proposed a novel distance-based procedure to integrate the unsupervised Kohonen SOM network with the k-means algorithm and compared it with the traditional procedure. Both methods were tested on open source data containing four datasets, each with a different type of attack: DoS attack, spoofing the drive gear, spoofing the RPM gauge, and fuzzy attack. The datasets were created by logging the CAN 2.0B network traffic via the OBD-II port from a real vehicle while message injection attacks were being performed [40,52]. Each dataset contains a total of 30 to 40 minutes of CAN traffic with 300 message-injection intrusions, each lasting 3 to 5 seconds. The Kohonen SOM network was first tested on the single datasets separately; then, they were merged into a unique mixed dataset containing all types of attack.
In the DoS attack dataset, attack messages with dominant CAN IDs are injected on the CAN bus with the aim of tampering with the accessibility of the network. The spoofing gear and spoofing RPM datasets contain attack messages concerning, respectively, the drive gear and the RPM gauge, aimed at changing the status on the instrument panel. In the fuzzy dataset, messages with spoofed random CAN IDs and data values are injected in order to damage the vehicle's functionality through the manipulation of normal CAN IDs and data values.
The experimentation was carried out in the following steps (Figure 2):

1. Data pre-processing: Data were pre-processed in order to be given as input to the network. We analyzed open source car hacking data including different kinds of attack messages. In particular, we considered four different datasets containing DoS attacks, spoofing the drive gear, spoofing the RPM gauge, and fuzzy attacks. Data are available at [53].

2. Experimental process: We tested the proposed method based on a novel procedure to integrate the Kohonen SOM network with a k-means algorithm in order to improve the performance of the model in terms of accuracy in detection of attack messages and reduction of false negative rate. We also compared the proposed method with the traditional procedure. Both methods were tested, first, on each attack dataset separately, and then on the mixed dataset.

3. Evaluation of the performances of the models.
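The evaluation step relies on detection accuracy and false negative rate. The sketch below assumes the standard confusion-matrix definitions of these two metrics, with attack messages as the positive class and the labels T (attack) and R (normal) used in the datasets; the function name `detection_metrics` and the toy label sequences in the usage example are hypothetical.

```python
def detection_metrics(y_true, y_pred, attack_label="T"):
    """Accuracy and false negative rate, with attacks as the positive class."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == attack_label:
            if pred == attack_label:
                tp += 1  # attack correctly detected
            else:
                fn += 1  # attack missed
        else:
            if pred == attack_label:
                fp += 1  # normal message flagged as attack
            else:
                tn += 1  # normal message correctly passed
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total if total else 0.0
    false_negative_rate = fn / (fn + tp) if (fn + tp) else 0.0
    return accuracy, false_negative_rate


if __name__ == "__main__":
    truth = ["R", "R", "T", "T", "T"]
    predicted = ["R", "T", "T", "T", "R"]
    print(detection_metrics(truth, predicted))
```

On highly imbalanced data, accuracy alone can look high even when many attacks are missed, which is why the false negative rate is reported alongside it.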

Dataset Pre-Processing
The DoS attack, spoofing the drive gear, and spoofing the RPM gauge datasets present quite regular structures, since messages sent on the CAN bus are characterized by a limited number of different CAN IDs. Figure 3 shows the frequency distribution of the CAN ID identifiers for each dataset and for the mixed dataset. The DoS dataset presents a very simple structure, since it contains normal messages sent with 26 unique CAN IDs and a high frequency of attack messages sent with one different unique CAN ID. The spoofing the drive gear and spoofing the RPM gauge datasets also show regular structures, including a total of 26 unique CAN IDs sending normal messages, one of which sends attack messages as well.

These three datasets are quite simple to analyze, and the proposed model detected all attack messages when tested on them. The fuzzy dataset has a far more complex structure, since messages are sent on the CAN bus using 2017 different unique CAN IDs. Attack messages are sent using all CAN IDs, only 37 of which are also used to send normal messages. Moreover, messages sent using 97.1% of the unique CAN IDs show a relative frequency of less than 0.1% and are transmitted totally randomly. Hence, the mixed dataset, the result of merging all datasets, is even more complex to analyze.
All datasets included the following information: recorded time in seconds (timestamp), identifier of the CAN message in HEX (CAN ID), number of data bytes from 0 to 8 (DLC), data values (DATA[0~7]), and a label R or T which represents, respectively, normal or attack messages (Table 3).
In order to run the network, data were properly pre-processed. CAN ID identifiers and data values were handled with a semantic approach, considering each CAN ID identifier as a category of messages sent by an ECU and data values as the related value information [54]. Since SOM networks can process only numerical data, these categorical data were transformed into a matrix representation using the one-hot encoding technique. Each CAN ID identifier was represented by a column with value 1 if the message was sent with that CAN ID and 0 otherwise. Data values were merged into a single string representing the value information of the CAN message. The new variable obtained was also transformed into a matrix representation using the one-hot encoding technique.
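The one-hot step can be illustrated with a minimal Python sketch. The frames below are hypothetical values chosen for illustration, and the helper name `one_hot` is an assumption, not part of the original implementation.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 column per distinct value."""
    categories = sorted(set(values))
    index = {c: k for k, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # 1 in the column of this message's category
        rows.append(row)
    return categories, rows


# Hypothetical CAN frames: (CAN ID in HEX, list of data bytes)
frames = [
    ("0316", ["05", "21", "68", "09"]),
    ("018f", ["fe", "5b", "00", "00"]),
    ("0316", ["05", "21", "68", "09"]),
]
can_ids = [cid for cid, _ in frames]
# Merge the data bytes of each frame into a single string, then encode it
payloads = ["".join(data) for _, data in frames]

id_categories, id_matrix = one_hot(can_ids)
payload_categories, payload_matrix = one_hot(payloads)
```

Each distinct CAN ID and each distinct payload string becomes one column, so the width of the encoded matrix grows with the number of distinct values observed.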
With regard to time, data were not processed in chronological order, but in relation to the period of the CAN messages. CAN messages can be periodic, sporadic, or aperiodic. Periodic messages occur at regular time intervals, sporadic messages are sent with a minimum time interval, and aperiodic messages are sent at totally random times [41]. Starting from the timestamp, we derived a new variable S whose elements S_i, with i = 1, . . ., n, where n is the total number of messages, express the time in milliseconds between two successive instances sent on the bus with the same CAN ID identifier. The datasets contained high numbers of periodic, sporadic, and even aperiodic messages with different CAN IDs sent at different times and with different frequencies. In the analysis we included all kinds of messages. The analysis of the fuzzy dataset was the most complex since the structure of the dataset was totally random.
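A possible way to derive S from the timestamps is sketched below in Python. The function name and the choice of returning `None` for the first message of each CAN ID are assumptions; the text does not state how messages without a predecessor were handled.

```python
def inter_arrival_ms(timestamps, can_ids):
    """S_i: milliseconds since the previous message with the same CAN ID.

    timestamps are in seconds. The first message of each CAN ID has no
    predecessor and gets None (an assumption made for this sketch).
    """
    last_seen = {}
    s = []
    for t, cid in zip(timestamps, can_ids):
        previous = last_seen.get(cid)
        s.append(None if previous is None else (t - previous) * 1000.0)
        last_seen[cid] = t
    return s


# Hypothetical trace: two CAN IDs interleaved on the bus
timestamps = [0.0, 0.25, 0.5, 1.0]
can_ids = ["0316", "018f", "0316", "018f"]
s = inter_arrival_ms(timestamps, can_ids)
```

For a periodic message the values of S stay close to the nominal period, whereas injected messages with the same CAN ID shorten the observed intervals.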
Finally, numerical variables, namely, S and DLC, were normalized in order to avoid bias in the training process that can be generated when dealing with very large input vectors [47]. Normalization was obtained using a linear transformation to scale numerical variables to have values between 0 and 1 as follows:

$$Z_n = \frac{Z_0 - Z_{min}}{Z_{max} - Z_{min}}$$

where Z_0 is the value of the generic numerical variable Z before normalization and Z_n is the new value of Z after normalization. Z_min and Z_max, respectively, are the minimum and the maximum value of Z in sample data.

We define X_i, with i = 1, . . ., n, as the input data vectors corresponding to CAN messages sent on the CAN bus. X represents the input layer processed by the SOM network. Labeled data were not included in the analysis since we trained using unsupervised learning (see Table 4).
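The linear scaling to the interval [0, 1] can be written as a short Python function. Mapping a constant column to 0 is an assumption made here to avoid division by zero; the text does not discuss that case.

```python
def min_max_normalize(values):
    """Scale values linearly so that min -> 0 and max -> 1."""
    z_min, z_max = min(values), max(values)
    if z_max == z_min:
        # Assumption: a constant variable carries no information, map it to 0
        return [0.0 for _ in values]
    return [(z - z_min) / (z_max - z_min) for z in values]


# Hypothetical DLC values for four messages
dlc = [8, 8, 2, 5]
dlc_normalized = min_max_normalize(dlc)
```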

System Architecture
After data pre-processing, we tested a distance-based intrusion detection system aimed at identifying attack or anomalous messages injected on the CAN bus. We implemented a hybrid unsupervised Kohonen network in order to classify attack and normal messages based on global and local similarity among input data vectors [44] using two approaches. The models were tested on samples of 10,000 CAN message vectors for each dataset.
Initial network learning parameters were defined using a trial and error process. Since our goal was to separate input data vectors into two clusters, attack and normal messages, we trained the network on small maps. Map size does not need to be very large, even with a large number of input data vectors. Training large maps is a time-consuming process since all input data vectors are compared with all the neurons in the map [55].
After many trials on maps of different sizes, we obtained the best performance in terms of prediction accuracy by training the network on a 2 × 2 (four neuron) map using the Gaussian neighborhood function. The learning rate α was initially set to 0.5, linearly decreasing over a training process of 100 epochs. The Euclidean distance was used to compute the similarity between input message data vectors and neurons in the SOM map since it returned the best results. Data were split between 80.00% for training the model and 20.00% for testing prediction accuracy.
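The training configuration described here (2 × 2 map, Gaussian neighborhood, linearly decreasing learning rate starting at 0.5, Euclidean distance) can be sketched in pure Python as follows. This is an illustrative sketch, not the authors' R implementation: the neighborhood width `sigma`, the fixed random seed, and the initialization from randomly chosen inputs are assumptions.

```python
import math
import random


def train_som(data, rows=2, cols=2, epochs=100, alpha0=0.5, sigma=1.0, seed=0):
    """Train a small SOM and return its codebook vectors (one per neuron)."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Initialise codebook vectors with randomly chosen input vectors
    codebook = [list(x) for x in rng.sample(data, len(grid))]
    for epoch in range(epochs):
        alpha = alpha0 * (1.0 - epoch / epochs)  # linear decay of the learning rate
        for x in rng.sample(data, len(data)):  # random presentation order
            # Best matching unit: neuron with the smallest Euclidean distance
            bmu = min(range(len(grid)), key=lambda j: math.dist(x, codebook[j]))
            for j, pos in enumerate(grid):
                # Gaussian neighborhood on the map grid, centred on the BMU
                h = math.exp(-math.dist(pos, grid[bmu]) ** 2 / (2 * sigma ** 2))
                codebook[j] = [w + alpha * h * (xi - w)
                               for w, xi in zip(codebook[j], x)]
    return codebook


def bmu_and_distance(x, codebook):
    """Index of the winning neuron for x and the distance to it."""
    dists = [math.dist(x, w) for w in codebook]
    j = min(range(len(dists)), key=dists.__getitem__)
    return j, dists[j]


# Hypothetical, already normalized input vectors
data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
codebook = train_som(data, epochs=5)
winner, distance = bmu_and_distance(data[0], codebook)
```

Every input is compared with every neuron at each step, which is why a four-neuron map keeps the training cost low even for large samples.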
The unsupervised Kohonen network assigned all input data vectors to the four neurons in the R_X map. In order to classify input message vectors into two clusters, attack and normal, we combined the output of the SOM network with the k-means algorithm using two distinct procedures.
Using the traditional procedure, generally used to cluster local data classified by the Kohonen SOM network [17,22,23], the output of the trained network was given as input to a k-means algorithm (SOMK-C). The algorithm processed the neurons' codebook vectors in order to classify the four neurons into two groups. Input CAN message vectors were then assigned to the same group as their corresponding neuron.
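The traditional procedure can be sketched in Python as a two-cluster k-means applied to the four codebook vectors, followed by a lookup from each input's winning neuron to that neuron's group. The codebook values, the deterministic initialization, and the function name `kmeans` are assumptions made for illustration.

```python
import math


def kmeans(points, k=2, iters=10):
    """Plain k-means on vectors; assumes at least k distinct points."""
    centers = []
    for p in points:  # deterministic init: first k distinct points (assumption)
        if list(p) not in centers:
            centers.append(list(p))
        if len(centers) == k:
            break
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical codebook vectors of a trained 2 x 2 map
codebook = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05], [0.9, 0.95]]
neuron_group = kmeans(codebook, k=2)

# Each input inherits the group of its winning neuron (BMU indices are hypothetical)
bmu_of_input = [0, 3, 2]
input_group = [neuron_group[j] for j in bmu_of_input]
```

With only four codebook vectors as input, the grouping depends entirely on where the neurons settle, regardless of how far individual messages lie from them.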
Compared to the traditional approach, we propose a novel procedure to integrate the Kohonen SOM network with the k-means algorithm (SOMK-D) (see Figure 4). The proposed approach relies on the assumption that the Kohonen network classifies input vectors in clusters based on the distance between input vectors and neurons. Thus, in the traditional procedure the k-means algorithm processed the neurons' codebook vectors, whereas in the proposed approach the distances between input vectors and neurons were processed.
The SOMK-D procedure can be illustrated as follows:

• Input CAN message data vectors were weighted depending on their frequency in the whole traffic dataset, since the nature of attack and normal messages sent in the CAN bus differs in terms of structure and frequency.
• Distances between weighted input CAN vectors and neurons in the R_X map were computed by training the Kohonen SOM network.
• Distances were used as input to the k-means algorithm.
• The k-means algorithm was implemented to classify input vectors into two clusters based on their distance to the corresponding winning neuron (BMU).
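The last two steps of the list can be sketched in Python as a two-cluster k-means on the scalar distances between each input and its winning neuron. The frequency weighting is omitted because its exact form is not given in the text; the codebook, the inputs, and the initialization of the two centers at the extreme distances are assumptions.

```python
import math


def bmu_distance(x, codebook):
    """Euclidean distance from x to its best matching unit."""
    return min(math.dist(x, w) for w in codebook)


def kmeans_1d(values, iters=10):
    """Two-cluster k-means on scalars; returns labels and the two centers."""
    low, high = min(values), max(values)  # assumed initialisation at the extremes
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [0 if abs(v - low) <= abs(v - high) else 1 for v in values]
        near = [v for v, lab in zip(values, labels) if lab == 0]
        far = [v for v, lab in zip(values, labels) if lab == 1]
        if near:
            low = sum(near) / len(near)
        if far:
            high = sum(far) / len(far)
    return labels, (low, high)


# Hypothetical codebook of a trained 2 x 2 map and a few input vectors
codebook = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
inputs = [[0.0, 0.1], [1.0, 0.9], [0.5, 0.5], [0.1, 1.0], [0.45, 0.5]]
distances = [bmu_distance(x, codebook) for x in inputs]
labels, centers = kmeans_1d(distances)
```

In this sketch label 1 collects the inputs that lie far from every neuron. Which of the two clusters corresponds to attack traffic has to be established separately, for instance by comparison with the labels during evaluation.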
The algorithm is detailed in Algorithms 3 and 4. Results highlight a great improvement in terms of detection accuracy and a significant reduction of the false negative rate compared to the traditional procedure, as shown in Section 6.
The Kohonen neural network was implemented using the R project available at the repository: http://cran.r-project.org.

Performance Evaluation Metrics
Traditional classification metrics were used to evaluate the performances of the two methods described above on the datasets.
In particular, let TP (true positive) and TN (true negative) be the numbers of CAN messages correctly classified, respectively, as attack or normal, and let FP (false positive) and FN (false negative) be the numbers of CAN messages incorrectly classified, respectively, as attack or normal. We calculated:

• Accuracy, the ratio of correctly classified instances:
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
• Precision, the ratio of correctly detected attacks to the total of detected attacks:
$$Precision = \frac{TP}{TP + FP}$$
• Recall, the ratio of correctly detected attacks to the total of actual attacks, including undetected ones:
$$Recall = \frac{TP}{TP + FN}$$
• F1, a weighted average of precision and recall:
$$F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$$

We also computed the false negative rate (FNR), that is, the fraction of undetected attacks, as follows:
$$FNR = \frac{FN}{FN + TP}$$
The FNR measure has great importance in attack detection for in-vehicle CAN buses, and its value should be very small since even a very small number of undetected attacks can cause damage in the vehicle, impairing safety.
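The measures listed above can be computed from the four confusion-matrix counts with a short Python function. The counts in the example are hypothetical, and returning 0 when a denominator is zero is a convention assumed for this sketch.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1 and false negative rate from counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "fnr": fnr}


# Hypothetical counts: 50 attacks (40 detected), 50 normal messages
metrics = classification_metrics(tp=40, tn=50, fp=0, fn=10)
```

Since recall and FNR share the denominator TP + FN, they sum to 1 whenever at least one attack is present, so a reduction of the FNR is equivalent to an increase in recall.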


Results and Discussion
As described in Section 5, we built a distance-based intrusion detection system aimed at identifying attack messages injected into the CAN bus. The system was based on an unsupervised Kohonen SOM network combined with a k-means algorithm using two different approaches. The models were tested first by training individually on the DoS, spoofing gear, spoofing RPM, and fuzzy datasets, and then by merging the four different kinds of attack into a single mixed dataset. The test was conducted on samples of 10,000 CAN message vectors for each dataset. Samples were randomly selected, balancing the ratio of CAN ID identifiers, and were split into 80.00% training and 20.00% test sets, again stratifying by CAN ID identifier for all the datasets with the exception of the fuzzy dataset. The fuzzy dataset contained a high number of different message identifiers: 2017 unique CAN IDs, 97.1% of which showed a relative frequency of less than 0.1%. Since the nature of attack and normal messages sent in the CAN bus differs in terms of structure and frequency, we weighted input message data vectors depending on their frequency in the whole traffic dataset. The training sets were used to train the algorithms and the test sets to evaluate the models' performances.
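An 80/20 split stratified by CAN ID could look like the following Python sketch. The exact sampling procedure is not specified in the text, so the per-group shuffling, the rounding rule, and the function name `stratified_split` are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(items, key, train_frac=0.8, seed=0):
    """Split items into train/test, keeping train_frac of each key group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    train, test = [], []
    for members in groups.values():
        rng.shuffle(members)  # random selection within each CAN ID
        cut = round(len(members) * train_frac)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Hypothetical messages: (CAN ID, message index)
messages = [("0316", i) for i in range(10)] + [("018f", i) for i in range(5)]
train, test = stratified_split(messages, key=lambda m: m[0])
```

A CAN ID that occurs only once ends up entirely in the training set under this rounding rule, which hints at why stratification is impractical for a dataset dominated by very rare identifiers.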
We trained the unsupervised Kohonen SOM network on a 2 × 2 (four neuron) map as described in Section 5.2. The network computed the codebook vector for each neuron in the Kohonen map and the input CAN message vectors were assigned to the closest neurons. Figure 5 illustrates results of the unsupervised Kohonen SOM network, respectively, for DoS, spoofing gear, spoofing RPM, fuzzy, and mixed datasets. In particular, Figure 5 shows input CAN message vectors assigned to neurons in the map by the Kohonen SOM network for test sets. It is evident that attack and normal messages are well separated in DoS, spoofing gear, and spoofing RPM datasets. The distinction is less clear in the fuzzy dataset, due to its more complex structure, and, consequently, in the mixed dataset. Following the traditional procedure, the codebook vectors computed by the Kohonen SOM network were given as input to the k-means algorithm, in order to cluster the neurons in two groups, attack and normal (SOMK-C). In each dataset, the k-means algorithm clustered three neurons as normal, represented in yellow, and one neuron as attack, the red one. Results for test sets are shown in Figure 6.
Figure 7 represents, for each dataset, the distance between each input CAN message vector and its winning neuron BMU computed by the Kohonen SOM network for the training set. Blue and green points in the plots represent, respectively, attack and normal messages. Table 5 shows distance metrics distinguished for attack and normal messages as classified by the SOMK-C for the training set. Plots and related metrics highlight regular patterns and clear separation in distances in DoS, spoofing gear, and spoofing RPM datasets, which were slightly less evident in fuzzy and mixed datasets. In light of this, we tried to improve the efficiency of the Kohonen SOM network in identifying attack messages using a second procedure that takes into account the evident separation in distances. We tried to enhance the separation between attack and normal messages by clustering distances from input CAN message vectors and corresponding winning BMU neurons. With that aim, after computing Euclidean distances between each input CAN message vector and its winning BMU neuron, we clustered them into two groups using the k-means algorithm (SOMK-D). Results for the test sets are shown in Figure 8, where it is evident how this procedure greatly improves the efficacy of the model in terms of detecting attack messages, completely succeeding in DoS, spoofing gear, and spoofing RPM datasets.
Evaluation metrics for SOMK-C and SOMK-D procedures are shown in Tables 6 and 7, respectively, for training and test sets. The SOMK-D approach, compared to SOMK-C, improves evaluation metrics in all the datasets. In SOMK-D, values of accuracy, precision, recall and F1 increase to 100.00% for DoS, spoofing gear, and spoofing RPM datasets in both training and test sets. The false negative rate (FNR) is, consequently, equal to 0.00% in the same datasets. In the fuzzy dataset, SOMK-D improved accuracy from 73.20% to 99.58% in the training set and from 72.55% to 99.40% in the test set. Precision and recall increased from 0.46% and 0.20%, respectively, to 99.66% and 98.07% in the training set, and from 0.00% and 0.00%, respectively, to 98.92% and 97.87% in the test set. The false negative rate (FNR) was reduced from 99.80% to 1.93% in the training set and from 100.00% to 2.13% in the test set. A slight overfitting is noted in the mixed dataset.
The Kohonen SOM network was trained for 100 epochs and the output processed by the k-means algorithm converged in less than 10 iterations in all datasets both for SOMK-C and SOMK-D.

Conclusions
The great diffusion of embedded and portable communication devices on modern vehicles and smart transport systems enables communication with internal and external devices, networks, applications, and services. As vehicle connectivity becomes common, new security risks emerge, since communication protocols, such as the CAN network, are still insecure and vulnerable to attacks. For this reason, there is an increasing interest in automotive cybersecurity for in-vehicle communication systems [56].
In this work we propose a distance-based intrusion detection system based on an unsupervised Kohonen SOM network. The SOM network has found general applications in security issues because of its features of high detection rate, short training time, and high versatility.
In our work, we introduced the Kohonen SOM network as an intrusion detection system for in-vehicle communication networks aimed at identifying attack messages injected on a CAN bus. In a previous paper we showed the results of the implementation of a hybrid supervised Kohonen SOM network combined with other techniques for clustering in order to improve the performance of the network. In the present work we implemented an unsupervised Kohonen SOM network combined with a k-means algorithm in order to improve the efficiency of the model in detecting attack messages. We proposed a novel distance-based procedure to integrate the unsupervised Kohonen SOM network with the k-means algorithm and compared the proposed method to the traditional procedure.
We tested both methods on open source data containing four different datasets: DoS attack, spoofing the drive gear, spoofing the RPM gauge, and fuzzy attack. We first used the networks on separate datasets with single kinds of attack, and then merged them all together in a mixed dataset. The mixed dataset presented a highly complex structure to analyze, characterized by a large volume of traffic with a low number of features and a highly imbalanced data distribution with more than 2000 different attack types sent totally randomly. Despite the complex structure of the CAN network dataset, the proposed method showed high performance in detection accuracy, completely succeeding in the DoS attack, spoofing the drive gear, and spoofing the RPM gauge datasets. Moreover, it significantly reduced the false negative rate, which has great importance in attack detection for in-vehicle CAN buses in order to ensure the safety of the vehicle.
Due to the limited availability of open source data, this work only tested the proposed IDS on a limited number of attacks. Future work should investigate the development of the Kohonen SOM network as an anomaly detector by extending it to other types of attacks on a CAN bus.


Figure 4. SOMK-D algorithm.

Algorithm 3.1: SOMK-D algorithm.
Input: Input layer consisting of weighted CAN message vectors X_i with i = 1, …, n
Output: Output Kohonen layer R_X containing final codebook vectors W_j associated to neurons R_j with j = 1, …, m
Results: Compute distances between CAN message vectors X_i and winning neurons BMU in the Kohonen map

W_j ⟵ ∅
m ⟵ 4 // set number of neurons
for j = 1 : m do
    W_j ⟵ random(X_i) // random codebook vectors initialization
end for
α ⟵ 0.05 // set initial learning rate
s ⟵ 100 // set number of epochs
n ⟵ 10,000 // set number of CAN messages
for t = 1 : s do
    for i = 1 : n do
        X_i ⟵ random(X_i) // random selection of a CAN message vector
        for j = 1 : m do
            D_ij = ‖X_i − W_j‖ // compute Euclidean distance
        end for
        D_ic = min_j D_ij // compute the winning neuron BMU
        for j = 1 : m do
            W_j = W_j + α h_cj [X_i − W_j] // update neurons
        end for
    end for
    update(α)
end for
for i = 1 : n do
    D_i = ‖X_i − W_c‖ // compute distances from CAN message vectors to winning neurons BMU
end for


Figure 5. Unsupervised Kohonen SOM network trained on the test set.

Figure 6. SOMK-C trained on the test set.

Figure 7. Distances from input CAN message vectors to the winning neurons' Best Matching Unit (BMU) computed by the Kohonen SOM network for the training set.

Figure 8. SOMK-D trained on the test set.

Table 1. Machine learning solution in automotive.

Table 2. Controller Area Network (CAN) bus 2.0A and CAN bus 2.0B structure.

SOM algorithm.
Input: Input layer consisting of CAN message data vectors X_i with i = 1, …, n
Output: Output Kohonen layer R_X containing final codebook vectors W_j associated to neurons R_j with j = 1, …, m
Results: Assign CAN message data vectors X_i to the winning neuron BMU on the Kohonen map
set(n, m, s) // set number of messages, number of neurons, number of epochs
// compute the winning neuron BMU
for j = 1 : m do
    W_j = W_j + α h_cj(t) [X_i − W_j] // update neurons

Table 5. Distances between input vectors and winning neurons' BMU metrics for attack and normal messages as classified by the SOM network for the training set.

Table 6. Evaluation metrics for the training sets.

Table 7. Evaluation metrics for test sets.