A Kohonen SOM Architecture for Intrusion Detection on In-Vehicle Communication Networks

: The di ﬀ usion of connected devices in modern vehicles involves a lack in security of the in-vehicle communication networks such as the controller area network (CAN) bus. The CAN bus protocol does not provide security systems to counter cyber and physical attacks. Thus, an intrusion-detection system to identify attacks and anomalies on the CAN bus is desirable. In the present work, we propose a distance-based intrusion-detection network aimed at identifying attack messages injected on a CAN bus using a Kohonen self-organizing map (SOM) network. It is a power classiﬁer that can be trained both as supervised and unsupervised learning. SOM found broad application in security issues, but was never performed on in-vehicle communication networks. We performed two approaches, ﬁrst using a supervised X–Y fused Kohonen network (XYF) and then combining the XYF network with a K-means clustering algorithm (XYF–K) in order to improve the e ﬃ ciency of the network. The models were tested on an open source dataset concerning data messages sent on a CAN bus 2.0B and containing large tra ﬃ c volume with a low number of features and more than 2000 di ﬀ erent attack types, sent totally at random. Despite the complex structure of the CAN bus dataset, the proposed architectures showed a high performance in the accuracy of the detection of attack messages.


Introduction
In recent times, cyber security is a key word in the automotive sector. It is necessary on the one hand to preserve the confidentiality, integrity and availability of such systems, and on the other hand, to ensure the safety of people. An example of such a need is visible in smart cities where the development of smart mobility [1,2] and autonomous driving involve different projects [3,4].
Driving and parking assistants, internet connectivity, and in general all the cyber-physical features installed in cars, involve risks in terms of security and privacy. One of many is the possibility to obtain personal information and vehicle data such as fuel consumption, oil level, and handbrake operation.
In the automotive sector, given the critical nature of the software installed on board, in addition to the approaches to continuous software improvement [5][6][7] and software security [8,9], it is necessary to have ad hoc methods and tools capable of managing the complexity [10] and criticality of cyber physical systems that include physical objects equipped with a microprocessor or microcontroller and a memory bank. In this context, the so-called embedded system is the electronic control unit (ECU) [11] that governs various functions inside the vehicle.
The structure of automotive machines is divided into subsystems, each one in charge and separated from the others for each function, some of which have a clear impact on the safety of people on board: power train system; chassis control system; body control system; and infotainment system. By accessing the car's internal communication network, the controller area network (CAN) [12], it is possible to read the data exchanged between the ECUs, modify, replicate and/or block them, thus causing problems of various nature; from minor ones, such as providing the driver with incorrect information regarding some vehicle consumption data, to major ones such as damaging the car brakes by blocking messages before they reach the ECUs [13]. Since the CAN bus protocol does not provide security systems against cyber and physical attacks, and given the increasing number of novel types of attack, it is desirable to have a system to detect any kind of attacks. Traditional intrusion detection systems (IDS), normally used in network security, cannot be directly used for vehicular networks. Due to the high interest in the automotive sector in recent years, an effective and efficient IDS that can work for in-vehicle networks is an important need [14]. The purpose of the paper is to test an anomaly-based IDS implemented using an artificial neural network that, compared to other machine learning approaches, can show remarkable performance in detecting attacks on in-vehicle networks.
Given these premises, we propose a distance-based IDS for the identification of attack messages on an in-vehicle CAN bus. The system has been implemented using the Kohonen self-organizing map (SOM) network, an efficient and fast intrusion detection method that can be trained both as supervised and unsupervised learning. Due to its self-organizational and topological properties, the algorithm has powerful capabilities in the clustering and visualization of complex, highly dimensional data, and has found broad applications in security issues. In particular, the Kohonen SOM network has revealed remarkable performance as an anomaly detector in real-time systems. Compared to other intrusion detection methods, the Kohonen SOM network has shown better performance with a shorter training time and higher detection efficiency [15][16][17]. A recent review of the literature shows that the use of the SOM algorithm opened up new areas of security, which have not been tackled in a similar way before, such as anomaly-based detection [18]. In view of these considerations, we propose to test the Kohonen SOM network on an in-vehicle communication network, since, to the best of our knowledge, it was never tested in such domain.
In our research, we implemented an anomaly-based IDS for the identification of attack messages injected on the CAN bus using both supervised and unsupervised SOM networks [19] combined with a K-means clustering algorithm.
In this paper, we present only the results for the supervised SOM network comparing two different approaches. The first was a supervised X-Y fused Kohonen network (XYF), the second was a supervised XYF Kohonen network combined with a K-means clustering algorithm (XYF-K). The two methods were tested on an open source dataset concerning traffic data messages on a CAN network 2.0B and containing more than 2000 different attack types sent totally at random. The dataset contains a large volume of traffic data messages with a low number of features and highly imbalanced data distribution. Despite the complex structure of the CAN bus dataset, the proposed methods showed a high performance in the accuracy of detection of attack messages. This paper presents the following contributions: • We propose an anomaly-based IDS to detect attack injections on a CAN bus. To the best of our knowledge, this is the first work which tests the Kohonen SOM neural network as IDS for detection on in-vehicle networks. The SOM network can be implemented both as supervised and as unsupervised learning.

•
We tested the proposed IDS on real data concerning in-vehicle network attacks. Data consists in four datasets, each containing a different attack type, in particular, Denial of Service Attack (DoS), spoofing the drive gear, spoofing the Revolution per Minute (RPM) gauge, and fuzzy attacks. Data were obtained by injecting and logging CAN messages on a real vehicle [20]. Moreover, we present the effectiveness of the proposed method showing the performance metrics.

•
Most of the researches proposed in the literature present detection methods able to identify only periodic malicious attacks, but not aperiodic violations [18]. The proposed IDS show a high performance in detecting attacks sent on the CAN bus with any kind of period, also aperiodic attacks. • We compared the results obtained in terms of the system's accuracy to the other studies that tested different IDS approaches on the same dataset. • This is the first work which tests an IDS not only on single datasets containing different attack types but also merging datasets into a unique dataset containing all attack types.
This paper is organized as follows: Section 2 illustrates the related works; Section 3 describes the CAN message structure; Section 4 explains the theoretical framework used in the work; Section 5 shows the experimental process; in Section 6 are pointed out the results and discussions; Section 7 sets out the conclusions.

Related Works
The attackers can maliciously interfere with vehicular communication that occurs both internally and externally to a vehicle [21]. Internal vehicular communications can take place within the in-vehicle network that is based upon the intercommunication of the many ECUs; instead, the external vehicular communications occur when the vehicle connect directly to USBs and maintenance tools. In this research work, we take into consideration the in-vehicle network with the objective of defining a model for the identification of attack messages on an in-vehicle CAN bus.
CAN transmissions are broadcast to all nodes and according to [22], the attacks could replace an authorized ECU program with an illegitimate, malicious program, and connecting to the CAN bus using an unauthorized device. Therefore, the attackers can perform a Masquerading Attack i.e., masquerades as a legitimate node because the CAN frame are not encrypted and it does not support message authentication [21]; Eavesdropping Attack, the attackers can intercept the CAN transmissions and identify patterns in valid CAN frames [23]; Injection Attack to inject false messages into an automotive bus system; Replay Attack, where valid frames are sent continuously in order to impede the vehicle's real-time functioning; Denial of Service Attack (DoS), which occurs when an attacker continually sends high priority messages that block legitimate low priority messages [21]; Distributed Denial of Service (DDoS) Attack, the attackers launch an attack from different locations and is more severe in the vehicular environment because the mechanism of the attack is in a distributed manner, where the impact is dispersed in the network [24]; Bus-off Attack, a type of DoS attack that exploits CAN error handling (bit error, stuff error, CRC error, form error, ACK error) to intentionally shift the victim node to the bus off state. The attacker causes a bit error of the victim node and a message collision by transmitting the frame at the same timing as the victim node. This is repeated every time the victim node sends a message, and puts the victim node into the bus-off state [25]; Fuzzy Attack, that is aimed to make the system confused thanks to the manipulation of normal CAN ID and data fields [26]; Impersonate attack is aimed to make legitimate engine speed signal (RPM) and gear components behave abnormally [26,27].
In order to identify attacks on in-vehicle communications, different intrusion detection systems have been developed. The categories of IDS used in the automotive domain comprise anomaly, specification, signature and hybrid-based techniques. Anomaly and specification-based methods detect anything that deviates from a normal behavior, but, in the first approach, normal behavior is learned in a training phase; in the second approach, specifications and rules need to be defined manually by a human expert. The signature-based approach detects attacks by using a set of previously defined signatures, rules and malicious events. Meanwhile, the hybrid-based approach combines several IDS techniques. Due to the characteristics of the CAN protocol, which can frequently be modified, an anomaly-based approach using the machine learning technique can be a suitable detection method. Indeed, it allows the IDS to learn from examples and, thus, to adapt to any change in the protocol with no need of the rules previously defined.
As a result, different machine learning techniques have been used, for example, neural network classifiers, Bayesian classifiers, Bayesian regularized neural networks, naive Bayes, support vector machines (SVM), and self-organization maps (SOM) [28,29].
In [30], an intrusion detection system (IDS) based on a deep convolutional neural network (DCNN) is proposed to protect the CAN bus of the vehicle. The DCNN learns the network traffic patterns and detects malicious traffic without hand-designed features. The proposed model demonstrated a good detection performance against message injection attacks. However, it has a fundamental limitation in detecting the unlearned types of attacks because it is based on supervised learning. Instead, in [31], the authors used an unsupervised deep neural network (DNN) to enhance the security. They trained the parameters of the DNN with probability-based feature vectors extracted from the in-vehicular network packets by using an unsupervised pre-training method of deep belief networks, followed by the conventional stochastic gradient descent method. The DNN provides the probability of each class to discriminate normal and hacking packets, and thus, the system can identify any malicious attack to the vehicle as a result.
DNN is also implemented in [32] treating anomaly detection as a cross-domain modelling problem. The proposed technology provides real-time responses to anomalies and attacks to the CAN bus, and improves the detection ratio.
Instead, in [33] a novel neural network architecture, CANet, is presented. It is trained in an unsupervised manner to detect intrusions and anomalies on the CAN bus. The risks highlighted by the authors are that neural networks require a large amount of training data and are typically more computationally expensive than many other approaches, and in a security setting, they can be sensitive to adversarial attacks. Additionally, in [26], an anomaly detection method using an unsupervised deep learning-based approach, deep contractive autoencoders (DCAEs), was developed. They imposed a different penalty term to the CAN data representation in order to encourage robustness towards small changes.
After analyzing the attacks and the different in-vehicle CAN bus solutions developed to identify them, our research focused on the implementation of an IDS based on a supervised Kohonen SOM network for the identification of attack messages, considering the following types: DoS, spoofing gear, spoofing RPM, and fuzzy.
The Kohonen SOM is a network that can be trained both as a supervised and an unsupervised learning network. There are different works that show the validity of this approach in IDSs. In [34], a SOM network was performed for real-time textual classification in order to filter web contents. A SOM network was also used for real-time intrusion detection in home appliances interconnected by a gateway [35]. Moreover, a SOM network was performed for real-time credit card fraud detection in order to analyze customer behavior [36]. The network showed high capabilities in identifying hidden patterns otherwise difficult to understand. In [37], a self-organizing feature map neural network combined with a K-means clustering algorithm (KSOM) was proposed to improve the efficiency of the intrusion detection network by leveraging the characteristic of the K-means clustering algorithm, such as a faster clustering speed. The model was tested on the NSL-KDD dataset, a classic well known IDS dataset, containing four attack categories and approximately 40 different attack types, and a high number of features, equal to 41, for each connection record. Experimental results showed that the iterations of the KSOM algorithm were significantly reduced compared with the SOM network model and the accuracy was significantly improved with respect to other methods of intrusion detection network.
Taking into account the strong capabilities of the SOM network as an anomaly detector in real-time systems, Kohonen SOM networks have been applied to many different areas of security [18] but have never been performed on an in-vehicle communication network.
The contribution of our research is to introduce a distance-based IDS using a Kohonen self-organizing map network integrated with a K-means clustering algorithm on a CAN bus for identifying attack messages on the bus. Results and discussions for the supervised SOM are presented in this paper. We compared two different approaches. In particular, we first performed a supervised X-Y fused Kohonen network (XYF) and then combined the XYF Kohonen network with a K-means clustering algorithm (XYF-K) in order to improve the efficiency of the intrusion detection network. The two procedures were tested on an open source car hacking dataset containing normal and attack messages in order to detect the attack messages injected on the bus. The dataset was characterized by a low number of features, just five for each message, and four attack categories. It contained messages sent on the CAN bus with different periods that were the time elapsed between sending two messages of the same type. The period could be regular, with a minimum interval time or a totally arbitrary time. The dataset presented a quite complex structure to analyse since it contained a high number of different attack types, more than 2000, with just one occurrence, sent totally at random. Olufowobi et al. [38] implemented an IDS based on a supervised learning approach and tested the model using the described dataset but only including the messages sent at regular or with a minimum interval time. We included in the analysis all kinds of messages and despite the complex structure of the CAN bus dataset, the proposed methods showed a high performance in identifying even the attack messages sent totally at random at arbitrary times on the CAN bus.

Control Area Network (CAN)
The CAN bus is a message-based protocol designed for in-vehicle networks. It enables numerous electronic components, throughout the in-vehicle system, to communicate with each other through a single/dual-wire bus [29]. Consequently, all the nodes in a CAN bus can receive all packet transmissions and a frame is defined as a structure [30]. Figure 1 shows the standard structure, a sequence of CAN data (bytes) in the network. (XYF) and then combined the XYF Kohonen network with a K-means clustering algorithm (XYF-K) in order to improve the efficiency of the intrusion detection network. The two procedures were tested on an open source car hacking dataset containing normal and attack messages in order to detect the attack messages injected on the bus. The dataset was characterized by a low number of features, just five for each message, and four attack categories. It contained messages sent on the CAN bus with different periods that were the time elapsed between sending two messages of the same type. The period could be regular, with a minimum interval time or a totally arbitrary time. The dataset presented a quite complex structure to analyse since it contained a high number of different attack types, more than 2000, with just one occurrence, sent totally at random. Olufowobi et al. [38] implemented an IDS based on a supervised learning approach and tested the model using the described dataset but only including the messages sent at regular or with a minimum interval time. We included in the analysis all kinds of messages and despite the complex structure of the CAN bus dataset, the proposed methods showed a high performance in identifying even the attack messages sent totally at random at arbitrary times on the CAN bus.

Control Area Network (CAN)
The CAN bus is a message-based protocol designed for in-vehicle networks. It enables numerous electronic components, throughout the in-vehicle system, to communicate with each other through a single/dual-wire bus [29]. Consequently, all the nodes in a CAN bus can receive all packet transmissions and a frame is defined as a structure [30]. Figure 1 shows the standard structure, a sequence of CAN data (bytes) in the network. The meaning of each field in a CAN frame is: SOF (start of frame), which specifies the start of a new message with a dominant bit to synchronize all the nodes for CAN message transmission; Arbitration includes 11 bits and can be extended up to 29 bits. The ID field for each transmitted CAN frame indicates the packets' priority. The lower the ID bits, the higher the priority of the packet the value signifies, and the single remote transmission request (RTR) bit is dominant when information is required from another node; Control provides information in order to verify that all the expected packets are received successfully by the receiver. It consists of three fields: the Identifier Extension (IDE) must be recessive for extended format and dominant for standard format, Reserved Bit (RB0) must be dominant for standard format, and DLC (data length code) contains the number of data bytes; Data contains information for CAN nodes to perform actions; CRC (cyclic redundancy check) contains the checksum of the previous data for error detection and in CRC Delimiter (CRC Del) the single bit must be recessive; ACK (acknowledge) ensures that all the nodes involved in the message receive everything correctly, and in case of error, that the transmitter is immediately alerted to send the data packets again. In ACK Delimiter (ACK Del) the single bit must be recessive; EOF (end of frame) indicates the end of a current CAN Message.

Unsupervised Kohonen Network
The Kohonen self-organizing map is a type of artificial neural network (ANN). This technique allows the visualization of high dimensional data reducing the dimensions of data to a map. The Kohonen SOM is a nonlinear mapping network aimed at detecting the similarities among the data in the input layer and at representing them in an output layer of interconnected neurons. Input data are assigned to neurons according to the defined measures of distance between them [39]. The meaning of each field in a CAN frame is: SOF (start of frame), which specifies the start of a new message with a dominant bit to synchronize all the nodes for CAN message transmission; Arbitration includes 11 bits and can be extended up to 29 bits. The ID field for each transmitted CAN frame indicates the packets' priority. The lower the ID bits, the higher the priority of the packet the value signifies, and the single remote transmission request (RTR) bit is dominant when information is required from another node; Control provides information in order to verify that all the expected packets are received successfully by the receiver. It consists of three fields: the Identifier Extension (IDE) must be recessive for extended format and dominant for standard format, Reserved Bit (RB0) must be dominant for standard format, and DLC (data length code) contains the number of data bytes; Data contains information for CAN nodes to perform actions; CRC (cyclic redundancy check) contains the checksum of the previous data for error detection and in CRC Delimiter (CRC Del) the single bit must be recessive; ACK (acknowledge) ensures that all the nodes involved in the message receive everything correctly, and in case of error, that the transmitter is immediately alerted to send the data packets again. In ACK Delimiter (ACK Del) the single bit must be recessive; EOF (end of frame) indicates the end of a current CAN Message.

Unsupervised Kohonen Network
The Kohonen self-organizing map is a type of artificial neural network (ANN). This technique allows the visualization of high dimensional data reducing the dimensions of data to a map.
Appl. Sci. 2020, 10, 5062 The Kohonen SOM is a nonlinear mapping network aimed at detecting the similarities among the data in the input layer and at representing them in an output layer of interconnected neurons. Input data are assigned to neurons according to the defined measures of distance between them [39].
Unlike other techniques in neural networks, SOMs can classify data without supervision, that is to say the model does not require any target variable. It is an unsupervised learning model and differs from other artificial neural networks because it cannot adjust the weights using back propagation methods for error-correction learning. Instead, the SOM network has a feed-forward structure with a single computational layer [40] and uses competitive learning in order to find similarities among the data, clustering them into different classes of data [41]. The Kohonen neural network maps input data on a map of neurons according to spatial constraints.
The Kohonen output network consists of a competitive layer where a n-dimensional codebook vector is assigned to each neuron in the map and the elements of the vector represent the weights [42]. The spatial constraints also imply that neighboring neurons have similar codebook vectors. If two high dimensional input data vectors are similar, then they will be mapped in nearby neurons on the network grid. Hence, input data vectors mapped on the same neuron or in the neighboring ones are similar. SOM networks try to reproduce the topological order of input data through clusters of neurons and neighbors whose number is defined by the size of the map [43].
The SOM algorithm computes similarities between each input data vector and the neurons codebook vectors in order to find the most similar. This neuron is the winner, called the best matching unit (BMU), and it updates its weights vector using a weighted average in order to better match with the input vector. The weight of the attraction of the BMU to the input data vector changes at each iteration and it is one of the training parameters of the model also called learning rate α. The learning rate α decreases during the training process ensuring the convergence of the model [43]. Moreover, the neighboring neurons update their weight vectors moving closer to the input and ensuring the spatial constraints in order to preserve the topology of the map.
A simple description of the algorithm can be explained as follows. After defining the number of neurons in the Kohonen SOM network, a codebook vector is initialized for each neuron. Then, the Euclidean distance is computed between a CAN message vector randomly selected and all the neurons in the Kohonen layer. The CAN message vector is assigned to the neuron with the minimum value of distance, called the winning neuron or best matching unit (BMU). The neurons with the most similar codebook vectors are identified as neighboring neurons. The BMU and the neighboring neurons are updated in order to better match to the CAN message vector. This procedure is repeated for all the CAN message vectors and for a given number of iterations.
Two main algorithms are used to train the SOM network: the online and the batch algorithms [44]. The difference is that in the online algorithm, the BMU and the neighboring neurons are updated immediately after an input data vector is presented to the network, whereas in the batch algorithm, codebook vectors are updated after all the input data vectors are presented to the network [45].
Formally, we defined the input data vectors X i , with i = 1, . . . ,n, the number of Kohonen neurons R j , with j = 1, . . . ,m, and set the number of iterations t with t = 1, . . . ,z [46]. The number of iterations in the training process is expressed in epochs, where one epoch indicates the steps of the training algorithm which allows one complete presentation of all the input data vectors to be learned by the network. The online training algorithm of the SOM can be implemented using a stepwise recursive procedure detailed in Algorithm 1.
Many different measures of similarity can be used to measure the distance between the input data vectors X i and the neurons R j codebook vectors [43] such as Manhattan, Tanimoto, Bray Curtis, Canberra, Chebyshev distances. However, Euclidean distance normally gives slightly better classification results [47]. We define R c (t) as the winning neurons R BMU with the minimum Euclidean distance from the input data vector X i (t) at the iteration t where: R c (t) is the neuron which best matches the input vector X i (t) in the grid at iteration t [48]. Once the winning neuron is found, the neuron and its spatial neighbors are updated by the following: where h jc (t) is the neighborhood function. The rate of change at different neurons around the winning neuron depends on the mathematical form of the neighborhood function. This function has a very central role in the SOM networks since it preserves the topological properties of the inputs. A variety of neighborhood functions can be used, but the most applied in SOM neural networks is the Gaussian neighborhood function: where α(t) is the learning rate function which is a function monotonically decreasing at each iteration t and σ(t) expresses the width of the neighborhood function also decreasing with the training process. r j is the position of neuron j.
A variant of the traditional online SOM algorithm is the batch algorithm where neurons' codebook vectors are adjusted only after all the input data vectors X i in the input layer are assigned to their winning neurons R BMU in the network. At iteration t, the codebook vectors of winning neuron and neighbors' neurons are updated as follows: where c (i) is the index of the winning neuron R BMU for the input data vector X i and n is the number of the input data vectors.

Algorithm 1 Kohonen SOM algorithm
Input: Input layer consisting of CAN message data vectors X i with i = 1, . . . ,n, Output: Output Kohonen layer R X containing final codebook vectors associated to neurons R j with j = 1, . . . ,m, Results: Assign CAN message data vectors X i to the winning neuron R BMU on the Kohonen map set(n, m, z)//set number of messages, number of neurons, number of epochs R j ← ∅ for j = 1, . . . ,m do R j ← random(X i )//random codebook vectors initialization end for set(α) //set initial learning rate while t < z do for i = 1: n do

Supervised Kohonen Network
The Kohonen neural network is broadly known as an unsupervised learning model, although it can be used as a supervised classifier. Further information, such as class information, can be entered in the network with the aim of modeling the relationship between the input data and a dependent variable for predictive purposes [45].
In the supervised SOM networks, two layers are presented to the network, one for the input data and one for the dependent variable. We define X i and Y i , with i = 1, . . . ,n, respectively, the input data and the dependent variable vectors, and R X,j and R Y,j , with j = 1, . . . ,m, the Kohonen neurons, respectively, in the R X and in the R Y map. The dependent variable Y layer has the same topology and the same numbers of neurons in the grid as the input data layer X, but a different number of weights in the codebook vectors. The neurons codebook vectors R X,j and R Y,j have the same number of weights, respectively, of the input X i and the output Y i data vectors.
The supervised Kohonen network can be implemented following different approaches [49]. The training process is the same as the unsupervised network.
The first approach is based on the counter propagation network (CPN), similar to the unsupervised network. The input data X i and the dependent variable Y i vectors are trained simultaneously but separately. The dependent variable layer Y does not influence the training process.
In the X-Y fused (XYF) Kohonen approach, the network is trained on a concatenated layer of input data X i and dependent variable Y i vectors. A fused similarity measure is computed as a weighted combination of separate distances between the input data vectors X i to the neuron codebook vectors R X,j and the corresponding dependent variable Y i vectors to the neuron codebook vectors R Y,j . R X,j and R Y,j are updated simultaneously. At the iteration t, the fused similarity measure D(i, j) for the observation (X i , Y i ) can be expressed as follows: where ϕ(t) is a weight which linearly decreases with iterations. The winning neuron is the one with the minimum D(i, j). The bidirectional Kohonen (BDK) approach uses the same training procedure of the XYF network with the difference that, first, only the codebook vectors R X,j in the R X grid are updated. Then, the codebook vectors R Y,j in the R Y grid are adjusted in a bidirectional way instead of a simultaneous way.
At the end of the training process, the output of the supervised Kohonen network consists of two separate neuron maps: R X for the input data X and R Y for the one dependent variable Y. The prediction for a new object is determined by first computing the winning neuron R X,BMU in the R X grid and then assigning to the object the class of the corresponding neuron R Y,BMU in the R Y grid.

The Proposed Methodology
In this paper, we propose an anomaly-based IDS for the identification of attack messages on an in-vehicle CAN bus. Firstly, we implemented a supervised XYF Kohonen approach to classify attack and normal messages sent on the CAN bus, obtaining good performance in detecting attack messages. The XYF Kohonen network is a power tool for intrusion detection, but it achieves its best performance when combined with other approaches, in particular, clustering algorithms [18]. The K-means algorithm is a clustering algorithm [50], which helps to achieve local clustering and shows high performance in dealing with big datasets. Previous works show how the SOM network coupled with the K-means clustering algorithm achieves better local optimal clustering results, improving the detection rate [51]. Thus, secondly, we implemented a hybrid approach combining the supervised XYF Kohonen network with the K-means clustering algorithm in order to improve the performance of the model in correctly identifying the attack messages injected on the CAN bus. In this approach, the clustering problem is not solved by the XYF network that only calculates the codebook vectors assigned to each neuron. The codebook vectors are, then, used as an input of the K-means algorithm for local clustering in order to achieve clustering results. This approach enables solving a lack of the XYF network in correctly classifying some observations, improving the detection accuracy and reducing the false negative rate. This measure has great importance in attack detection for in-vehicle CAN buses since undetected attacks can undermine vehicle safety.

Experimentation
The research goal addressed in this paper is to implement an anomaly-based IDS for the identification of attack messages on an in-vehicle CAN bus. To this aim, we compared two methods on four different CAN bus attack datasets. The first method was a supervised XYF Kohonen network, and the second was a supervised XYF Kohonen network combined with a K-means clustering algorithm in order to improve the performance of the model and correctly identify the attack messages injected on the CAN bus.
The research goal and research questions were defined according to the goal-question-metric (GQM) paradigm [52,53] as follows: • M1: Given a car hacking dataset, the efficiency of the XYF and the XYF-K algorithms was tested computing the following classification metrics: accuracy, detection rate, false positive rate (FPR), precision, and F1-score. M2: Given a car hacking dataset, the accuracy of the XYF-K algorithm is higher or, at least, equal compared to the accuracy of other IDS approaches, also including the analysis of periodic and aperiodic CAN messages.
For this aim, the experimentation was performed in three main steps: Experimental process: implementation of the XYF Kohonen network and of the XYF-K algorithm first on each dataset separately and then merging all of them in a single mixed dataset including all kinds of attack. The network was performed on random samples of 10,000 split into the training set and test set, for each dataset and for the mixed one. The training set was used to train the network and the test set for prediction. 3.
Evaluation metrics [54]: evaluation of the performances of the two models.

Data Description
We analyzed the open source car hacking data which included four datasets: DoS attack, spoofing the drive gear, spoofing the RPM gauge, and fuzzy attack. The datasets were created by logging the CAN network 2.0B traffic via the OBD-II port from a real vehicle, while message injection attacks were performing [27,30]. Each dataset contained 300 intrusions of message injection for 3 to 5 s time lapse with a total of 30 to 40 min of CAN traffic.
In particular, in the DoS attack database attack messages of 29 zero bits CAN ID, the most dominant in the CAN bus were injected every 0.3 milliseconds; the aim was the tampering of the CAN network accessibility. Since the most dominant CAN ID wins the bus, all other ECUs are precluded to access the network and send messages.
In the spoofing gear/RPM databases, the attack messages of CAN IDs related to gear/RPM information were injected every 1 millisecond. The injected messages transmitted information about the driver gear and the RPM gauge, respectively, changing the original status on the instrument panel. Data were labeled for normal and attack traffic data.
In the fuzzy attack database, the attack messages were similar to the DoS attacks, but CAN IDs and DATA values were injected totally at random every 0.5 milliseconds impairing the vehicle functionality.
An overview of the dataset is shown in Tables 1 and 2.
All the datasets have the same structure and include: timestamp, recorded time in seconds, CAN ID, identifier of CAN message in HEX, DLC, number of data bytes from 0 to 8, DATA [0~7], data values (byte), and labels T or R where T represents the injected attack messages while R represents the normal messages.

Data Preprocessing
Before running the model, we first preprocessed the data. A CAN message is characterized by its period and by its ID, which represents a unique identifier and expresses the message's priority. Data processed in the network were the CAN ID identifier, data values DATA [0~7], DLC, period, and label.
We define some notations: at each timestamp S i with i = 1, . . . ,n, where n is the total number of messages, an ECU sends a message to the CAN bus. Let I k = {I 1 , . . . ,I r } be the set of all CAN ID identifiers in the dataset. For each I k , with k = 1, . . . ,r, we have a set of D Ik = {D Ik 1 , . . . , D Ik s } data values, and s can take different values in each set D Ik .
CAN ID identifiers and data values were regarded with a semantic approach considering each CAN ID identifier I k as a class of messages sent by an ECU and the data values D Ik as the related value information [32]. Since the Kohonen network works by using numerical data, we used the one-hot encoding technique to turn categorical data into a matrix representation. In particular, each CAN ID unique identifier was represented by one column with value 1 if the CAN message instance belongs to that CAN ID, or otherwise 0 ( Table 3). The data values DATA [0~7] were merged in a single string representing the value information related to a particular CAN ID message. This new categorical variable was also turned into a matrix representation using the one-hot encoding technique. We did not process the CAN message data in a sequential way but with respect to their period. CAN messages can be periodic, sporadic, or aperiodic. Periodic messages are sent at regular time intervals, sporadic messages occur with a minimum time interval and aperiodic messages at arbitrary times [19]. Hence, starting from timestamp defined as S i , with i = 1, . . . ,n, we calculated a new variable T whose elements T i , with i = 1, . . . ,n, express the time in milliseconds between two successive occurrences with the same CAN ID identifier. The dataset contained a high number of periodic, sporadic and even aperiodic messages with different CAN IDs sent at different times and with different frequencies. They all were included in the analysis.
Finally, we normalized all the numerical variables in order to avoid bias in the training process that can be generated when dealing with very large input vectors [42]. Normalization was obtained using a linear transformation to scale the numerical variables to have values between 0 and 1.
X represents the input layer processed in the Kohonen network where X i is the i-th input CAN message data vector (Table 3).
A second layer Y, given by the dependent variable label, was presented to the network. The variable label, since it is a categorical variable with the classes "Attack" and "Normal", was also transformed into a matrix with two columns, one for each class, and n rows of 0 or 1, one for each input data vector ( Table 3).
The DoS and spoofing the gear/RPM datasets present a quite regular structure. Normal messages are sent on the CAN bus using 26 different CAN ID identifiers. In a DoS attack, messages are only sent using one different CAN ID, in spoofing the gear/RPM, one of the 26 CAN IDs that send normal messages also send attack messages. These three datasets have a quite simple structure and the proposed method tested on them completely succeeded in detecting the attack messages. The fuzzy dataset shows a more complex structure, since the messages are sent with 2017 different CAN IDs, 37 of which also send normal messages. Moreover, the messages sent using the 97.1% of unique CAN IDs show a relative frequency less than 0.1% and are transmitted totally at random. Hence, the mixed dataset, obtained merging the four datasets, presents a higher complexity for the purpose of the analysis. The number of CAN ID identifiers is shown in Table 4.

Experimental Process
After preprocessing the data, we performed a supervised XYF Kohonen network with the aim of configuring a network able to correctly identify attack messages injected on the CAN bus. The goal was to assign attack messages and normal messages to different neurons on the map since the Kohonen network provides a distance-based output assigning input data vectors based on their global and local similarity [39].
We used a process of trial and error to define the parameters of the supervised XYF Kohonen network. In large maps, searching for winner neurons and updating the information are time-consuming operations. If the neurons in the network are initialized with weight vectors close to the final codebook vectors, the convergence of the model requires to be at least an order of magnitude faster [48]. Starting the training process with a small map allows to calibrate the network in a low number of epochs, even with an unstable convergence. The codebook vectors assigned to the neurons in this first training of the network can be used to properly initialize a larger map that will converge in a longer training process but reducing time consumption. Accordingly, we started to train the XYF network on a 2 × 2 hexagonal map.
We performed the XYF network using the online SOM algorithm which returned higher levels of accuracy than the batch algorithm trained on a 2 × 2 neurons map for a low number of epochs equal to 100. The initial value of the learning rate α was set to 0.5, linearly decreasing to 0.001 at the end of the training. The weights ϕ parameter of the fused similarity measure D(i, j) was set equal to 0.1 for the R X map and equal to 0.9 for the R Y map in order to put more emphasis on the labels variable Y and enforce the spatial clustering of the attack and normal messages. The Gaussian neighborhood function was used to select the neighboring neurons around the winning neuron. The similarity between the input message data vectors and the SOM grid neurons was computed using the Euclidean distance since it retuned the best results.
The network perfectly separated the attack and normal input message data vectors in two distinct neurons in a low number of epochs. Results are shown in Section 6.
After training the XYF network, we expanded the grid introducing in the 2 × 2 neurons hexagonal map new interstitial nodes to form a larger map. In particular, we doubled in two times the size of the original array obtaining, first, a 4 × 4 neurons and then, an 8 × 8 neurons hexagonal map. The XYF networks on the expanded maps were performed initializing the neurons codebook vectors in the R X and R Y maps using the neurons codebook vectors returned in output by the XYF network previously trained. The networks were trained this time using the batch algorithm, since it is faster and computationally more efficient for longer training process. In fact, neurons' codebook vectors are updated only after the complete dataset has been presented to the network [45]. In order to speed up the process, the networks were run using a parallel implementation of the process on multiple cores. The XYF networks trained on both the expanded maps greatly improved the performance and converged in a training process of 500 epochs. Results are shown in Section 6. The XYF networks trained on different map sizes were used to predict the new input message data vectors.
The XYF network showed a powerful performance in classifying the attack and normal input data messages. Nevertheless, we combined the output of the supervised XYF Kohonen network with a K-means clustering algorithm (XYF-K) in order to improve the performance of the model in correctly identifying the attack messages injected on the CAN bus. The codebook vectors of neurons R X,j in the R X map at the end of the Kohonen training procedure were used as input vectors of the K-means algorithm in order to divide the neurons in two disjoint clusters, normal and attack ( Figure 2). Finally, input data message vectors assigned to the neurons belonging to the normal cluster, will be classified as normal messages, input data message vectors assigned to the neurons belonging to the attack cluster will be classified as attack messages. The number of clusters was set equal to 2 in order to cluster neurons and the associated CAN messages in two groups, attack and normal. The initial centroids were generated randomly resampling the initial starting point 30 times, in order to make the k-means algorithm converge to a global optimum. The initial centroids with the lowest sum of squares were finally chosen. The algorithm converged setting a maximum number of iterations equal to 30.
The Kohonen neural network was implemented using the R project available at the repository: http://cran.r-project.org. Appl. Sci. 2020, 10, x FOR PEER REVIEW 13 of 27

Evaluation Metrics
In order to evaluate the performances of the supervised Kohonen network implemented on different datasets, we computed the traditional classification metrics as accuracy, detection rate, false positive rate (FPR), precision, and F1-score [55].
We also computed the false negative rate (FNR) which represents the fraction of undetected attacks as follows: where TP (true positive) and TN (true negative) are the number of messages correctly classified, respectively, as attack or normal, and FP (false positive) and FN (false negative) are the number of messages incorrectly classified, respectively, as attack or normal.
Since even a very small number of undetected attacks can determine tampering in the vehicle, undermining its safety, the FNR measure has great importance in attack detection for in-vehicle CAN buses and its value should be very small.

Experimental Results and Discussion
As explained in Section 5, in the experimentation we performed two methods, the supervised XYF Kohonen network and the supervised XYF-K algorithm, on the DoS, spoofing gear/RPM, and fuzzy datasets in order to identify attack messages on the CAN bus. We processed the samples of 10,000 records, first, for each single dataset separately, then, for the mixed dataset including all kinds of attacks. The samples were randomly selected balancing the ratio of CAN ID messages and were randomly split into 80% training and 20% test set, again balancing the ratio of CAN ID messages both in training and in test set for all the datasets with the exception of the fuzzy dataset. The fuzzy dataset was split balancing the ratio of labels attack/normal because of the high number of different CAN ID messages (2017), the 97.1% of whom show a relative frequency less than 0.1% in the dataset and are transmitted totally at random. The training set was used to train the XYF Kohonen network and the XYF-K algorithm and the test set to evaluate the models' performance.
We started training the XYF Kohonen network on a 2 × 2 neurons hexagonal map. The network assigned a codebook vector to each neuron in the RX and RY grids. The element of the neuron's codebook vector with the maximum value in the RY map determines the label of that neuron. The neurons in the RX map assume the labels of the neurons located in the same position in the RY map.

Evaluation Metrics
In order to evaluate the performances of the supervised Kohonen network implemented on different datasets, we computed the traditional classification metrics as accuracy, detection rate, false positive rate (FPR), precision, and F1-score [55].
We also computed the false negative rate (FNR) which represents the fraction of undetected attacks as follows: where TP (true positive) and TN (true negative) are the number of messages correctly classified, respectively, as attack or normal, and FP (false positive) and FN (false negative) are the number of messages incorrectly classified, respectively, as attack or normal.
Since even a very small number of undetected attacks can determine tampering in the vehicle, undermining its safety, the FNR measure has great importance in attack detection for in-vehicle CAN buses and its value should be very small.

Experimental Results and Discussion
As explained in Section 5, in the experimentation we performed two methods, the supervised XYF Kohonen network and the supervised XYF-K algorithm, on the DoS, spoofing gear/RPM, and fuzzy datasets in order to identify attack messages on the CAN bus. We processed the samples of 10,000 records, first, for each single dataset separately, then, for the mixed dataset including all kinds of attacks. The samples were randomly selected balancing the ratio of CAN ID messages and were randomly split into 80% training and 20% test set, again balancing the ratio of CAN ID messages both in training and in test set for all the datasets with the exception of the fuzzy dataset. The fuzzy dataset was split balancing the ratio of labels attack/normal because of the high number of different CAN ID messages (2017), the 97.1% of whom show a relative frequency less than 0.1% in the dataset and are transmitted totally at random. The training set was used to train the XYF Kohonen network and the XYF-K algorithm and the test set to evaluate the models' performance.
We started training the XYF Kohonen network on a 2 × 2 neurons hexagonal map. The network assigned a codebook vector to each neuron in the R X and R Y grids. The element of the neuron's codebook vector with the maximum value in the R Y map determines the label of that neuron. The neurons in the R X map assume the labels of the neurons located in the same position in the R Y map. Figures 3-7 show the results of the XYF Kohonen network, respectively, for the DoS, spoofing gear/RPM, fuzzy and mixed datasets. In addition, these results show that the supervised Kohonen SOM network can be used as an anomaly detector for an in-vehicle communication network (RQ1).
In the following discussion, we will refer only to the DoS dataset in Figure 3, since analogous considerations can be extended to the other datasets.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 14 of 27  show the results of the XYF Kohonen network, respectively, for the DoS, spoofing gear/RPM, fuzzy and mixed datasets. In addition, these results show that the supervised Kohonen SOM network can be used as an anomaly detector for an in-vehicle communication network (RQ1).
In the following discussion, we will refer only to the DoS dataset in Figure 3, since analogous considerations can be extended to the other datasets.  In the following discussion, we will refer only to the DoS dataset in Figure 3, since analogous considerations can be extended to the other datasets.        Figure 3a shows the 2 × 2 hexagonal map where the neurons which assume the same label have the same color, in particular, yellow for normal and red for attack. Input message data vectors Xi assume the same label of the assigned winning neurons. Figure 3a highlights that input data messages that are clearly separated by the XYF network in the training phase in two distinct neurons.
In the XYF Kohonen network, the prediction for the unknown data messages Xi is determined presenting the new object to the trained network. The network assigns the unknown message Xi to a neuron in the RX map and classifies it as the normal or attack message according to the label of the neuron allocated in the same position as the RY map.
In the training process, the input data vectors are assigned to two of the four neurons, whereas two of the four neurons in the map are empty (Figure 3a). Prediction for the training set correctly classifies all the input data message vectors with an accuracy of 100% in all the datasets, since the network assigns them to the same neurons as in training process. Predictions for the test set are less accurate (Table 5) since the input data message vectors are also assigned to the neurons that were empty in the training process (Figure 3b). The network was not able to label these observations with a loss in accuracy.   Figure 3a shows the 2 × 2 hexagonal map where the neurons which assume the same label have the same color, in particular, yellow for normal and red for attack. Input message data vectors X i assume the same label of the assigned winning neurons. Figure 3a highlights that input data messages that are clearly separated by the XYF network in the training phase in two distinct neurons.
In the XYF Kohonen network, the prediction for the unknown data messages X i is determined presenting the new object to the trained network. The network assigns the unknown message X i to a neuron in the R X map and classifies it as the normal or attack message according to the label of the neuron allocated in the same position as the R Y map.
In the training process, the input data vectors are assigned to two of the four neurons, whereas two of the four neurons in the map are empty (Figure 3a). Prediction for the training set correctly classifies all the input data message vectors with an accuracy of 100% in all the datasets, since the network assigns them to the same neurons as in training process. Predictions for the test set are less accurate (Table 5) since the input data message vectors are also assigned to the neurons that were empty in the training process (Figure 3b). The network was not able to label these observations with a loss in accuracy. Since some input message data vectors X i were not classified by the XYF network, the only evaluation metric we computed was the accuracy calculated as the sum of the TP and TN over the total number of new observations presented to the network. Other metrics could return a biased representation of the network predictions.
Although the XYF network shows a high performance in predicting attack and normal messages, this lack in classifying some observations can be overcome implementing the XYF-K algorithm. After training the XYF Kohonen network, the output was subsequently processed with the K-means clustering algorithm. The neurons codebook vectors of the R X map were used as input for the K-means algorithm, which clustered neurons in two distinct groups, defining neurons labels as normal and attack. When an unknown data message X i is assigned by the XYF network to the winning neuron in the R X map, it is classified as normal or attack according to the clusters defined by the K-means algorithm. Figure 3b,c highlights that the K-means algorithm also clusters empty neurons, classifying all new input data messages presented to the network. The XYF-K algorithm improves the accuracy of the XYF Kohonen network trained on a 2 × 2 neurons map in all the datasets (Table 5). In the DoS spoofing gear and RPM datasets, the XYF-K algorithm succeeded in predicting all the new data messages with an accuracy of 100%.
In order to stabilize convergence of the XYF Kohonen network, as better explained in Section 5, we expanded the 2 × 2 neurons hexagonal map introducing new interstitial nodes and obtaining first a 4 × 4 and then an 8 × 8 neurons hexagonal map. The XYF networks were trained on the expanded grids initialing the neurons codebook vectors in the R X and R Y maps using the neurons codebook vectors returned in output by the previously trained XYF network.
The XYF network trained using the 4 × 4 neurons map greatly improved the performance of the model in all the datasets (Figure 3e). Although the high number of periodic, sporadic and even aperiodic messages with different CAN IDs were sent at different times and with different frequencies, the network returned with an accuracy of 100% in identifying attack messages in the DoS, spoofing gear and RPM datasets (Table 5).
Training the XYF network with the 8 × 8 neurons map, we can see a slight improvement in accuracy only in the fuzzy dataset. Despite the presence of messages with 2017 different CAN IDs sent totally at random in the CAN bus, the network achieved an accuracy of 99.70%. In the DoS and spoofing gear datasets, the accuracy was the same, and in the spoofing RPM and mixed datasets, it was even worse than in the network trained using the 4 × 4 neurons map ( Table 5).
The XYF-K algorithm was also implemented on the expanded neurons maps. The XYF-K algorithm slightly improved the accuracy of the XYF network in the spoofing RPM and fuzzy datasets.
In Figures 8-10, the distances between each input message data vector X i to the corresponding winning unit returned by the XYF network were plotted for each map size. Table 6 also shows the related metrics. The boxplots highlight a clear separation in the distances for the normal and attack messages in all the datasets. This is less evident in the fuzzy dataset, due to the completely random structure of the dataset, and, consequently, in the mixed dataset. Variability in the distance is very low in the network trained with a 2 × 2 map both for normal and for attack messages with standard deviation <0.01 in all the datasets. An increase in variability is evident when training the network using the expanded maps. These results are consistent with the accuracy of the models and confirm how clearly the XYF Kohonen network highlights a distinction between the attack and normal message data. The higher the separation in distance between the two groups, the higher the accuracy of prediction is.    For the XYF-K algorithm, we computed all the evaluation metrics as described in Section 5.4. They are shown in Table 7. Precision, recall and F1 assume values of 100% in the DoS, spoofing gear and RPM datasets for all map sizes. High levels of precision are also shown for the fuzzy dataset, 99.77% using the 8 × 8 neurons map, and for the mixed dataset, 99.19% with the 4 × 4 neurons map. A relevant result is given by the FNR that is 0% in the DoS, spoofing gear and RPM datasets, 0.90% in the fuzzy dataset and 1.61% in the mixed dataset, both with 4 × 4 and 8 × 8 neurons maps.  Table 8 shows a comparison between the XYF-K algorithm and other IDS approaches proposed to detect attack messages sent on the CAN bus. We compared the proposed method in terms of the  For the XYF-K algorithm, we computed all the evaluation metrics as described in Section 5.4. They are shown in Table 7. Precision, recall and F1 assume values of 100% in the DoS, spoofing gear and RPM datasets for all map sizes. High levels of precision are also shown for the fuzzy dataset, 99.77% using the 8 × 8 neurons map, and for the mixed dataset, 99.19% with the 4 × 4 neurons map. A relevant result is given by the FNR that is 0% in the DoS, spoofing gear and RPM datasets, 0.90% in the fuzzy dataset and 1.61% in the mixed dataset, both with 4 × 4 and 8 × 8 neurons maps.  Table 8 shows a comparison between the XYF-K algorithm and other IDS approaches proposed to detect attack messages sent on the CAN bus. We compared the proposed method in terms of the system's accuracy with all the studies that, at the state of the art, tested different IDS approaches on the same dataset (RQ2).  [30]. Experimental results showed that their approach reached a higher performance compared to other machine learning algorithms, in particular, long short-term memory, artificial neural network, support vector machine, k-nearest neighbors, naive Bayes, and decision trees. Olufowobi et al. suggested a specification-based IDS using a response time analysis of the CAN bus. They restricted the analysis only checking periodic and sporadic messages [38]. Seo et al. proposed an anomaly-based IDS using a deep-learning model, in particular, generative adversarial nets (GAN) [27].
Results (Table 8) show how the XYF-K algorithm presents superior performance compared to all the other methods for all the datasets. Only the DCNN proposed by [30] reveals a slightly better accuracy for the fuzzy dataset, in particular, the authors claim an accuracy of 99.82% while in our study it is equal to 99.75%. Overall, we can affirm that the XYF-K algorithm improves the performance than those of the state of the art. Figures 11-13 show the training process of the XYF Kohonen network for the different map sizes on all the datasets. It is evident that the networks trained on 2 × 2 grids are not stable in convergence, whereas networks trained on expanded 4 × 4 and 8 × 8 neurons maps have more stability in convergence. The networks trained on both the 4 × 4 and 8 × 8 neurons grids start to stabilize after 450 iterations.
system's accuracy with all the studies that, at the state of the art, tested different IDS approaches on the same dataset (RQ2).
Song et al. proposed an anomaly-based IDS using a supervised deep convolutional neural network (DCNN) [30]. Experimental results showed that their approach reached a higher performance compared to other machine learning algorithms, in particular, long short-term memory, artificial neural network, support vector machine, k-nearest neighbors, naive Bayes, and decision trees. Olufowobi et al. suggested a specification-based IDS using a response time analysis of the CAN bus. They restricted the analysis only checking periodic and sporadic messages [38]. Seo et al. proposed an anomaly-based IDS using a deep-learning model, in particular, generative adversarial nets (GAN) [27].
Results (Table 8) show how the XYF-K algorithm presents superior performance compared to all the other methods for all the datasets. Only the DCNN proposed by [30] reveals a slightly better accuracy for the fuzzy dataset, in particular, the authors claim an accuracy of 99.82% while in our study it is equal to 99.75%. Overall, we can affirm that the XYF-K algorithm improves the performance than those of the state of the art.

Conclusions
The increasing diffusion of the connected devices in modern vehicles implies a lack in security of the in-vehicle systems such as the CAN bus. Since the CAN bus protocol does not provide security systems against cyber and physical attacks, it is desirable to have a system to detect any kind of attack.
In the present work, we implemented a supervised distance-based intrusion-detection system with the aim of identifying the attack messages injected on a CAN bus. We performed two different methods, the XYF Kohonen network and the XYF-K algorithm. We tested both the models on four open source car hacking datasets containing different types of attack, namely, DoS, spoofing gear, spoofing RPM, and fuzzy. First, single datasets were trained one by one, then the four datasets were combined together. The XYF Kohonen network showed a great performance in identifying attack messages injected on the CAN bus in all the datasets. The XYF-K algorithm, which processed the output of the XYF network using a K-means algorithm, slightly improved the performance of the network. Despite the high number of messages with different CAN ID identifiers sent at different times and with different frequencies, the experimental results showed that both the models completely succeeded in the prevision of all types of attack messages in DoS, spoofing gear and spoofing RPM. Moreover, both models achieved high levels of accuracy even in the fuzzy dataset, despite the presence of 2017 different CAN ID identifiers sent completely at random, and in the mixed dataset.

Conclusions
The increasing diffusion of the connected devices in modern vehicles implies a lack in security of the in-vehicle systems such as the CAN bus. Since the CAN bus protocol does not provide security systems against cyber and physical attacks, it is desirable to have a system to detect any kind of attack.
In the present work, we implemented a supervised distance-based intrusion-detection system with the aim of identifying the attack messages injected on a CAN bus. We performed two different methods, the XYF Kohonen network and the XYF-K algorithm. We tested both the models on four open source car hacking datasets containing different types of attack, namely, DoS, spoofing gear, spoofing RPM, and fuzzy. First, single datasets were trained one by one, then the four datasets were combined together. The XYF Kohonen network showed a great performance in identifying attack messages injected on the CAN bus in all the datasets. The XYF-K algorithm, which processed the output of the XYF network using a K-means algorithm, slightly improved the performance of the network. Despite the high number of messages with different CAN ID identifiers sent at different times and with different frequencies, the experimental results showed that both the models completely succeeded in the prevision of all types of attack messages in DoS, spoofing gear and spoofing RPM. Moreover, both models achieved high levels of accuracy even in the fuzzy dataset, despite the presence of 2017 different CAN ID identifiers sent completely at random, and in the mixed dataset.

Conclusions
The increasing diffusion of the connected devices in modern vehicles implies a lack in security of the in-vehicle systems such as the CAN bus. Since the CAN bus protocol does not provide security systems against cyber and physical attacks, it is desirable to have a system to detect any kind of attack.
In the present work, we implemented a supervised distance-based intrusion-detection system with the aim of identifying the attack messages injected on a CAN bus. We performed two different methods, the XYF Kohonen network and the XYF-K algorithm. We tested both the models on four open source car hacking datasets containing different types of attack, namely, DoS, spoofing gear, spoofing RPM, and fuzzy. First, single datasets were trained one by one, then the four datasets were combined together. The XYF Kohonen network showed a great performance in identifying attack messages injected on the CAN bus in all the datasets. The XYF-K algorithm, which processed the output of the XYF network using a K-means algorithm, slightly improved the performance of the network. Despite the high number of messages with different CAN ID identifiers sent at different times and with different frequencies, the experimental results showed that both the models completely succeeded in the prevision of all types of attack messages in DoS, spoofing gear and spoofing RPM. Moreover, both models achieved high levels of accuracy even in the fuzzy dataset, despite the presence of 2017 different CAN ID identifiers sent completely at random, and in the mixed dataset.
The limit of the model lies in the fact that the SOM network can be time-consuming since it performs a computation at every step of the algorithm. Nevertheless, the architecture of the network enables to run the algorithm in parallel computations. Thus, the diffusion of multi core computational architectures overcomes limitations due to the complexity of the network architecture enhancing the interest of researchers in the application of the Kohonen SOM networks in security issues [15].
In this paper, we tested the proposed method on a limited number of attacks, due to the availability of the open source dataset we analyzed. In future works, we will extend the analysis to other types of attacks. Moreover, we will investigate the development of the Kohonen SOM network as anomaly detector on a CAN bus, also comparing the supervised and unsupervised learning.