A Zero False Positive Rate of IDS Based on Swin Transformer for Hybrid Automotive In-Vehicle Networks

Wang, Shanshan; Zhou, Hainan; Zhao, Haihang; Wang, Yi; Cheng, Anyu; Wu, Jin

doi:10.3390/electronics13071317


by Shanshan Wang ¹,², Hainan Zhou ¹,², Haihang Zhao ¹,², Yi Wang ³,*, Anyu Cheng ¹,²,* and Jin Wu ¹,²

¹ School of Automation, Chongqing University of Posts and Telecommunications, Chongqing 400065, China

² School of Industrial Internet, Chongqing University of Posts and Telecommunications, Chongqing 400065, China

³ Product Cybersecurity & Privacy Office, Continental Automotive Singapore, Singapore 339780, Singapore

* Authors to whom correspondence should be addressed.

Electronics 2024, 13(7), 1317; https://doi.org/10.3390/electronics13071317

Submission received: 7 March 2024 / Revised: 28 March 2024 / Accepted: 29 March 2024 / Published: 31 March 2024


Abstract

Software-defined vehicles (SDVs) make automotive systems more intelligent and adaptable, and this transformation relies on hybrid automotive in-vehicle networks that refer to multiple protocols using automotive Ethernet (AE) or a controller area network (CAN). Numerous researchers have developed specific intrusion-detection systems (IDSs) based on ResNet18, VGG16, and Inception for AE or CANs, to improve confidentiality and integrity. Although these IDSs can be extended to hybrid automotive in-vehicle networks, they often overlook the requirements of real-time processing and of minimizing the false positive rate (FPR), which can lead to safety and reliability issues. Therefore, we introduced an IDS based on the Swin Transformer to bolster hybrid automotive in-vehicle network reliability and security. First, multiple messages from the traffic assembly are transformed into images and compressed via the two-dimensional discrete wavelet transform (2D DWT) to minimize parameters. Second, the Swin Transformer is deployed to extract spatial and sequential features to identify anomalous patterns with its attention mechanism. For a fair comparison, we re-implemented up-to-date conventional network models, including ResNet18, VGG16, and Inception. The results showed that our method could detect attacks with 99.82% accuracy and 0 FPR, which saved 14.32% in time costs and improved the accuracy by 1.60% compared to VGG16 when processing 512 messages.

Keywords:

hybrid automotive in-vehicle network; IDS; AE; CAN; Swin Transformer; 2D DWT

1. Introduction

As software-defined vehicles (SDVs) move from concept to reality, modern automobiles are evolving into sophisticated computer systems, integrating complex network communication, both internally and externally. In-vehicle network systems serve as crucial interfaces connecting diverse subsystems within the vehicles and facilitating interactions with the external environment. However, the safety and reliability requirements need to be carefully addressed for communication within automotive in-vehicle networks to ensure the quality of vehicles.

Functional safety, as defined by ISO 26262 [1], addresses the safety needs associated with the electrical and electronic systems within vehicles. This standard emphasizes hazard and risk assessment, demanding the implementation of safety measures designed to mitigate to acceptable levels the risks associated with system failures. It underscores the importance of identifying potential hazards early in the design process, ensuring that system responses to failures are both timely and accurate, to prevent harm.

Popularly used communication protocols within automotive in-vehicle networks include automotive Ethernet (AE) and controller area networks (CAN). The CAN bus, a long-standing backbone for in-vehicle communications, facilitates real-time data exchange between various electronic control units (ECUs), supporting essential vehicle functions from engine management systems to brake control systems. Its simplicity and robustness have made it a standard. However, its lack of encryption and authentication exposes it to cybersecurity risks, potentially compromising vehicle safety. On the other hand, AE caters to the increasing demand for higher bandwidth and faster data transmission rates required by modern vehicles’ advanced features, such as infotainment systems and advanced driver assistance systems (ADAS). Despite its advantages, AE inherits the vulnerabilities of standard Ethernet technology. It lacks security designs, such as access control. Once hackers gain internal access, AE may become vulnerable to attacks such as frame injection or denial-of-service (DoS) attacks [2]. To address the problems mentioned above, many researchers have used intrusion-detection systems (IDSs) to identify and mitigate cyber threats in these vehicular networks. The role of an IDS is to monitor network traffic for signs of suspicious activities. Extensive research has been conducted on IDSs tailored for CANs, audio/video transport protocol (AVTP), scalable service-oriented middleware over IP (SOME/IP), and generalized precision time protocol (gPTP) protocols. The following presents a summary of the pertinent research in this field.

Many scholars have researched the enhancement of integrity and confidentiality of the CAN bus. Given the CAN bus maximum data payload of only 8 bytes, implementing security countermeasures poses a significant challenge. Such countermeasures can lead to additional data processing and computational burdens, exacerbated by the bus’s inherent packet size constraints, resulting in unacceptable communication delays in in-vehicle networks with high real-time requirements. Consequently, this scenario has prompted a considerable amount of research into IDSs for the CAN bus, aiming to enhance security, without compromising on the essential real-time requirements. As early as 2013, Miller et al. [3] pioneered the field of CAN network security by introducing message rate analysis as a technique for intrusion detection, employing a straightforward analytical approach. Building on this foundation, Gmiden et al. [4] and Song et al. [5] further advanced the field by proposing methods that leveraged the time interval characteristics of CAN bus messages for intrusion detection. Additionally, Choi et al. [6] unveiled VoltageIDS, an innovative system that capitalizes on the unique electronic signal characteristics of the CAN bus for enhanced intrusion detection capabilities. While these methods have proven effective, their inability to detect unknown attacks has been a notable limitation, steering the research community toward exploring deep-learning-based solutions. Kang et al. [7] responded to this challenge by introducing a deep neural network (DNN)-based IDS that analyzes statistical features of CAN messages. Despite their promise, such deep learning approaches have encountered barriers in maintaining consistent performance and computational efficiency across varying scenarios. To overcome these obstacles, Hossain introduced a long short-term memory (LSTM)-based IDS that aimed to address the shortcomings of earlier methods. Following this, Seo et al. [8] proposed a novel generative adversarial network (GAN)-based system focusing on the sequence features of the identifier, and Gao et al. [9] developed a hybrid intrusion detection approach that combined empirical knowledge structures, decisional DNA, and deep learning techniques. Consequently, significant advancements have been achieved in research focused on the CAN bus.

In recent years, the adoption of Ethernet in automotive applications has gained popularity. AE supports high-bandwidth data transmission and is well suited to the real-time transmission of large data volumes. However, as its use has expanded, security vulnerabilities have increasingly come to light. To enhance AE’s security, several scholars have developed IDSs for its various protocols. For instance, Buscemi et al. [10] introduced a machine-learning-based IDS for the gPTP protocol. Koyama et al. [11] proposed a whitelist-based IDS for the SOME/IP protocol, addressing the need for systems that can adapt to new network attacks. Alkhatib et al. [12] developed an offline IDS using a sequential deep learning model for intrusion detection, although this method lacked real-time performance analysis. Following this, Luo et al. [13] proposed a multi-layer IDS that improved accuracy and evaluated real-time performance, albeit without considering the false alarm rate. Additionally, Jeong et al. [14] created a feature generator and a convolutional neural network-based intrusion detection model for the AVTP protocol. They also explored a convolutional autoencoder-based IDS for the AVTP protocol [15], but it suffered from poor accuracy.

Recently, hybrid automotive in-vehicle networks have been increasingly used in modern gateways based on high-performance computing platforms. These networks combine multiple communication protocols (such as AE and CANs) to support various functionalities and data exchange requirements. This integration facilitates seamless information exchange through the gateway, while coping with the technological differences and operational variations inherent to the real-time operational and data processing requirements of various network types. As illustrated above, the IDSs proposed by these scholars addressed the security concerns of a single network protocol. However, traditional IDSs designed for specific protocols may not be directly applicable or effective within this multifaceted environment. As far as we know, there has been little research addressing the concerns of hybrid automotive in-vehicle networks. In terms of current research, only Han et al. [16] proposed a DNN model based TOW-IDS for hybrid automotive networks, in which they disclosed a dataset that contained the traffic of three protocols (AVTP, gPTP, CAN) and encompassed five distinct attack scenarios: (i) frame injection attack, (ii) gPTP synchronization attack, (iii) switch attack (media access control (MAC) flooding), (iv) CAN DoS attack, and (v) replay attack. Their efforts are commendable, but it is noteworthy that the issue of false positives remained unaddressed.

Given this, we propose an IDS based on the Swin Transformer model [17], and the dataset used was the one proposed by [16]. This approach can adapt to the characteristics of different network protocols and differences in message lengths to effectively identify and respond to a wide range of potential security threats, without significantly affecting the real-time data transmission of hybrid automotive in-vehicle networks. Specifically, our main contributions can be summarized as follows:

(1): To ensure the real-time capability of the IDS, we propose a preprocessing method involving packet imaging and 2D DWT compression of images. We validated the scalability of the model using images of varying resolutions. Experimental results demonstrated an 80 ms reduction in response time after compression.
(2): We propose a Swin Transformer-based intrusion detection method for AE and CAN hybrid automotive in-vehicle networks. The experimental results indicated a detection accuracy of up to 99% starting from a 64 × 64 image size, with a false-positive rate (FPR) of 0 at a 512 × 512 image size, thereby enhancing the security of hybrid automotive in-vehicle networks.
(3): To comprehensively validate the model’s effectiveness, we compared it with classical network models, including ResNet18, VGG16, and Inception. Our evaluation encompassed various metrics, including accuracy, precision, F1 score, recall, and FPR.

The rest of this paper is organized as follows: Section 2 reviews AE and CAN network fundamentals and provides an example to explain the types of attacks considered in this study. In Section 3, we introduce the Swin-Transformer-based intrusion detection method. A detailed implementation description and evaluation follow this in Section 4. Finally, Section 5 provides a summary of our work.

2. Background

2.1. Automotive Ethernet (AE)

AE was derived from the Ethernet protocol in the office-IT world, which references the IEEE 802.3u standard series [18].

AE operates in full-duplex mode through a pair of twisted copper wires. AE, as such, has a more straightforward hardware setup compared to its office-IT Ethernet counterpart, while maintaining a reasonably high speed.

More specifically, AE is referenced as T1 in the IEEE 802.3 series standards. The standards list several variants for AE: 100BASE-T1 for fast Ethernet in standard IEEE 802.3bw, 1000BASE-T1 for gigabit Ethernet in standard IEEE 802.3bp, and 10BASE-T1 for extended-range Ethernet in standard IEEE 802.3cg [19,20].

AE also fits well into the seven-layer OSI network architecture [21]. AE covers the physical layer and the data link layer (Figure 1). Due to its similarity to its office-IT counterpart, AE makes it possible to port a wide variety of existing upper-layer communication protocols on top of Ethernet. In AE stacks, the upper three OSI layers are collectively treated as the application layer, which is tailored to specific automotive communication requirements through protocol pairing.

2.2. Controller Area Network (CAN)

A CAN is a vehicle bus standard designed to allow microcontrollers and devices to communicate with each other in applications without a host computer. The protocol was officially released in 1986 at the Society of Automotive Engineers (SAE) conference in Detroit, Michigan. The latest version of this protocol was published in 1992 as CAN 2.0. This specification has two parts: part A is for the standard format with an 11-bit identifier (ID), and part B is for the extended format with a 29-bit ID. The CAN protocol uses the carrier sense multiple access with collision detection (CSMA/CD) access control method. CSMA/CD is decentralized, reliable, and priority-driven: whenever several messages contend for the bus, the highest-priority message is always transmitted first. A CAN allows data communication in the form of short messages over a serial bus, as shown in Figure 2. Operating on a bus topology, it facilitates efficient data exchange among ECUs or nodes, which cover various aspects of the vehicle, such as steering, revolutions per minute (RPM), and engine status, and which transmit messages including status updates, commands, alarms, and sensor data.
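The priority-driven bus access described above can be illustrated with a minimal sketch of bitwise arbitration. It assumes the usual wired-AND behaviour of the CAN bus (a dominant 0 overrides a recessive 1) and 11-bit standard identifiers; the function name `arbitrate` and the example IDs are illustrative only and are not taken from the authors' implementation.

```python
def arbitrate(ids, id_bits=11):
    """Return the identifier that wins bitwise arbitration on a wired-AND bus."""
    contenders = set(ids)
    for bit in range(id_bits - 1, -1, -1):  # the most significant bit is sent first
        # The bus carries a dominant 0 if any contender transmits 0 at this bit.
        bus_level = min((i >> bit) & 1 for i in contenders)
        # Nodes that sent a recessive 1 but read back a dominant 0 stop transmitting.
        contenders = {i for i in contenders if (i >> bit) & 1 == bus_level}
    (winner,) = contenders
    return winner


# Hypothetical contention between two normal IDs and an all-zero ID.
winner = arbitrate([0x2A1, 0x3B4, 0x000])
print(hex(winner))
```

Because the numerically smallest identifier always wins, a node that floods the bus with ID 0x000 can starve every other sender, which is the mechanism behind the CAN DoS attack considered later in the paper.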

2.3. Use Case Study: Attacks on Hybrid Automotive In-Vehicle Networks

The trend in the automotive industry is from traditional mechanical systems to heavily relying on software systems to control and manage various functions, in which software plays a crucial role in defining the vehicle’s behavior, capabilities, and user experience. In this paper, we present a use case study to show various attacks on hybrid automotive in-vehicle networks. Figure 3 shows an example of a structure with two high-performance computers and four Zone-ECUs.

When a hacker perpetrates a malicious attack on a vehicle network system through either remote means (Wi-Fi, Bluetooth, and APP) or physical connections (on-board diagnostics (OBD), T-BOX) [22], seizing control of one of the ECUs (referred to as a malicious ECU in this study), various possible attacks may ensue within one of the four area controllers. Figure 4 illustrates the potential attacks, including gateways utilizing AE and CAN protocols.

According to [16], there are five types of attacks: (i) Frame injection attacks corrupt the MPEG video stream by inserting malicious frames, extracting the header “47 40” of a normal MPEG frame, and then adding false data into the normal output, resulting in distorted or lost images. (ii) A gPTP synchronization attack is carried out by injecting incorrect or manipulated synchronization messages into the network. The hacker corrupts the accuracy of the timing between the master and slave devices, for example, between an audio/video bridging (AVB) transmitter and a listener, during the initial synchronization process. (iii) A switch attack (MAC flooding) involves sending a large number of packets with randomly generated MAC and IP addresses (shown as AAAA-AAAA-AAAA-AAAA in the figure) to a switch. This action overflows the MAC table buffer, resulting in the loss of switch functionality. (iv) A CAN DoS attack is executed by the compromised node (ECU 4), which seizes the bus resources by sending a large number of high-priority invalid or false CAN messages (ID = 0x000) to the bus. This causes delays or loss of normal messages. (v) A CAN replay attack aims to cause the vehicle to perform specific actions such as unlocking and accelerating. The malicious node (ECU 7) re-sends previously captured normal CAN traffic; for example, the hacker sends the captured unlock messages (ID = 0x834, 0x2A1, 0x3B4) to the CAN again, as shown in the figure.

3. Proposed Method

This section presents the methodology used for implementing Swin-Transformer-based IDS for AE and CANs and its experiment analysis. The method comprises three key steps: (i) extraction of AE and CAN packets, (ii) normalization of the raw image data and compression using 2D DWT to generate the final image dataset, and (iii) intrusion detection and identification using the Swin Transformer deep learning algorithm. An overview of the proposed intrusion detection approach is outlined in Figure 5. Section 4.1 provides a brief overview of the dataset design and the experimental environment. Subsequently, the experimental results are presented, followed by a detailed analysis of the model’s performance based on comparative experiments.

3.1. Data Extraction

3.1.1. AE, CAN, and User Datagram Protocol (UDP) Message Formats

To enhance comprehension of the message traffic within the original dataset (comprising AE and UDP, where CAN messages were converted to UDP), the following presents a detailed description of the AE, CAN, and UDP message formats.

AE is based on the TCP/IP network model. When data are received at the application layer, they are assigned with a UDP header when passing through the transport layer, an IP header when passing through the network layer, and then MAC address and other information when passing through the data link layer. Finally, they are converted into a binary data stream by the physical chip to facilitate data interaction between the sender and the receiver. The complete structure of the AE packet is illustrated in Figure 6.

The frame contains destination and source MAC addresses, each occupying 6 bytes. Additionally, the frame contains a 4-byte VLAN tag, which partitions the LAN into virtual networks and indicates priority and network segment information. The 2-byte EtherType field specifies the protocol carried in the frame, which involved the AVTP (0x22F0) and gPTP (0x88F7) protocols in this study. The data segment comprises a UDP header, IP header, and data field, and occupies between 46 and 1500 bytes. Finally, the 4-byte frame check sequence (FCS) is used to verify the integrity of the frame during transmission.
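The header layout described above can be sketched as a small parser. This is only an illustration under stated assumptions: the VLAN tag is taken to be a standard IEEE 802.1Q tag (TPID 0x8100 followed by a 16-bit tag control field), and the function name, the dictionary keys, and the example frame bytes are hypothetical rather than drawn from the dataset.

```python
import struct

# EtherType values named in the text, plus IPv4 for the CAN-over-UDP traffic.
ETHERTYPES = {0x22F0: "AVTP", 0x88F7: "gPTP", 0x0800: "IPv4"}


def parse_ethernet_header(frame: bytes) -> dict:
    """Split an Ethernet frame into MAC addresses, optional VLAN tag, and EtherType."""
    dst, src, tpid = struct.unpack("!6s6sH", frame[:14])
    info = {"dst": dst.hex(":"), "src": src.hex(":"), "vlan": None}
    offset = 14
    if tpid == 0x8100:  # an 802.1Q tag is present
        tci, ethertype = struct.unpack("!HH", frame[14:18])
        # The top 3 bits of the TCI carry the priority, the low 12 bits the VLAN ID.
        info["vlan"] = {"priority": tci >> 13, "vid": tci & 0x0FFF}
        offset = 18
    else:  # untagged frame: the field just read is the EtherType itself
        ethertype = tpid
    info["ethertype"] = ETHERTYPES.get(ethertype, hex(ethertype))
    info["payload_offset"] = offset
    return info


# Hypothetical tagged frame: priority 3, VLAN 2, EtherType 0x22F0, 46-byte payload.
example = bytes.fromhex("0180c200000e" "001122334455" "8100" "6002" "22f0") + bytes(46)
print(parse_ethernet_header(example))
```

A parser of this kind is one way to separate AVTP, gPTP, and UDP-encapsulated CAN traffic before the bytes are arranged into images.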

In the CAN 2.0B protocol [23], CAN messages are classified into four types: data frames, remote frames, error frames, and overload frames. Typically, data frames are utilized to transmit actual data. These frames are further divided into standard and extended frames, depending on the number of bits in the identifier used. Figure 7 illustrates the CAN standard data frame format comprising seven key parts.

First, there is the 1-bit start of frame (SOF) identifier, signaling the frame’s initiation. Following this is the 11-bit identifier (ID), where a smaller ID indicates a higher priority. The IDE bit determines whether the standard or extended frame format is employed; setting the IDE bit to 0 (dominant) indicates the standard frame format with an 11-bit ID, whereas setting it to 1 (recessive) indicates the extended frame format with a 29-bit ID. Additionally, there is a 1-bit remote transmission request (RTR) flag, indicating whether the frame is a remote frame, along with a 6-bit control field, a 64-bit data field, and a 16-bit cyclic redundancy check (CRC) field used for data integrity verification. Lastly, there is a 2-bit acknowledgment (ACK) field and a 7-bit end-of-frame field to signify the frame’s conclusion. Only standard frames are covered in this study.

To enable the transmission of CAN messages over AE, Han and his team converted CAN to UDP. UDP provides basic services like multiplexing, demultiplexing, and error detection atop the datagram services of IP. The complete format is illustrated in Figure 8.

UDP datagrams consist of a header section and a user data section [24]. The UDP header is 8 bytes in size, with the source and destination port number fields, the length field, and the checksum field each occupying 2 bytes. The data payload (CAN message) follows. When a UDP checksum is required, a pseudo-header is prepended to the UDP header for the checksum calculation, comprising 4 bytes of source IP address, 4 bytes of destination IP address, 1 byte of all zeros, 1 byte of protocol number, and finally 2 bytes of UDP length.
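As a worked illustration of the 12-byte pseudo-header, the sketch below computes the standard UDP checksum over IPv4 (16-bit one's-complement sum, protocol number 17). The IP addresses, port numbers, and 8-byte payload are hypothetical placeholders; they are not values from the dataset.

```python
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum, padding odd-length input with a zero byte."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return total


def udp_checksum(src_ip: bytes, dst_ip: bytes, udp_segment: bytes) -> int:
    """Checksum over pseudo-header + UDP segment (checksum field set to zero)."""
    # Pseudo-header: source IP, destination IP, zero byte, protocol 17, UDP length.
    pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, 17, len(udp_segment))
    checksum = ~ones_complement_sum(pseudo + udp_segment) & 0xFFFF
    return checksum or 0xFFFF  # a computed zero is transmitted as 0xFFFF


# Hypothetical CAN-over-UDP datagram: 8-byte header plus 8-byte payload.
src, dst = bytes([192, 168, 0, 1]), bytes([192, 168, 0, 2])
payload = bytes.fromhex("0011223344556677")
header = struct.pack("!HHHH", 50000, 50001, 8 + len(payload), 0)
checksum = udp_checksum(src, dst, header + payload)
datagram = header[:6] + struct.pack("!H", checksum) + payload
print(hex(checksum))
```

A receiver can validate the datagram by summing the pseudo-header and the full datagram, including the transmitted checksum; a result of 0xFFFF indicates that no error was detected.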

3.1.2. Introduction to the Dataset

The IDS utilizes an AE and CAN dataset comprising two pcap files (a file format for storing raw data packets transmitted across a network) for training and testing. Additionally, the dataset includes two csv files (a file format for storing tabular data in plain text form) assigning labels to each packet in the pcaps. Six labels denote various packet types: ‘normal’ (unattacked), ‘P_I’ (PTP synchronization attack), ‘M_F’ (switch MAC flooding attack), ‘F_I’ (frame injection attack), ‘C_D’ (CAN DoS attack), and ‘C_R’ (CAN replay attack). We utilized the Wireshark parser pyshark and the packet grabber tshark, both accessed through Python packages, to grab and parse the pcap files. Table 1 outlines the packet distribution for each label in both datasets, with the training set containing 1,196,737 packets and the test set containing 791,611 packets.

The content contained in the packet flow file is illustrated in Figure 9. The CAN message numbered 162,590, highlighted in grey within the figure, was already represented in UDP format within the pcap file. This message is utilized as an illustrative example. The C_D attack message (annotated by the csv file) incorporates all the UDP content depicted in Figure 8, alongside timestamp and additional information.

3.2. Data Pre-Processing

Each AE protocol carries data of varying lengths, such as AVTP with 434 bytes, gPTP with 60–90 bytes, and CAN-to-UDP packets with 60 bytes. To prepare the dataset for training the Swin Transformer model for intrusion detection, multiple images of size N × M are generated from the network packets (raw data), where N is the number of packets and M is the packet length. If M is smaller than the selected image width, padding with zeros is applied. Conversely, if M exceeds the image width, excess bytes are discarded to match the image width. Subsequently, each byte in the packet is converted from hexadecimal to decimal notation using Equation (1) to facilitate data processing and analysis. After this step, the data values range from 0 to 255, and the N decimal-number messages of length M are formed into N × M × 3 images, allowing the extraction of the original image data.

d^{m} = h_{1}^{m} \times 16^{1} + h_{0}^{m} \times 16^{0} \quad (m = 1, 2, \dots, M),

(1)

Since the value of each byte in the network packet ranges from 0 to 255, this continuous and wide-ranging data disparity may result in gradient explosion and vanishing problems during model training. Therefore, to enhance the data processing efficiency and alleviate this issue, the data converted to decimal values are normalized through linear transformation to constrain the data between 0 and 1. Equation (2) illustrates the normalization process.

\mathrm{Data}_{\mathrm{normalization}} = \frac{d_{n}^{m}}{255} \quad (m = 1, 2, \dots, M), (n = 1, 2, \dots, N)

(2)
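The conversion and normalization in Equations (1) and (2) can be sketched in a few lines. This is a minimal single-channel illustration, assuming each packet is available as a byte string and that every row is truncated or zero-padded to a fixed width; the function names and the sample bytes are hypothetical, and the construction of the three colour channels is not reproduced here.

```python
def hex_byte_to_decimal(hex_pair: str) -> int:
    """Equation (1): combine the high and low hex digits of one byte."""
    h1, h0 = (int(ch, 16) for ch in hex_pair)
    return h1 * 16**1 + h0 * 16**0


def packets_to_image(packets, width):
    """Stack N packets into an N x width grid of values normalized to [0, 1]."""
    rows = []
    for pkt in packets:
        row = list(pkt[:width])           # discard bytes beyond the image width
        row += [0] * (width - len(row))   # zero-pad packets shorter than the width
        rows.append([value / 255 for value in row])  # Equation (2)
    return rows


# Hypothetical example: a 2-byte packet and a 6-byte packet mapped to width 4.
image = packets_to_image([bytes.fromhex("4740"), bytes(range(6))], width=4)
print(hex_byte_to_decimal("47"), image[0])
```

Iterating over a Python `bytes` object already yields integers from 0 to 255, so the explicit hexadecimal conversion is only needed when packets are read as hex strings.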

Finally, considering the real-time demands and computational constraints in automotive intrusion detection, we address the challenge posed by the large parameter counts and model sizes resulting from extensive input data. To manage this, we employ 2D DWT for image compression (as illustrated in [16,25,26], the 2D DWT preserves the core information of the image more effectively than the 1D DWT and is more computationally resource-efficient than the continuous wavelet transform). This method effectively compresses data while preserving essential image information by decomposing the input image into four sub-bands: low–low (LL), low–high (LH), high–low (HL), and high–high (HH). LL approximates the input image at approximately 1/4 of its original size, while LH, HL, and HH capture horizontal, vertical, and diagonal features, respectively. To optimize computational resources, we utilize three different wavelet filters (level 1 decomposition: Coiflet 1; level 2 decomposition: Daubechies 3; and level 3 decomposition: Reverse biorthogonal 1.3). The LL sub-bands generated by these filters are combined to produce RGB images corresponding to the network packets, as illustrated in Figure 10. These images serve as inputs to the Swin Transformer model during the training, validation, and testing phases of intrusion detection.
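To make the sub-band decomposition concrete, the following sketch performs a single-level 2D DWT in pure Python. Note the substitution: it uses the Haar wavelet for brevity, not the Coiflet 1, Daubechies 3, or Reverse biorthogonal 1.3 filters used in the paper, and it assumes an image with even height and width. The function name and the sample image are illustrative only.

```python
def haar_dwt2(image):
    """One-level 2D Haar DWT of an even-sized 2D list; returns LL, LH, HL, HH."""
    rows, cols = len(image), len(image[0])
    ll, lh, hl, hh = ([[0.0] * (cols // 2) for _ in range(rows // 2)] for _ in range(4))
    for i in range(0, rows, 2):
        for j in range(0, cols, 2):
            a, b = image[i][j], image[i][j + 1]          # top row of the 2x2 block
            c, d = image[i + 1][j], image[i + 1][j + 1]  # bottom row of the block
            r, s = i // 2, j // 2
            ll[r][s] = (a + b + c + d) / 2  # scaled local average (approximation)
            lh[r][s] = (a + b - c - d) / 2  # top minus bottom: horizontal edges
            hl[r][s] = (a - b + c - d) / 2  # left minus right: vertical edges
            hh[r][s] = (a - b - c + d) / 2  # diagonal differences
    return ll, lh, hl, hh


# Hypothetical 4x4 normalized image; each sub-band comes out as 2x2.
sample = [
    [1.0, 1.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
    [0.2, 0.2, 0.2, 0.2],
    [0.2, 0.2, 0.2, 0.2],
]
ll, lh, hl, hh = haar_dwt2(sample)
print(ll, lh)
```

Keeping only the LL sub-band halves each dimension, which is where the roughly fourfold size reduction comes from. In practice a library such as PyWavelets would be a natural choice, for example `pywt.dwt2` with the wavelet names `coif1`, `db3`, and `rbio1.3`, although the source does not state which library was used.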

This preprocessing approach enhances the computational efficiency, while ensuring a robust representation of the input data, enabling the IDS to respond promptly and accurately to complex network traffic.

3.3. Model Architecture

In this study, we employ the Swin Transformer [17] as the principal model for intrusion detection using image datasets derived from AE message traffic. Given the real-time demands and computational limitations inherent in the in-vehicle network environment, the structural attributes of the Swin Transformer are notably well-suited. The model refines the conventional Transformer by integrating a self-attention mechanism based on shifted windows, which adeptly addresses global features in large-scale images, while also mitigating computational complexity.

The Swin Transformer, illustrated in Figure 11, starts with a convolutional layer for the initial feature extraction. This is followed by a linear embedding layer, or fully connected layer, which prepares the data for the model’s core: multiple Swin Transformer blocks (Swin Blocks, in Figure 12). Each Swin Block is built from four essential modules: layer normalization (LN) for input standardization; windowed multi-head self-attention (W-MSA) and its variant, shifted window multi-head self-attention (SW-MSA), for focused and comprehensive context gathering; and a multi-layer perceptron (MLP) for complex feature extraction. Patch merging layers between stages consolidate the data, and a global adaptive pooling layer at the end adapts to different input sizes.

In this study, we converted AE traffic messages into RGB images, serving as inputs to the Swin Transformer. This transformation encodes complex traffic data into a visual format, highlighting essential traffic features and potential anomaly patterns. The hierarchical structure of the Swin Transformer is particularly well suited to this data type because it filters out redundant features, while retaining critical information.

The model processing begins with the patch partition stage, where the input RGB image is divided into non-overlapping 16×16 blocks and flattened (patch_size is 16 in the program). The image proceeds through a linear embedding layer to stage 1, comprising two Swin Blocks for feature extraction and processing. The Swin Transformer iteratively enhances its understanding of the image across multiple stages, including patch merging and Swin Blocks. Each patch merging operation reduces spatial dimensions, while increasing the depth and refining and enriching the feature map.
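The effect of patch partitioning and patch merging on the feature-map size can be traced with simple arithmetic. The sketch below takes `patch_size = 16` from the text; the embedding dimension of 96 and the four-stage layout are assumptions borrowed from the commonly used Swin-T configuration, since the source does not state them, and the function name is hypothetical. It relies on the standard Swin behaviour that each patch merging step halves the height and width and doubles the channel depth.

```python
def swin_stage_shapes(image_size, patch_size=16, embed_dim=96, num_stages=4):
    """Return (stage, height, width, channels) of the token grid at each stage."""
    side = image_size // patch_size  # tokens per side after patch partition
    dim = embed_dim                  # assumed embedding dimension (Swin-T default)
    shapes = []
    for stage in range(1, num_stages + 1):
        if stage > 1:
            if side < 2:             # nothing left to merge for small inputs
                break
            side //= 2               # patch merging groups 2x2 neighbouring tokens
            dim *= 2                 # and projects them to twice the depth
        shapes.append((stage, side, side, dim))
    return shapes


# A 512 x 512 image yields a 32 x 32 grid of 16 x 16 patches (1024 tokens).
for shape in swin_stage_shapes(512):
    print(shape)
```

Under these assumptions, each flattened RGB patch holds 16 × 16 × 3 = 768 values before the linear embedding, and a 32 × 32 input supports fewer merging steps than a 512 × 512 one.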

In the final classification stage, the feature maps processed by the Swin Transformer are classified through the fully connected layer to determine whether the traffic is abnormal or not. This entire process not only improves the efficiency of the model in handling complex network traffic but also ensures that key features are effectively captured, thus enabling the IDS to respond quickly and accurately in the face of complex hybrid automotive in-vehicle network environments.

4. Experimental Results

4.1. Experimental Setup

The dataset comprises compressed images of consecutive packets (N × M), where each image is flagged as anomalous if any of its N packets is anomalous. Due to the varying packet lengths across automotive network protocols, image data of different resolutions (32 × 32, 64 × 64, 128 × 128, 256 × 256, and 512 × 512) were utilized for multi-scale detection, testing the model’s generalization ability. Table 2 outlines the number of normal and abnormal images in the final dataset for each resolution. The experiment was conducted on a device equipped with an Intel 4790K CPU, 32 GB RAM, and an NVIDIA 4070 RTX GPU. The TensorFlow library was employed for implementing the deep learning algorithms. The dataset was divided into training and test sets in an 8:2 ratio, with model parameters set to 100 training epochs and a batch size of 10.
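The image-level labelling rule and the 8:2 split can be sketched as follows. This is an illustration under assumptions the source does not confirm: windows are taken to be non-overlapping, a trailing partial window is dropped, and the split is done in order without shuffling. The function names are hypothetical; the label strings follow the dataset labels described earlier.

```python
def label_windows(packet_labels, n):
    """Label each window of n consecutive packets: 1 if any packet is an attack."""
    labels = []
    for start in range(0, len(packet_labels) - n + 1, n):
        window = packet_labels[start:start + n]
        labels.append(int(any(label != "normal" for label in window)))
    return labels


def split_80_20(items):
    """Split a sequence into the first 80% and the remaining 20%."""
    cut = int(len(items) * 0.8)
    return items[:cut], items[cut:]


# Hypothetical label stream: one CAN DoS packet inside the second window of 4.
stream = ["normal"] * 4 + ["normal", "C_D", "normal", "normal"] + ["normal"] * 3
print(label_windows(stream, n=4))
train, test = split_80_20(list(range(10)))
print(len(train), len(test))
```

With this rule, a larger N makes it more likely that a window contains at least one attack packet, which affects the balance between normal and abnormal images at each resolution.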

4.2. Results of the Proposed Method and Re-Implemented Existing Methods

We constructed a confusion matrix for intrusion detection to assess the proposed IDS model’s detection performance. True positive (TP) represents the correctly predicted attacks, while false positive (FP) denotes normal instances incorrectly classified as attacks. Conversely, true negative (TN) signifies accurately predicted normal instances, and false negative (FN) indicates attacks misclassified as normal. Utilizing this confusion matrix, we evaluated the IDS performance using metrics including accuracy, precision, recall, false negative rate (FNR), and false positive rate (FPR). FNR and FPR are crucial for ensuring the model’s utility and reliability. The evaluation metrics are summarized in Equations (3)–(8).

\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}

(3)

\mathrm{Recall} = \frac{TP}{TP + FN}

(4)

\mathrm{Precision} = \frac{TP}{TP + FP}

(5)

\mathrm{FPR} = \frac{FP}{FP + TN}

(6)

\mathrm{FNR} = \frac{FN}{TP + FN}

(7)

Furthermore, the F1 score, a harmonic mean of precision and recall, offers a more comprehensive evaluation of model performance than accuracy alone, particularly in scenarios with imbalanced category distributions. The F1 score was calculated using the formula below:

\mathrm{F1\text{-}Score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

(8)
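Equations (3)–(8) translate directly into code. The sketch below computes all six metrics from confusion-matrix counts; the counts in the example are invented for illustration and are not results from the paper, and the function assumes that no denominator is zero.

```python
def ids_metrics(tp, fp, tn, fn):
    """Compute the metrics of Equations (3)-(8) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    fpr = fp / (fp + tn)
    fnr = fn / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "fpr": fpr,
        "fnr": fnr,
        "f1": f1,
    }


# Hypothetical counts: no false positives, ten missed attacks.
for name, value in ids_metrics(tp=90, fp=0, tn=100, fn=10).items():
    print(f"{name}: {value:.4f}")
```

The example shows that an FPR of 0 only requires FP = 0; it says nothing about missed attacks, which is why FNR and recall are reported alongside it.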

As can be learned from Table 3, Swin Transformer consistently achieved a detection precision and accuracy above 82% across the various resolutions. With increasing resolution, there was a corresponding increase in the scores of each evaluation metric and a decrease in the FPR. This trend was attributed to the higher resolution enabling coverage of more messages per detection, inclusion of more time-step information in the image, and better representation of the relationships between messages, thereby allowing the model to capture more features. Notably, at a resolution of 64 × 64, an accuracy of 99% was attained. Particularly impressive was the performance at a resolution of 512 × 512, where the detection accuracy reached an exceptional 99.82%, with an FPR of 0. This underscored the generalization ability of the Swin Transformer model, especially in reliably detecting anomalous network traffic in AE and CAN.

However, it is noteworthy that the performance metrics were higher at 64 × 64 compared to 128 × 128, potentially due to the introduction of unnecessary noise with a message length of 128, without providing effective features. Additionally, experiments conducted on the dataset without 2D DWT, forming images after converting the raw message data to decimal numbers, resulted in a decreased accuracy and other metrics. The response time increased by approximately 80 ms for an image size of 512 × 512 (as shown in Figure 13), which is about five times the time cost of the method proposed in this study. This substantiated that our method enhanced the efficiency while maintaining detection accuracy.

4.3. Summary of Performance Comparison

To comprehensively evaluate the proposed method, we conducted comparison experiments with several classical networks: ResNet18 [27], VGG16 [28], and Inception [29]. These models used the same dataset partitioning ratio (80% for the training dataset, 20% for the testing dataset), number of epochs (100), and batch size (10) as the Swin Transformer model, and they were run on the same hardware devices. Figure 14 illustrates the trend in accuracy for each model across varying resolutions: as the resolution increased, the accuracy of each model improved. Specifically, at a resolution of 128 × 128, all networks achieved an accuracy exceeding 93%, and the Swin Transformer reached its highest accuracy of 99.82% at 512 × 512 (Table 3). Overall, our chosen model outperformed the others.

Moreover, at an image size of 64 × 64, the accuracy of our method already exceeded 99%, further supporting its suitability for detecting anomalous network traffic. The results in Table 4 show that the model performed best when the image size was 512 × 512: its accuracy was 1.64 percentage points higher than that of VGG16, the best-performing baseline, and its F1 score, recall, and precision all exceeded 0.99. This underscores its strong performance in the anomaly detection task.

Of particular significance, the FPR of our method was 0 at an image size of 512 × 512. A low FPR reduces the cost and burden of dealing with false alarms and allows a greater focus on real threats. This serves as additional evidence of the reliability of our method in real-world scenarios.

5. Conclusions

With the development of SDVs, designing an IDS with high accuracy, real-time performance, and reliability for hybrid automotive in-vehicle networks remains an open research problem. To achieve a high level of reliability and safety in today’s automobiles, it is imperative to address the security threats to hybrid automotive in-vehicle networks consisting of AE and CANs. Security research on such networks is currently at a preliminary stage, and most existing IDSs focus on detecting attacks on a single protocol. In this study, we proposed a novel IDS to address the problems of a high FPR, slow real-time response, and low accuracy. The proposed method is based on the Swin Transformer and covers three protocols: AVTP, gPTP, and UDP (CAN). In Step 1, N messages of length M in the packet are converted into an image, preserving their spatial features. In Step 2, to save time in model training and detection, the original image data are compressed using 2D DWT, which reduces the parameters while preserving the core information. Finally, the processed image is input into the Swin Transformer for training and detection. To demonstrate the advantages of this model, we selected the classical image recognition and classification models ResNet18, VGG16, and Inception for comparison experiments. Our method achieved a detection accuracy of 99.82% and an FPR of 0, with an F1 score and recall above 99%. In addition, 2D DWT compression yielded significant time savings, reducing the response time for detecting 512 messages from 100.925 ms to 19.163 ms, roughly a fivefold reduction.

Notably, the experimental results of this study may have been limited by the characteristics of the datasets and model parameters used. Therefore, to better adapt to real-world application scenarios, future research needs to validate the method on a wider range of datasets and further explore the optimization and lightweighting of model parameters. In addition, considering the continuous evolution of the network environment of autonomous vehicles, continuous research and updating of intrusion detection systems to adapt to new security threats and technological changes will also be an important direction for future work.
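Step 1 of the method summarized above, stacking N messages of length M into an image, might be sketched as follows. The zero-padding and truncation rules, the 8-bit grayscale encoding, and the function name `messages_to_image` are assumptions made for illustration; they are not specified in this section.

```python
import numpy as np


def messages_to_image(messages, length):
    """Stack N byte strings into an N x length grayscale image.

    Each message fills one row, one byte per pixel. Rows shorter than
    `length` are zero-padded and longer ones are truncated (both rules
    are assumptions of this sketch).
    """
    image = np.zeros((len(messages), length), dtype=np.uint8)
    for row, message in enumerate(messages):
        chunk = np.frombuffer(bytes(message[:length]), dtype=np.uint8)
        image[row, :chunk.size] = chunk
    return image


if __name__ == "__main__":
    # Hypothetical payloads; real inputs would be captured AE/CAN messages.
    demo = [b"\x01\x02\x03", b"\xff", b"\x10\x20\x30\x40\x50"]
    print(messages_to_image(demo, length=4))
```

With N = M = 512, this yields the 512 × 512 input size used in the comparison experiments, which the 2D DWT of Step 2 would then compress before classification.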

Author Contributions

Conceptualization, S.W.; methodology, S.W. and H.Z. (Hainan Zhou); software, S.W., H.Z. (Haihang Zhao) and H.Z. (Hainan Zhou); validation, S.W.; investigation, H.Z. (Hainan Zhou), S.W. and J.W.; resources, A.C.; data curation, H.Z. (Hainan Zhou) and J.W.; writing—original draft preparation, S.W. and Y.W.; writing—review and editing, S.W. and Y.W.; visualization, S.W. and H.Z. (Haihang Zhao); supervision, A.C. and Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. ISO 26262; Road Vehicles—Functional Safety. International Organization for Standardization (ISO): Geneva, Switzerland, 2011.
2. Mukherjee, S.; Shirazi, H.; Ray, I.; Daily, J.; Gamble, R.F. Practical DoS Attacks on Embedded Networks in Commercial Vehicles. In Proceedings of the 12th International Conference (ICISS 2016), Jaipur, India, 16–20 December 2016; pp. 23–42.
3. Miller, C.; Valasek, C. Adventures in automotive networks and control units. Def Con 2013, 21, 15–31.
4. Gmiden, M.; Gmiden, M.H.; Trabelsi, H. An intrusion detection method for securing in-vehicle CAN bus. In Proceedings of the 2016 17th International Conference on Sciences and Techniques of Automatic Control and Computer Engineering (STA), Sousse, Tunisia, 19–21 December 2016; pp. 176–180.
5. Song, H.M.; Kim, H.R.; Kim, H.K. Intrusion detection system based on the analysis of time intervals of CAN messages for in-vehicle network. In Proceedings of the 2016 International Conference on Information Networking (ICOIN 2016), Kota Kinabalu, Malaysia, 13–15 January 2016; pp. 63–68.
6. Choi, W.; Joo, K.; Jo, H.J.; Park, M.C.; Lee, D.H. VoltageIDS: Low-Level Communication Characteristics for Automotive Intrusion Detection System. IEEE Trans. Inf. Forensics Secur. 2018, 13, 2114–2129.
7. Kang, M.J.; Kang, J.W. Intrusion detection system using deep neural network for in-vehicle network security. PLoS ONE 2016, 11, e0155781.
8. Seo, E.; Song, H.M.; Kim, H.K. GIDS: GAN based Intrusion Detection System for In-Vehicle Network. In Proceedings of the 2018 16th Annual Conference on Privacy, Security and Trust (PST), Belfast, Ireland, 28–30 August 2018.
9. Gao, L.; Li, F.; Xu, X.; Liu, Y. Intrusion detection system using SOEKS and deep learning for in-vehicle security. Clust. Comput. 2019, 22, 14721–14729.
10. Buscemi, A.; Ponaka, M.; Fotouhi, M.; Jomrich, F.; Köbel, C.; Engel, T. An Intrusion Detection System Against Rogue Master Attacks on gPTP. In Proceedings of the 97th IEEE Vehicular Technology Conference, VTC Spring 2023, Florence, Italy, 20–23 June 2023; pp. 1–7.
11. Koyama, T.; Tanaka, M.; Miyajima, A.; Ukai, S.; Sugashima, T.; Egawa, M. SOME/IP Intrusion Detection System Using Real-Time and Retroactive Anomaly Detection. In Proceedings of the 95th IEEE Vehicular Technology Conference, VTC Spring 2022, Helsinki, Finland, 19–22 June 2022; pp. 1–7.
12. Alkhatib, N.; Ghauch, H.; Danger, J. SOME/IP Intrusion Detection using Deep Learning-based Sequential Models in Automotive Ethernet Networks. In Proceedings of the 2021 IEEE 12th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, BC, Canada, 27–30 October 2021.
13. Luo, F.; Yang, Z.; Zhang, Z.; Wang, Z.; Wang, B.; Wu, M. A Multi-Layer Intrusion Detection System for SOME/IP-Based In-Vehicle Network. Sensors 2023, 23, 4376.
14. Jeong, S.; Jeon, B.; Chung, B.; Kim, H.K. Convolutional neural network-based intrusion detection system for AVTP streams in automotive Ethernet-based networks. Veh. Commun. 2021, 29, 100338.
15. Alkhatib, N.; Mushtaq, M.; Ghauch, H.; Danger, J. AVTPnet: Convolutional Autoencoder for AVTP anomaly detection in Automotive Ethernet Networks. arXiv 2022, arXiv:2202.00045.
16. Han, M.L.; Kwak, B.I.; Kim, H.K. TOW-IDS: Intrusion Detection System Based on Three Overlapped Wavelets for Automotive Ethernet. IEEE Trans. Inf. Forensics Secur. 2023, 18, 411–422.
17. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002.
18. 802.3u-1995; IEEE Standards for Local and Metropolitan Area Networks: Supplement—Media Access Control (MAC) Parameters, Physical Layer, Medium Attachment Units, and Repeater for 100 Mb/s Operation, Type 100BASE-T (Clauses 21–30). IEEE: New York, NY, USA, 1995; pp. 1–415.
19. 802.3bw-2015; IEEE Standard for Ethernet Amendment 1: Physical Layer Specifications and Management Parameters for 100 Mb/s Operation over a Single Balanced Twisted Pair Cable (100BASE-T1). IEEE: New York, NY, USA, 2016; pp. 1–88.
20. Frazier, H. The 802.3z Gigabit Ethernet Standard. IEEE Netw. 1998, 12, 6–7.
21. ISO/IEC 7498-1:1994; Information Technology—Open Systems Interconnection—Basic Reference Model: The Basic Model. International Organization for Standardization (ISO): Geneva, Switzerland, 1994.
22. Rajapaksha, S.; Kalutarage, H.; Al-Kadri, M.O.; Petrovski, A.; Madzudzo, G.; Cheah, M. AI-Based Intrusion Detection Systems for In-Vehicle Networks: A Survey. ACM Comput. Surv. 2023, 55.
23. J1939_201308; Recommended Practice for a Serial Control and Communication Vehicle Network. SAE International: Warrendale, PA, USA, 2013.
24. Internet Engineering Task Force (IETF). RFC 768. 1980. Available online: https://tools.ietf.org/html/rfc768 (accessed on 28 February 2024).
25. Cheng, Y.; Lin, M.; Wu, J.; Zhu, H.; Shao, X. Intelligent fault diagnosis of rotating machinery based on continuous wavelet transform-local binary convolutional neural network. Knowl.-Based Syst. 2021, 216, 106796.
26. Nobre, J.; Neves, R.F. Combining principal component analysis, discrete wavelet transform and XGBoost to trade in the financial markets. Expert Syst. Appl. 2019, 125, 181–194.
27. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
28. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
29. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.

Figure 1. Mapping of automotive Ethernet to OSI layers.

Figure 2. Controller area network bus protocol.

Figure 3. Example: structure with two high-performance computers and four Zone-ECUs.

Figure 4. Five attack scenarios in hybrid automotive in-vehicle networks (the dataset creator does not specify the control command represented by the message data).

Figure 5. Intrusion detection process based on Swin Transformer.

Figure 6. Automotive Ethernet packet format.

Figure 7. Controller area network standard frame format.

Figure 8. Complete message format of the user datagram protocol (UDP).

Figure 9. Detailed analysis of pcap message traffic (utilizing the 'C_D' message as an exemplar).

Figure 10. Data pre-processing process.

Figure 11. Structure of the Swin Transformer model.

Figure 12. Specific components of a Swin Block.

Figure 13. Swin Transformer response time (with or without 2D DWT).

Figure 14. Accuracy of different models for multiple image sizes.

Table 1. Number of packets per type in the dataset ¹.

Attack Type	Train Dataset (Number)	Train Dataset (Ratio)	Test Dataset (Number)	Test Dataset (Ratio)
Normal	947,912	79.3%	660,777	83.5%
P_I	35,112	2.9%	16,962	2.1%
M_F	64,635	5.4%	26,013	3.3%
F_I	33,765	2.8%	16,809	2.1%
C_D	85,466	7.1%	41,203	5.2%
C_R	29,847	2.5%	29,847	3.8%
Total	1,196,737	100%	791,611	100%

¹ Based on the dataset described by Han et al. [16] of Korea University, Korea.

Table 2. Dataset with Pre-Processed Images.

Image Size	Normal Images	Abnormal Images
32 × 32	31,711	30,642
64 × 64	15,438	15,738
128 × 128	7313	8275
256 × 256	3509	4285
512 × 512	1749	2148

Table 3. Performance of Swin Transformer (with or without two-dimensional discrete wavelet transform (2D DWT)) at multiple image sizes.

Model Type	Image Size	Accuracy	F1-Score	Recall	Precision	FPR	FNR
Swin Transformer	32 × 32	0.8242	0.8237	0.8242	0.8265	0.1544	0.1949
Swin Transformer	64 × 64	0.9939	0.9933	0.9938	0.9939	0.0038	0.0083
Swin Transformer	128 × 128	0.9923	0.9917	0.9916	0.9917	0.0030	0.0142
Swin Transformer	256 × 256	0.9942	0.9939	0.9940	0.9941	0.0011	0.0141
Swin Transformer	512 × 512	0.9982	0.9974	0.9972	0.9975	0	0.0057
Swin Transformer (No DWT)	32 × 32	0.8142	0.8138	0.8141	0.8153	0.1669	0.2018
Swin Transformer (No DWT)	64 × 64	0.9931	0.9929	0.9931	0.9930	0.0052	0.0142
Swin Transformer (No DWT)	128 × 128	0.9923	0.9917	0.9916	0.9934	0.0047	0.0141
Swin Transformer (No DWT)	256 × 256	0.9936	0.9937	0.9936	0.9938	0.0012	0.0182
Swin Transformer (No DWT)	512 × 512	0.9954	0.9942	0.9957	0.9961	0.0076	0.0049

Table 4. Model comparison experiment results (image size: 512 × 512).

Model Type	Accuracy	Precision	F1 Score	Recall	FPR	Time (ms)
Swin Transformer	0.9982	0.9974	0.9975	0.9974	0	19.163
ResNet18	0.9793	0.9859	0.9802	0.9949	0.0052	27.311
VGG16	0.9818	0.9765	0.9825	0.9887	0.0517	21.907
Inception	0.9644	0.9506	0.9649	0.9499	0.0303	12.802
Swin Transformer (No DWT)	0.9961	0.9962	0.9959	0.9962	0.0029	100.925
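The headline comparisons drawn from Table 4 can be re-derived with a few lines of arithmetic. The sketch below copies the accuracy and time values from three rows of the table; the dictionary layout and the function names are hypothetical and serve only to make the calculation explicit.

```python
# Accuracy and response time (ms) copied from Table 4 (image size: 512 x 512).
TABLE_4 = {
    "Swin Transformer": {"accuracy": 0.9982, "time_ms": 19.163},
    "VGG16": {"accuracy": 0.9818, "time_ms": 21.907},
    "Swin Transformer (No DWT)": {"accuracy": 0.9961, "time_ms": 100.925},
}


def accuracy_gain_points(model: str, baseline: str) -> float:
    """Accuracy difference between two models, in percentage points."""
    return (TABLE_4[model]["accuracy"] - TABLE_4[baseline]["accuracy"]) * 100


def time_ratio(slower: str, faster: str) -> float:
    """How many times longer the slower model takes per detection."""
    return TABLE_4[slower]["time_ms"] / TABLE_4[faster]["time_ms"]


def time_saved_ms(slower: str, faster: str) -> float:
    """Absolute response-time difference in milliseconds."""
    return TABLE_4[slower]["time_ms"] - TABLE_4[faster]["time_ms"]


if __name__ == "__main__":
    gain = accuracy_gain_points("Swin Transformer", "VGG16")
    ratio = time_ratio("Swin Transformer (No DWT)", "Swin Transformer")
    saved = time_saved_ms("Swin Transformer (No DWT)", "Swin Transformer")
    print(f"accuracy gain over VGG16: {gain:.2f} percentage points")
    print(f"no-DWT / DWT time ratio: {ratio:.2f}x ({saved:.3f} ms saved)")
```

This reproduces the 1.64-point accuracy gain over VGG16 and shows that the response time without 2D DWT is a little over five times that of the proposed method, a difference of roughly 80 ms.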


© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
