Article

A Hybrid Human-AI Model for Enhanced Automated Vulnerability Scoring in Modern Vehicle Sensor Systems

by Mohamed Sayed Farghaly 1,†, Heba Kamal Aslan 1,2,† and Islam Tharwat Abdel Halim 1,2,*,†
1 School of Information Technology and Computer Science (ITCS), Nile University, Giza 12677, Egypt
2 Center for Informatics Science (CIS), Nile University, 26th of July Corridor, Sheikh Zayed 12677, Egypt
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Future Internet 2025, 17(8), 339; https://doi.org/10.3390/fi17080339
Submission received: 15 June 2025 / Revised: 20 July 2025 / Accepted: 23 July 2025 / Published: 28 July 2025

Abstract

Modern vehicles are rapidly transforming into interconnected cyber–physical systems that rely on advanced sensor technologies and pervasive connectivity to support autonomous functionality. Yet, despite this evolution, standardized methods for quantifying cybersecurity vulnerabilities across critical automotive components remain scarce. This paper introduces a novel hybrid model that integrates expert-driven insights with generative AI tools to adapt and extend the Common Vulnerability Scoring System (CVSS) specifically for autonomous vehicle sensor systems. Following a three-phase methodology, the study conducted a systematic review of 16 peer-reviewed sources (2018–2024), applied CVSS version 4.0 scoring to 15 representative attack types, and evaluated four freely accessible generative AI models—ChatGPT, DeepSeek, Gemini, and Copilot—on a dataset of 117 annotated automotive-related vulnerabilities. Expert validation from 10 domain professionals reveals that Light Detection and Ranging (LiDAR) sensors are the most vulnerable (9 distinct attack types), followed by Radio Detection And Ranging (radar) (8) and ultrasonic (6). Network-based attacks dominate (104 of 117 cases), with 92.3% of the dataset exhibiting low attack complexity and 82.9% requiring no user interaction. The most severe attack vectors, as scored by experts using CVSS, include eavesdropping (7.19), Sybil attacks (6.76), and replay attacks (6.35). Evaluation of large language models (LLMs) showed that DeepSeek achieved an F1 score of 99.07% on network-based attacks, while all models struggled with minority classes such as high complexity (e.g., ChatGPT F1 = 0%, Gemini F1 = 15.38%).
The findings highlight the potential of integrating expert insight with AI efficiency to deliver more scalable and accurate vulnerability assessments for modern vehicular systems. This study offers actionable insights for vehicle manufacturers and cybersecurity practitioners, aiming to inform strategic efforts to fortify sensor integrity, optimize network resilience, and ultimately enhance the cybersecurity posture of next-generation autonomous vehicles.

1. Introduction

The automotive industry is a cornerstone of the global economy, with major automakers such as General Motors, Ford, and Toyota contributing significantly to gross domestic product (GDP) and technological innovation. Modern vehicles have evolved into complex cyber–physical systems, integrating advanced sensors (e.g., LiDAR and radar) and connectivity frameworks like Vehicle-to-Everything (V2X) to enable autonomous driving and real-time diagnostics [1,2]. However, this digital transformation has introduced unprecedented cybersecurity risks. Studies reveal that vulnerabilities in electronic control units (ECUs) and sensor networks expose vehicles to attacks such as LiDAR spoofing, Global Positioning System (GPS) manipulation, and denial-of-service (DoS) incidents [3,4].
Several efforts have been made to address the challenge of cybersecurity risk quantification in the automotive domain. Notably, the HEAVENS 2.0 framework—refined from its original version and formalized in ISO/SAE 21434—provides a risk-oriented model for assessing threats to automotive systems [5,6]. The Threat Analysis and Risk Assessment (TARA) approach, initially proposed by Intel, has also been adapted for vehicle security assessments, offering a structured method to evaluate attack paths and potential impacts [7]. Additional strategies include attack trees, adaptations of the Common Vulnerability Scoring System (CVSS), and model-based assessments that integrate functional safety standards like ISO 26262 [8]. While these methodologies significantly contribute to the domain, recent academic and industry reviews emphasize that they vary in scope, terminology, and applicability across subsystems [9,10]. A unified and comparative framework to quantify and contrast cybersecurity risks across heterogeneous automotive components—such as sensors, electronic control units (ECUs), and communication buses—has not yet been fully realized, particularly in safety-critical domains such as collision avoidance and autonomous navigation [11].
During the past decade, some attacks have been simulated and others emulated, but real incidents affecting certain vehicles have also been recorded [12]. These recorded incidents highlight the growing vulnerabilities in automotive systems, particularly in autonomous and connected vehicles. The diverse nature of these attacks, ranging from malware injection and impersonation to jamming and Sybil attacks, underscores the need for robust security measures to safeguard critical vehicle systems. As vehicular technologies continue to evolve, it is imperative to address these security challenges through comprehensive testing, real-time monitoring, and the implementation of advanced defensive strategies to mitigate potential risks [13].
The Common Vulnerability Scoring System (CVSS) is a standardized framework for assessing and communicating the severity of security vulnerabilities [14]. Maintained by the Forum of Incident Response and Security Teams (FIRST), it evaluates vulnerabilities based on intrinsic characteristics, temporal factors, and environmental context. CVSS is widely adopted in IT and automotive cybersecurity, and its flexibility allows adaptation to other critical industries. In healthcare, CVSS has been used to assess vulnerabilities in medical Internet of Things (IoT) devices, such as insulin pumps and pacemakers, where availability and integrity directly impact patient safety [15]. In the energy sector, it aids in prioritizing vulnerabilities in industrial control systems (ICSs), including those affecting smart grid power distribution networks [16]. Financial institutions leverage CVSS to evaluate risks in fintech applications, particularly in blockchain and digital payment systems [17]. Additionally, in the IoT domain, CVSS is applied to assess risks in smart home devices, highlighting the need for environmental metrics to address context-specific threats, such as unauthorized access to security cameras [18].
However, traditional CVSS scoring methods often rely on static rules and manual interpretation, which can lead to inconsistencies, oversimplifications, or delays in rapidly evolving threat landscapes. This static nature is particularly problematic in the dynamic and safety-critical automotive domain, where highly interconnected cyber–physical systems, heterogeneous components, and the rapid pace of technological advancements introduce complex, often context-dependent, vulnerabilities that traditional CVSS struggles to accurately assess and prioritize in real time. In addition, traditional CVSS often fails to adapt to evolving threats, leading to the need for more advanced solutions. Therefore, utilizing automated solutions seems essential, also considering that they can provide an estimate of the severity [19,20].
Today, generative AI represents a transformative subset of artificial intelligence that focuses on creating new content or solutions by learning patterns from existing data. This innovation contrasts with traditional AI, which mainly deals with classification, prediction, and regression tasks [21]. In the context of cybersecurity, generative AI holds promise in improving vulnerability management practices [22]. Crucially, its capacity to analyze vast datasets, learn dynamic threat patterns, and generate context-aware insights makes it exceptionally well-suited for addressing the unique and intricate cybersecurity challenges within complex automotive systems. Generative AI can improve automated vulnerability scoring by analyzing historical data to forecast potential vulnerabilities and adjusting severity scores based on dynamic risk indicators [23]. This shift allows organizations to prioritize vulnerabilities based on actual organizational impact rather than just raw numbers, optimizing remediation efforts and resource allocation. Additionally, generative AI can assist in vulnerability detection by refining the process and reducing false positives. By analyzing service configurations and detection mechanisms, AI can improve the precision of identifying legitimate vulnerabilities. As generative AI continues to evolve, its application in cybersecurity presents an opportunity to create more resilient and efficient systems for vulnerability management, ultimately improving organizational security in an increasingly complex threat landscape.
Given the limitations of traditional CVSS in the highly complex and safety-critical autonomous vehicle domain, there is a critical need for a more dynamic, automated, and context-aware approach to vulnerability assessment. This paper addresses this gap by proposing a hybrid generative AI model for enhanced automated vulnerability scoring tailored to autonomous vehicles. Our contributions are as follows:
  • Development of a sensor-centric vulnerability taxonomy tailored to autonomous vehicles, identifying and categorizing over 15 unique attack types. The taxonomy highlights LiDAR and radar systems as the most frequently targeted components, due to their critical role in autonomous perception and control.
  • Application of generative AI models for context-aware vulnerability scoring to evaluate and utilize state-of-the-art generative AI platforms (e.g., ChatGPT, Gemini, Copilot, and DeepSeek) to enhance automated scoring by incorporating contextual factors and dynamic threat indicators often missed by static methods.
  • Development of a hybrid vulnerability scoring model to integrate expert-driven knowledge with generative AI outputs to adapt and extend CVSS v4.0 for the specific context of autonomous vehicle sensor systems.
The remainder of this paper is organized as follows: Section 2 presents the background and related work. Section 3 outlines the methodology of this work. Section 4 introduces a comprehensive sensor-centric vulnerability taxonomy tailored to autonomous vehicle systems, highlighting over 15 unique attack types. Section 5 describes the assessment of state-of-the-art generative AI models for context-aware vulnerability scoring, evaluating their effectiveness in enhancing traditional CVSS metrics. Section 6 details the proposed hybrid vulnerability scoring model that integrates expert insights with generative AI outputs. Section 7 concludes the paper and outlines future work.

2. Background and Related Work

2.1. Background

2.1.1. Cybersecurity Vulnerabilities in Autonomous and Connected Vehicular Systems

Autonomous vehicles face several inherent vulnerabilities across their systems. Electronic control units (ECUs) are susceptible to attacks due to their proprietary code, which can be re-flashed with modified firmware, tampered memory, or manipulated security keys, especially when physical access is gained. Onboard Diagnostics (OBD) ports are vulnerable to exploitation through the Controller Area Network (CAN) protocol, which lacks encryption and digital signatures, making it prone to replay and denial-of-service attacks. GPS sensors are exposed to manipulation due to open access to data and unencrypted civil signals, with weak terminal capabilities to distinguish false signals from real ones. LiDAR and ultrasonic sensors rely heavily on signal integrity, making them susceptible to spoofing and jamming attacks. Connected car systems, including Wi-Fi, Bluetooth, and mobile internet technologies, often lack robust security measures, while infotainment systems with accessible physical interfaces like USB and CD/DVD ports provide easy entry points for attackers. Firmware is at risk of extraction and decoding, exposing sensitive data such as encryption keys and intellectual property. Localization and navigation technologies, including GPS and HD maps, suffer from weak signal processing, authentication mechanisms, and insecure update frameworks. These vulnerabilities underscore the need for comprehensive cybersecurity measures to safeguard autonomous vehicle systems [24].
The security landscape of vehicular networks also presents a multifaceted array of vulnerabilities and attack vectors, critical for consideration in scientific discourse. In traditional vehicular ad hoc networks, inherent and external factors such as high mobility, real-time communication demands, decentralized architecture, and the heterogeneity of devices contribute to complex attack surfaces. At the physical layer, these networks are susceptible to jamming, eavesdropping, GPS spoofing, interference, node tampering, and impersonation attacks. The data link layer faces threats including traffic analysis, beaconing, replay, identity spoofing, collision, resource exhaustion, bit flipping, and man-in-the-middle attacks. Network layer vulnerabilities encompass Sybil, wormhole, replay, black hole, gray hole, routing table overflow, impersonation, location disclosure, tunneling, packet dropping, malicious flooding, node impersonation, collusion, and routing loop attacks. At the transport layer, vehicular networks face attacks like session hijacking, man-in-the-middle, and various forms of denial of service (DoS). The application layer is vulnerable to malware, phishing, data tampering, and privacy breaches. The emergence of Software-Defined Networking (SDN) in vehicular networks introduces further security challenges due to its complex architecture, centralized control plane, and dynamic topology. SDN-based systems are susceptible to attacks across their data, control, and application planes, including spoofing, flow table overflows, controller compromises, DoS, API exploitation, and cross-layer attacks [25].
The cybersecurity landscape of modern vehicular systems is complex, with various attacks directly correlating to identified vulnerabilities. Blinding and jamming attacks, which involve saturating detectors with infrared light pulses or interfering with laser signals, exploit the reliance of LiDAR and ultrasonic sensors on signal integrity. Network-centric threats, such as black hole attacks, manifest as disruptions where data packets are dropped, preventing their intended network traversal. Similarly, timing attacks introduce delays in time-sensitive communications, thereby disrupting vehicle coordination and safety-critical applications. Physical manipulation can lead to disruptive attacks, for instance, by placing an electromagnetic actuator to interfere with wheel speed sensors, touching upon vulnerabilities related to physical access to components like ECUs. Replay attacks, which involve intercepting and retransmitting or delaying valid data, are particularly pertinent to the Controller Area Network (CAN) protocol accessible via OBD ports, given its lack of encryption and digital signatures. Relay attacks, where signals are intercepted and forwarded to a remote receiver, highlight broader communication vulnerabilities in connected car systems. The passive monitoring of sensor transmissions constitutes an eavesdropping attack, posing a significant threat to location privacy. Sybil attacks exploit network layer vulnerabilities by creating multiple fake identities to manipulate the network. Furthermore, specific sensor weaknesses are targeted by blind spot exploitation attacks, which leverage objects of minimal thickness to avoid detection in blind spot regions, and sensor interference attacks, where ultrasonic sensors are positioned to cause signal disruption. Acoustic cancellation attacks transmit inverted-phase signals to neutralize legitimate sensor signals, further emphasizing the susceptibility of sensory systems. 
Impersonation attacks, by mimicking identities or communication patterns, exploit general trust and identity management issues. Finally, falsified-information attacks involve spreading misleading data to manipulate sensors, while cloaking attacks modify attack signatures to evade detection, underscoring the need for robust cybersecurity measures across all vehicular system layers [24,25].

2.1.2. Common Vulnerability Scoring System

The Common Vulnerability Scoring System (CVSS) is a standardized framework designed to quantify and communicate the severity of security vulnerabilities in a systematic and reproducible manner. Developed and maintained by the Forum of Incident Response and Security Teams (FIRST), CVSS provides a structured methodology for evaluating vulnerabilities based on their intrinsic characteristics, temporal factors, and environmental context. The system generates a numerical score ranging from 0.0 to 10.0, where higher values indicate greater severity, enabling stakeholders to prioritize remediation efforts effectively. CVSS v4.0 assesses vulnerabilities using four metric groups: base, threat, environmental, and supplemental [14], as shown in Figure 1.
The CVSS framework has evolved through various versions to refine its approach to vulnerability severity assessment. While earlier iterations, such as CVSS v3.x, primarily utilized Base, Temporal, and Environmental metric groups, CVSS v4.0, which is adopted in this study, introduces significant enhancements and structural changes. Notably, CVSS v4.0 renames the ‘Temporal’ metric group to ‘Threat’ and incorporates a new ‘Supplemental’ metric group, alongside the retained ‘Base’ and ‘Environmental’ groups. These changes aim to provide a more nuanced and adaptive framework for evaluating vulnerabilities, particularly by better addressing real-time threat intelligence and allowing for the inclusion of organization-specific contextual factors that go beyond the core scoring mechanism.
For this study, the base group is prioritized to evaluate vulnerabilities in automotive systems. The exploitability metrics assess the characteristics of the “vulnerable system”, which refers to the entity directly susceptible to vulnerability. These metrics evaluate the properties of the vulnerability that enable a successful attack, assuming the attacker has advanced knowledge of the target system, including its general configuration and default defense mechanisms. Target-specific mitigations, such as custom firewall rules, are not considered in the base metrics but are instead reflected in the environmental metrics. Specific configurations required for an attack should not influence the base metrics; the vulnerable system is assessed assuming it is in the necessary configuration for exploitation.
  • The Exploitability Sub-Score (ESS): reflects the ease with which a vulnerability can be exploited. It is determined by five metrics—Attack Vector (AV), Attack Complexity (AC), Privileges Required (PR), User Interaction (UI), and Attack Requirements (AT)—each with specific values that contribute to the overall score. The formula for calculating the ESS is illustrated in Equation (1) (all equations, as well as the values specified below, are extracted from [14]):
    ESS = 8.22 × AV × AC × PR × UI × AT
    – Attack Vector (AV): This reflects the relative ease with which an attacker can access the vulnerable component, effectively indicating the remoteness of the attack. It comprises four standardized values, Network (N), Adjacent (A), Local (L), and Physical (P), each associated with a specific numerical weight that contributes to the Exploitability Sub-Score (ESS). A Network (N) vector, valued at 1.0, implies that the vulnerability can be exploited remotely over one or more network hops, such as in a denial-of-service attack triggered via a crafted TCP packet. An Adjacent (A) vector, with a value of 0.77, indicates the attacker must be in a logically adjacent topology, such as the same subnet, as in an ARP flooding attack. A Local (L) vector, valued at 0.62, requires the attacker to have local access or to rely on user interaction—such as social engineering—to trick a user into executing a malicious file. A Physical (P) vector, with a score of 0.29, represents scenarios where the attacker must physically interact with the device, as in a cold boot attack aimed at extracting encryption keys. This metric highlights that the broader the potential attack surface, the greater the number of potential attackers and the higher the severity score.
    – Attack Complexity (AC): This metric evaluates the level of difficulty an attacker faces in successfully exploiting a vulnerability, considering any conditions beyond the attacker’s control. It has two possible values: Low (L) and High (H). A Low (L) complexity, assigned a score of 0.77, implies that the exploit is straightforward, repeatable, and does not depend on specific conditions or defenses in the target environment. Conversely, a High (H) complexity, valued at 0.44, indicates that the attacker must overcome additional barriers such as security mechanisms—e.g., Address Space Layout Randomization (ASLR) and Data Execution Prevention (DEP)—or must possess target-specific knowledge like cryptographic keys. Thus, a higher attack complexity reduces the likelihood of successful exploitation and leads to a lower severity score.
    – Privileges Required (PR): This metric quantifies the level of access privileges an attacker must possess prior to initiating an exploit. It includes three predefined values: None (N), Low (L), and High (H). A None (N) value, scored at 0.85, signifies that the attack can be conducted without any prior access to the vulnerable system. A Low (L) value, with a score of 0.62, indicates that basic user-level access is required. A High (H) value, rated at 0.27, denotes that the attacker must already have elevated privileges such as administrative or root-level access. The more privileged access required, the less severe the vulnerability is considered to be.
    – User Interaction (UI): This metric assesses whether the successful exploitation of a vulnerability depends on any user action. It has two possible values: None (N) and Required (R). A None (N) value, scored at 0.85, indicates that no user involvement is needed—an attacker can exploit the vulnerability independently. In contrast, a Required (R) value, assigned a score of 0.62, means that a user must perform an action (e.g., opening a malicious file or clicking a crafted link) to enable the attack. The required user interaction reduces exploitability and lowers the overall score.
    – Attack Requirements (AT): This metric evaluates whether specific conditions or unusual configurations, beyond the attacker’s control, must be present on the vulnerable system for a successful exploit. It has two possible values: None (N) and Present (P). A None (N) value, scored at 1.0, indicates that the vulnerability is exploitable under typical, default, or common configurations, requiring no special environmental factors for a successful attack. In contrast, a Present (P) value, assigned a score of 0.90, means that successful exploitation depends on additional conditions, such as specific non-default configurations, unusual system states, or other prerequisites on the target. The presence of such requirements reduces the overall exploitability of the vulnerability and lowers the overall score.
  • Impact Sub-Score (ISS): This quantifies the consequences of a successfully exploited vulnerability in terms of its effect on the confidentiality, integrity, and availability of the affected systems (safety and automatability are captured by CVSS v4.0’s supplemental metrics rather than by this formula). The formula for calculating the ISS is given in Equation (2) (all equations, as well as the values specified below, are extracted from [14]):
    ISS = 1 − (1 − C) × (1 − I) × (1 − A)
    – Confidentiality (C): This metric reflects the degree of information exposure resulting from a vulnerability. It has three possible values: High (H), Low (L), and None (N). A High (H) impact, valued at 0.56, indicates complete compromise of confidential information—e.g., all sensitive data is accessible. A Low (L) impact, scored at 0.22, represents the partial disclosure of non-critical or limited data. A None (N) value, with a score of 0.0, means there is no impact on confidentiality.
    – Integrity (I): This metric evaluates the degree to which data can be modified or destroyed by an attacker. The values mirror those of confidentiality. A High (H) impact, valued at 0.56, indicates total compromise—critical data can be arbitrarily altered. A Low (L) impact, scored at 0.22, refers to the partial modification of non-critical data. A None (N) value, rated at 0.0, implies no integrity violation.
    – Availability (A): This metric considers the extent to which the vulnerability impacts the availability of the affected component. A High (H) impact, with a score of 0.56, corresponds to a total service disruption or sustained denial of service. A Low (L) impact, scored at 0.22, implies intermittent or degraded availability. A None (N) value, scored at 0.0, means that the availability of the system remains unaffected.
  • Base Score (BS): This represents the intrinsic severity of a vulnerability in CVSS v4.0. It is derived by summing the Exploitability Sub-Score (ESS) and the Impact Sub-Score (ISS), then applying a maximum threshold of 10. The result is rounded up to one decimal place, as shown in the formula:
    BS = round_up(min(ESS + ISS, 10))
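As a concrete illustration, the three equations above can be combined into a small scoring routine. This is a minimal sketch using only the metric weights quoted in this section; the official FIRST CVSS v4.0 calculator uses a more elaborate lookup-based scoring method, and the example vector at the end is a hypothetical one.

```python
import math

# Metric weights exactly as quoted in the text above (Equations (1)-(3)).
# Note: this is a simplified sketch of the equations cited from [14]; the
# official FIRST CVSS v4.0 calculator uses a lookup-based scoring method.
AV = {"N": 1.0, "A": 0.77, "L": 0.62, "P": 0.29}   # Attack Vector
AC = {"L": 0.77, "H": 0.44}                        # Attack Complexity
PR = {"N": 0.85, "L": 0.62, "H": 0.27}             # Privileges Required
UI = {"N": 0.85, "R": 0.62}                        # User Interaction
AT = {"N": 1.0, "P": 0.90}                         # Attack Requirements
IMPACT = {"H": 0.56, "L": 0.22, "N": 0.0}          # C/I/A impact weights

def base_score(av, ac, pr, ui, at, c, i, a):
    """Base Score = round_up(min(ESS + ISS, 10)), per Equations (1)-(3)."""
    ess = 8.22 * AV[av] * AC[ac] * PR[pr] * UI[ui] * AT[at]
    iss = 1 - (1 - IMPACT[c]) * (1 - IMPACT[i]) * (1 - IMPACT[a])
    return math.ceil(min(ess + iss, 10.0) * 10) / 10  # round up to 1 decimal

# Hypothetical eavesdropping-style vector: network-reachable, low complexity,
# no privileges, no user interaction, no special requirements, high C impact.
print(base_score("N", "L", "N", "N", "N", "H", "N", "N"))  # → 5.2
```

Note how a physically accessed, high-complexity vector scores far lower, reflecting the weight tables: the broader and easier the access, the higher the resulting severity.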

2.1.3. Vehicle Safety System (In-Vehicle Network)

Autonomous and connected vehicles are transforming transportation through the integration of three key systems: safety, connectivity, and diagnostics. These systems, along with their supporting sensor technologies, enable advanced functionalities and enhance the driving experience. However, this increased complexity also introduces significant security vulnerabilities. This section examines the core systems and sensor technologies of autonomous and connected vehicles, outlines fundamental security requirements, and analyzes common attack vectors that threaten their operation.
The evolution of autonomous and connected vehicles hinges on the integration of three foundational systems—safety, connectivity, and diagnostics—each serving distinct yet synergistic roles in redefining modern transportation. As illustrated in Figure 2, these systems collectively enhance vehicular functionality, prioritize occupant safety, and elevate user experience through advanced technological frameworks.
The Safety system comprises active and passive safety mechanisms. Active safety systems assist drivers in accident prevention by utilizing sensing, processing, and actuation technologies. In contrast, passive safety systems include components such as airbags that mitigate risks and protect occupants in the event of an accident [26]. The Connectivity system encompasses all connected vehicle features and is divided into three subcategories: Vehicle-to-Everything (V2X) communication, in-vehicle connections, and smart connected features that enhance user comfort [26].
V2X communication, illustrated in Figure 3, consists of four key types of connectivity: Vehicle-to-Vehicle (V2V), Vehicle-to-Infrastructure (V2I), Vehicle-to-Network (V2N), and Vehicle-to-Pedestrian (V2P). V2P facilitates the direct communication between vehicles and vulnerable road users. V2I enables vehicles to share information with infrastructure-related devices such as cameras and traffic lights. V2V allows vehicles to exchange data regarding location, speed, and other parameters to improve traffic flow and prevent collisions. V2N leverages cloud services to enhance communication between vehicles, traffic management systems, and road users [26].
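A lightweight way to make these four connectivity types explicit in analysis tooling is a simple enumeration. The names and descriptions below are taken from the text [26]; the class itself is an illustrative assumption rather than part of any standard API.

```python
from enum import Enum

class V2XLink(Enum):
    """The four V2X connectivity types described in the text [26]."""
    V2V = "Vehicle-to-Vehicle"         # location/speed exchange between vehicles
    V2I = "Vehicle-to-Infrastructure"  # cameras, traffic lights, roadside devices
    V2N = "Vehicle-to-Network"         # cloud-backed traffic management services
    V2P = "Vehicle-to-Pedestrian"      # direct links to vulnerable road users

for link in V2XLink:
    print(f"{link.name}: {link.value}")
```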
The Diagnostics system, depicted in Figure 4, describes the methods through which vehicles can be accessed for diagnostic and software updates. This can be achieved either remotely over-the-air (OTA) or through a physical connection. OTA updates enable the Original Equipment Manufacturer (OEM) server to connect to the vehicle’s main electronic control unit (ECU). Alternatively, physical diagnostics can be performed using On-Board Diagnostics (OBD) ports [26].

2.1.4. Sensor Technologies in Autonomous Vehicles

Autonomous vehicles rely on various sensor technologies, which are integral to safety and connectivity systems. This section focuses on key sensors used in autonomous vehicles, particularly in relation to the attack vulnerabilities that will be discussed later.
  • LiDAR: Light Detection and Ranging (LiDAR) provides a 360-degree view using laser channels. It maps the surrounding environment in 3D using mechanical, semisolid-state, and solid-state systems [3].
  • Radar: Millimeter-wave radar detects non-transparent materials and supports functions such as blind spot detection and parking assistance [3].
  • GPS: The Global Positioning System enables geographical location tracking by communicating with satellites. Its openness makes it vulnerable to cyberattacks [4].
  • Magnetic Encoders: These measure angular velocity using magnetoresistance or Hall ICs and are essential for the Anti-lock Braking System (ABS) and the Tire Pressure Monitoring System (TPMS) [4].
  • TPMS: Tire Pressure Monitoring Systems include four sensors and an ECU. They transmit tire pressure data securely by filtering out unauthorized IDs [4].
  • Camera: Cameras support object detection, traffic sign recognition, parking, collision avoidance, and night vision capabilities [4].
  • Ultrasonic Sensors: These detect nearby obstacles and are primarily used for low-speed operations like parking [4].
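To connect these sensor categories with the vulnerability findings reported in the abstract, they can be organized into a small machine-readable taxonomy skeleton. This is an illustrative sketch: the distinct-attack-type counts for LiDAR, radar, and ultrasonic sensors are the expert-validated figures from this study, `None` marks sensors whose counts are not quoted in this section, and the field names are assumptions.

```python
# Sketch of a sensor-centric taxonomy record. Attack-type counts for LiDAR,
# radar, and ultrasonic sensors are the expert-validated figures reported in
# the abstract; None marks sensors whose counts are not quoted there.
SENSOR_TAXONOMY = {
    "LiDAR":            {"role": "360-degree 3D environment mapping",  "attack_types": 9},
    "radar":            {"role": "blind spot and parking assistance",  "attack_types": 8},
    "ultrasonic":       {"role": "low-speed obstacle detection",       "attack_types": 6},
    "GPS":              {"role": "geolocation via satellites",         "attack_types": None},
    "camera":           {"role": "object and traffic-sign detection",  "attack_types": None},
    "magnetic encoder": {"role": "angular velocity for ABS/TPMS",      "attack_types": None},
    "TPMS":             {"role": "tire pressure monitoring",           "attack_types": None},
}

# Rank the sensors whose counts are reported, most frequently targeted first.
ranked = sorted(
    (name for name, rec in SENSOR_TAXONOMY.items() if rec["attack_types"] is not None),
    key=lambda n: SENSOR_TAXONOMY[n]["attack_types"],
    reverse=True,
)
print(ranked)  # most frequently targeted sensors first
```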

2.1.5. Generative AI in Cybersecurity

Generative AI (GenAI) has emerged as a valuable tool for social engineering, mainly because large language models (LLMs) can understand and generate human-like language. This makes them useful for identifying potential victims of targeted phishing (spear phishing) and crafting convincing, personalized messages [27]. According to a recent report by the UK’s National Cyber Security Centre (NCSC), AI is expected to enhance the social engineering capabilities of a wide range of cyber threat actors [28]. This includes state-sponsored hackers with advanced skills, organized cybercriminal groups with some limitations in resources, and even less-experienced opportunistic hackers.
Since GPT-4 became publicly available in March 2023, there has been no noticeable spike in the discovery of new types of malware in the wild. This suggests that, while GenAI has potential, it does not yet have the technical ability or training needed to autonomously develop fully functioning malware. The NCSC supports this view, noting that AI in cyberattacks represents more of a gradual change in threat level, rather than a sudden, game-changing shift. It is also worth noting that not every cyberattack involves malware. In fact, 75% of the attacks detected last year did not use malware at all [29]. Instead, cybercriminals have increasingly focused on identity-based attacks—like phishing and social engineering—because they are often more effective and harder to trace. Improved antivirus tools, better platform security, and faster response systems have made traditional malware attacks more difficult and less appealing to sophisticated attackers.
That said, there is still a chance GenAI could tip the scales back toward malware-driven attacks in the future. It is therefore worth summarizing how GenAI might one day be used to automate the creation of malware, discover software vulnerabilities, and shape the longer-term evolution of cyber threats.
  • Threat Intelligence and Adaptive Threat Detection: Generative AI refines threat intelligence by efficiently filtering vast data to prioritize organization-specific risks, reducing noise, and learning from interactions to detect anomalies, enabling rapid adaptation and clear AI-generated risk summaries [30].
  • Predictive and Vulnerability Analysis: GenAI predicts future cyber threats and identifies critical vulnerabilities by analyzing past attacks, enabling organizations to prioritize high-risk areas, proactively strengthen security, and reduce exposure to potential exploits [31].
  • Malware Analysis and Biometric Security: Generative AI enables researchers to create synthetic data and realistic malware samples, safely studying threat behaviors and enhancing biometric security, ultimately strengthening cybersecurity measures through controlled experimentation and analysis [32].
  • Development Assistance and Coding Security: Generative AI assists developers by providing real-time feedback, promoting secure coding practices, and flagging risks early. It learns from past examples to prevent errors and enhance software security throughout development [33].
  • Alerts, Documentation, and Incident Response: Generative AI streamlines alert management by summarizing complex data, improving clarity and response time. It helps cybersecurity teams prioritize threats accurately and offers actionable recommendations for effective risk mitigation [34].
  • Employee Training and Education: Generative AI creates interactive training modules to educate employees on cybersecurity and protocols, reducing human error and strengthening organizational defenses by promoting awareness and adherence to security best practices [35].

2.1.6. The Common Vulnerabilities and Exposures (CVE)

To ensure a standardized and widely understood approach to identifying and discussing cybersecurity flaws, our work references Common Vulnerabilities and Exposures (CVE). CVE is not a database in itself but rather a dictionary or catalog of publicly known information security vulnerabilities and exposures. Established and maintained by the MITRE Corporation with U.S. government funding, the primary purpose of the CVE program is to provide unique, standardized identifiers (known as CVE IDs, e.g., CVE-2024-12345) for specific security vulnerabilities found in software, firmware, or hardware [36].
Each CVE ID comes with a brief, standardized description of the vulnerability. This common naming convention is crucial because it allows security professionals, researchers, vendors, and organizations worldwide to refer to the exact same issue using a universal language. Without CVE, different entities might use their own names for the same vulnerability, leading to confusion, missed patches, and inefficient communication.
While the CVE list itself provides only a basic description and identifier, these CVE IDs often serve as a key reference point for more detailed information found in other public resources, such as the National Vulnerability Database (NVD) [37]. The NVD enriches CVE entries with additional details, including severity scores (often using the Common Vulnerability Scoring System, CVSS), potential impacts, and suggested remediation steps. In essence, CVE acts as the common ‘name tag’ for a vulnerability, facilitating efficient information sharing, coordinated response efforts, and improved overall cybersecurity posture across the global community. Organizations use CVEs to track, prioritize, and address security threats effectively.
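As an illustration of how a CVE ID links to NVD severity data, the sketch below extracts CVSS base scores from an NVD-style API response. The JSON nesting (`vulnerabilities → cve → metrics → cvssMetricV31 → cvssData`) follows our understanding of the public NVD API 2.0 schema; the response content itself is invented for the example.

```python
def extract_base_scores(nvd_response: dict) -> dict:
    """Collect CVSS base scores (keyed by CVSS version) from an
    NVD-style JSON response for a single CVE query."""
    scores = {}
    for item in nvd_response.get("vulnerabilities", []):
        metrics = item["cve"].get("metrics", {})
        for entries in metrics.values():  # keys look like "cvssMetricV31", "cvssMetricV40"
            for entry in entries:
                data = entry["cvssData"]
                scores[data["version"]] = data["baseScore"]
    return scores

# A minimal response shaped like NVD API 2.0 output (values invented):
sample = {
    "vulnerabilities": [{
        "cve": {
            "id": "CVE-2024-12345",
            "metrics": {
                "cvssMetricV31": [{"cvssData": {"version": "3.1", "baseScore": 7.5}}]
            }
        }
    }]
}
print(extract_base_scores(sample))  # {'3.1': 7.5}
```

In a live setting the dictionary would come from an HTTP GET against the NVD CVE endpoint; only the extraction step is shown so the sketch stays self-contained.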

2.2. Related Work

Building upon this general overview of generative AI’s role and potential across various cybersecurity domains, the following section reviews specific existing literature focusing on AI’s applications in vulnerability management and the Common Vulnerability Scoring System (CVSS), which directly informs our proposed methodology.
Generative AI has emerged as a transformative tool in the realm of vulnerability management, enhancing the capabilities of automated vulnerability scoring systems. By leveraging advanced machine learning techniques, organizations can improve the efficiency and accuracy of their vulnerability assessments.
LLMs have significantly advanced generative AI tasks, becoming central to applications such as text generation, translation, and summarization. However, their widespread use has brought increasing concern over security vulnerabilities that could compromise the integrity and reliability of their outputs. Two of the most critical threats identified are prompt injection and training data poisoning attacks, which can lead to biased results, the spread of misinformation, and the generation of malicious content. To address these issues, the work in [38] proposes extending the CVSS framework to assess the severity of LLM-specific vulnerabilities. This approach provides a standardized method for evaluating and understanding these risks, allowing organizations to prioritize mitigation strategies, allocate resources effectively, and implement targeted security measures to protect against potential threats.
The work in [20] addresses the challenge of manually assessing newly published vulnerabilities using the CVSS, which is time-consuming and requires expertise. Previous attempts to predict CVSS vectors or scores using machine learning have typically relied on textual descriptions from databases like the National Vulnerability Database (NVD). In the proposed work, the authors expand on this approach by incorporating additional publicly available web pages referenced by the NVD through web scraping. A Deep Learning-based method is developed and evaluated for predicting CVSS vectors. The paper classifies the reference texts based on their suitability and crawlability. Although the inclusion of these additional texts has a negligible impact, the Deep Learning model outperforms existing methods in predicting CVSS vectors, improving the efficiency of vulnerability assessment.
In [19], the authors explore the use of artificial intelligence models to assist or replace human experts in evaluating the severity of vulnerabilities, given the large volume of vulnerabilities that need assessment. The authors compare the performance of the Universal Sentence Encoder, Generative Pre-trained Transformer, and Support Vector Machine models, trained on 118,000 vulnerabilities and tested on 51,000, against human experts in terms of precision, recall, F1 score, and mean estimation error. The Universal Sentence Encoder outperforms human experts with 72–77% accuracy in predicting severity levels, showing high efficiency in memory usage and processing time. The study further evaluates the models’ effectiveness in predicting vulnerability evaluation components and severity levels. The findings demonstrate the potential of AI to significantly enhance the efficiency and accuracy of vulnerability severity assessments, which is currently a fully manual process.
The work in [22] addresses the limitations of traditional vulnerability assessment methods, which rely on expert knowledge and are time-consuming. Given the increasing number of vulnerabilities, the authors propose using machine learning (ML) solutions for more efficient vulnerability severity assessment. The paper focuses on improving existing ML-based methods that predict the CVSS score or its vector metrics. Recognizing that the quality and diversity of vulnerability descriptions greatly impact prediction accuracy, the authors use generative AI (GPT-3.5 Turbo) to generate CVSS descriptions and introduce a fine-tuned BERT-CNN model to predict CVSS vector metrics. Through experiments using both original and AI-generated data, the authors demonstrate that their proposed architecture significantly enhances the accuracy of vulnerability assessment, outperforming state-of-the-art approaches.
In [39], the authors discuss the VULTURE project, an AI-powered cybersecurity platform designed to address the challenges posed by increasingly sophisticated cyberattacks. Traditional cybersecurity methods struggle to keep up with the speed and complexity of modern threats. VULTURE leverages generative AI (GenAI) and large language models (LLMs) to enhance vulnerability prediction, automate penetration testing, improve intrusion detection, and enable advanced cyber–physical risk profiling. The work explores the platform’s architecture, key innovations, potential impact, and outlines future research directions.
By and large, generative AI can significantly bolster vulnerability management by providing organizations with sophisticated tools for code analysis and predictive insights. This technology allows for the automation of routine tasks, enabling security teams to focus on higher-level strategic activities. For instance, generative AI can swiftly identify and respond to emerging threats, matching the speed and sophistication of malicious actors. Furthermore, selecting an appropriate generative AI model is crucial for effective vulnerability detection. Organizations should consider various generative models, such as Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs), tailored to their specific needs and the nature of their data [16]. The integration of these models facilitates more precise vulnerability scoring by enabling the analysis of large datasets and the identification of hidden patterns. Hence, the effectiveness of generative AI in vulnerability scoring relies heavily on the quality and diversity of the training data used. High-quality datasets are essential for training models to recognize potential vulnerabilities accurately.
In short, the evaluation of AI models’ robustness against adversarial attacks using Common Vulnerability Scoring System (CVSS) metrics provides critical insights into their cybersecurity preparedness. AI models can be assessed for resilience by simulating adversarial attacks and generating CVSS scores through the systematic evaluation of attributes such as attack vector, attack complexity, and required privileges.

3. Methodology

This study employs a three-phase methodology to evaluate cybersecurity vulnerabilities in autonomous vehicles: (1) a systematic literature review, (2) CVSS v4.0 scoring of identified vulnerabilities, and (3) assessing LLM robustness in extracting CVSS attributes.

3.1. Phase 1: Systematic Literature Review

A systematic literature search was conducted to identify relevant studies on cybersecurity vulnerabilities and attacks in autonomous vehicle sensor systems. The search, spanning publications between January 2018 and May 2024, utilized comprehensive keywords such as “automotive cybersecurity”, “autonomous vehicle security”, “sensor attacks”, “LiDAR vulnerability”, “radar cybersecurity”, and “ultrasonic sensor security”. Relevant articles were identified from various academic platforms and publishers, including IEEE Xplore, European Open Science, Elsevier, SCITepress, and MDPI. Additionally, arXiv was consulted for recent, emerging research that may not yet have undergone formal peer review, acknowledging its status as a preprint repository. Some additional articles were also identified through other relevant sources.
Initial searches across these platforms yielded [e.g., “approximately 500”] results. These results were then meticulously screened based on their titles and abstracts for direct relevance to automotive cybersecurity and sensor systems. A total of 16 articles and relevant industry reports were ultimately selected for detailed review, guided by the following stringent criteria:
  • Empirical Validation: The inclusion of experimental or simulation-based studies (e.g., LiDAR jamming tests [4]).
  • Real-World Incidents: The documentation of actual cybersecurity breaches in autonomous vehicles (e.g., Tesla OBD malware injection [12]).
  • Sensor-Specific Analysis: Focus on the cybersecurity of LiDAR, radar, and ultrasonic sensor systems.

3.2. Phase 2: CVSS v4.0 Scoring

The vulnerabilities identified in Phase 1 were assessed using the CVSS v4.0 framework. To ensure relevance to the automotive context, the assigned CVSS scores were validated against automotive-specific criteria detailed in [40,41]. This validation process considered factors such as the following:
  • Exploitability of vulnerabilities within vehicular communication channels (e.g., CAN bus vulnerabilities).
  • The potential impact on the automated driving system.
The methodology for assessing cybersecurity risks in the automotive domain is built upon a targeted survey distributed to cybersecurity risk analysts specializing in the automotive domain; all 10 invited experts responded, a 100% response rate. These respondents are not general automotive industry experts but seasoned cybersecurity risk analysts, each with over a decade of dedicated experience in vehicle security systems and automotive product development. Their expertise spans in-vehicle network architectures, relevant automotive cybersecurity standards (e.g., ISO/SAE 21434, UNECE WP.29 R155), automotive-specific threat modeling and risk assessment, and practical cybersecurity testing methodologies. The primary goal of the survey was to leverage this specialized knowledge to validate the CVSS v4.0 scoring attributes for each of the 15 identified attack types, ensuring robust alignment with real-world threat assessment. A structured questionnaire collected expert evaluations for each individual attack; each attack was described using the standardized format outlined in Table 1, ensuring consistency and minimizing ambiguity. The questionnaire covered all base metrics of the CVSS v4.0 framework, and each question offered predefined answer choices corresponding to the standardized metric values, as shown below:
  • Attack Vector (AV): What level of proximity is required to exploit the vulnerability?
    – Network (N): Exploitable remotely (e.g., via the internet);
    – Adjacent (A): Requires same local network (e.g., Bluetooth and Wi-Fi);
    – Local (L): Requires local access (e.g., USB port or internal interface);
    – Physical (P): Requires direct physical contact (e.g., hardware tampering).
  • Attack Complexity (AC): How difficult is it to successfully execute the attack?
    – Low (L): Easy to exploit, no significant preconditions;
    – High (H): Requires advanced techniques or overcoming security mechanisms.
  • Attack Requirements (AT): Are any specific conditions or configurations needed for the attack?
    – None (N): Works with standard/default settings;
    – Present (P): Requires specific or non-default settings (e.g., debug mode enabled).
  • Privileges Required (PR): What level of system privileges must an attacker have to perform the attack?
    – None (N): No prior access required;
    – Low (L): Basic user-level access;
    – High (H): Administrative or elevated privileges.
  • User Interaction (UI): Does the attack require any user action to succeed?
    – None (N): Fully automated, with no user input needed;
    – Required (R): Needs user engagement (e.g., clicking a malicious link).
  • Confidentiality (C): To what extent does the attack compromise data confidentiality?
    – None (N): No data exposure;
    – Low (L): Limited or non-sensitive data exposure;
    – High (H): Full or critical data disclosure.
  • Integrity (I): To what extent can the attack modify or corrupt data?
    – None (N): No data tampering;
    – Low (L): Minor or non-critical data changes;
    – High (H): Complete or critical data manipulation.
  • Availability (A): What is the impact on system availability?
    – None (N): No effect on operations;
    – Low (L): Performance degradation or intermittent disruption;
    – High (H): Total or persistent service failure.
These responses were used to generate tailored CVSS scores and to validate the exploitability and impact severity of each attack type in the context of autonomous vehicles.
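To make the mapping from questionnaire answers to CVSS v4.0 vectors concrete, the sketch below assembles one expert's answers into a base vector string. Two simplifications are our own assumptions, not part of the survey: the single C/I/A answers are mapped to the v4.0 vulnerable-system metrics (VC/VI/VA) with the subsequent-system impacts (SC/SI/SA) defaulted to None, and a "Required" UI answer is mapped to v4.0's Active (A) value.

```python
# Allowed answers per questionnaire metric (abbreviations as in the survey)
CHOICES = {
    "AV": {"N", "A", "L", "P"},  # Network, Adjacent, Local, Physical
    "AC": {"L", "H"},
    "AT": {"N", "P"},
    "PR": {"N", "L", "H"},
    "UI": {"N", "R"},            # the survey offers None / Required
    "C":  {"N", "L", "H"},
    "I":  {"N", "L", "H"},
    "A":  {"N", "L", "H"},
}

def to_cvss4_vector(answers: dict) -> str:
    """Translate one expert's questionnaire answers into a CVSS v4.0
    base vector string. Assumptions: C/I/A map to the vulnerable-system
    metrics VC/VI/VA, SC/SI/SA default to None, and UI 'Required' is
    rendered as the v4.0 'Active' (A) value."""
    for metric, allowed in CHOICES.items():
        if answers.get(metric) not in allowed:
            raise ValueError(f"invalid or missing answer for {metric}")
    ui = "A" if answers["UI"] == "R" else "N"
    return ("CVSS:4.0"
            f"/AV:{answers['AV']}/AC:{answers['AC']}/AT:{answers['AT']}"
            f"/PR:{answers['PR']}/UI:{ui}"
            f"/VC:{answers['C']}/VI:{answers['I']}/VA:{answers['A']}"
            "/SC:N/SI:N/SA:N")

# Example: a jamming-style attack as one expert might rate it
print(to_cvss4_vector({"AV": "A", "AC": "L", "AT": "N", "PR": "N",
                       "UI": "N", "C": "N", "I": "L", "A": "H"}))
# CVSS:4.0/AV:A/AC:L/AT:N/PR:N/UI:N/VC:N/VI:L/VA:H/SC:N/SI:N/SA:N
```

The resulting vector string can then be fed to any CVSS v4.0 calculator to obtain the numeric severity score.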
Table 1. Summary of attacks on autonomous vehicle sensors and systems.
| Type | Attack | Description | LiDAR | Radar | GPS | Magnetic Encoder | TPMS | Camera | Ultrasonic Sensor | Reference |
|---|---|---|---|---|---|---|---|---|---|---|
| | Blinding Attack | Emitting infrared light pulses matching the sensor's wavelength, saturating its detectors and causing service denial | Yes | No | No | No | No | Yes | Yes | [4,12,42] |
| | Jamming Attack | Emitting light at the same frequency as the LiDAR's laser, directly interfering with the sensor's laser signal | Yes | Yes | Yes | No | No | No | Yes | [4,12,24,40,43] |
| DoS | Black-hole attacks | Drops data packets instead of forwarding them to their intended destination, preventing packets from traversing the network to other vehicles | Yes | Yes | Yes | No | No | No | No | [3,4,44,45,46] |
| DoS | Timing attacks | Introduces delays in time-sensitive communications, disrupting vehicle coordination and safety-critical applications | Yes | Yes | No | No | No | No | No | [3,43,45,47] |
| DoS | Disruptive Attack | Placing an electromagnetic actuator between the wheel speed sensors (exposed beneath the vehicle body) and the ABS tone wheel | No | No | No | Yes | No | No | No | [4,26,48] |
| DoS | Replay attacks | Intercepting and retransmitting or delaying valid data transmissions | Yes | Yes | No | No | No | No | No | [3,4,11,26,41,48,49] |
| MitM | Relay Attack | Intercepts signals and forwards them to a remote receiver | Yes | Yes | No | No | No | No | No | [4,26,41,43] |
| MitM | Eavesdropping Attack | Passive monitoring of sensor transmissions, posing a significant threat to location privacy | No | No | No | No | Yes | No | No | [4,11,12,24,45,48] |
| MitM | Sybil attacks | Creates multiple fake identities to manipulate a network | Yes | Yes | No | No | No | No | No | [3,40,48,49] |
| MitM | Blind Spot Exploitation Attack | Exploits the sensor's inability to detect objects of minimal thickness by placing a thin object in the vehicle's blind spot | No | No | No | No | No | No | Yes | [12,24,26,43] |
| MitM | Sensor Interference Attack | Positions ultrasonic sensors opposite a target vehicle's sensors, causing signal interference | No | No | No | No | No | No | Yes | [12,24,26,43] |
| Spoofing | Acoustic Cancellation Attack | Transmitting an inverted-phase signal that neutralizes legitimate sensor signals | No | No | No | No | No | No | Yes | [4] |
| Spoofing | Impersonation attacks | Mimicking identities, credentials, or communication patterns to trick victims | Yes | Yes | No | No | No | No | No | [3,11,40,44,48,49] |
| Spoofing | Falsified-information attack | Spreading misleading data to manipulate sensors | Yes | Yes | Yes | Yes | Yes | No | No | [4,11,15,40,44,49] |
| Spoofing | Cloaking Attack | Modifies attack signatures or behaviors to avoid matching known threat patterns | No | No | No | No | No | No | Yes | [4,12,24] |

3.3. Phase 3: Assessing LLM Robustness in Extracting CVSS Attributes

The evaluation of the robustness of LLMs against adversarial attacks using CVSS metrics provides critical insight into their cybersecurity preparedness: AI models can be assessed for resilience by simulating adversarial attacks and generating CVSS scores through the systematic evaluation of attributes such as attack vector, attack complexity, and required privileges. In this phase, four AI-based systems (ChatGPT, DeepSeek, Gemini, and Copilot) were tested using standardized vulnerability descriptions from the Common Vulnerabilities and Exposures (CVE) dataset. Because these models rely heavily on textual descriptions to generate CVSS scores, the CVE dataset was used as a benchmark for consistency. To ensure a fair comparison, all four models were given the same standardized prompts via their respective web-based interfaces. Each large language model (LLM) was queried manually through its web interface, and the responses were requested in a structured tabular format. These outputs, representing predicted CVSS metric values for each CVE instance, were then exported and compiled for downstream analysis. The collected outputs were compared against both the ground truth (CVE dataset labels) and expert-reviewed annotations using a confusion matrix, and standard classification metrics (F1 score, precision, accuracy, and recall) were calculated to evaluate model performance. Preliminary findings suggest that deviations between the LLM-generated labels and the CVE dataset annotations may be attributed to the general-purpose nature of these models and their lack of domain-specific training in vulnerability assessment.
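The comparison step can be reproduced with a few lines of code. The sketch below computes per-class precision, recall, and F1 from parallel lists of ground-truth and predicted labels for a single CVSS metric; the toy labels are invented and stand in for one column of the 117-vulnerability dataset.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Per-class precision, recall, and F1 from parallel label lists,
    as used to compare LLM-predicted CVSS metric values against the
    CVE ground truth."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p where the truth was something else
            fn[t] += 1  # missed an instance of the true class t
    report = {}
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        report[c] = {"precision": prec, "recall": rec, "f1": f1}
    return report

# Toy example: Attack Vector labels (N = Network, P = Physical)
truth = ["N", "N", "N", "P", "P"]
pred  = ["N", "N", "P", "P", "P"]
print(per_class_metrics(truth, pred))
```

Per-class scores matter here because, as the abstract notes, aggregate accuracy can hide near-zero F1 on minority classes such as high attack complexity.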

4. Analysis and Severity Assessment of Cybersecurity Threats Targeting Sensor Systems in Modern Vehicles

4.1. Security Attacks

The study examines a range of common attacks in vehicular networks, focusing on those identified across multiple research papers: denial-of-service, man-in-the-middle, spoofing, and other malicious activities that exploit vulnerabilities in vehicle systems and sensor networks. The attacks listed in Table 1 are classified according to their descriptions and the definitions of denial-of-service, man-in-the-middle, and spoofing attacks.
The literature review reveals that a significant portion of the research focused on LiDAR and radar vulnerabilities (72%), justifying the emphasis on these sensor types in the subsequent phase of this study.
The reviewed literature was categorized by attack type and sensor vulnerability, focusing on areas with the highest prevalence of research:
  • LiDAR/Radar Attacks: Twelve studies focused on spoofing, blinding, and jamming attacks (e.g., [3,4,12,26,49]).
  • Network Threats: Eight studies analyzed man-in-the-middle (MitM), denial-of-service (DoS), and Sybil attacks in V2X systems (e.g., [3,11,44,48]).
The analysis of sensor-specific vulnerabilities reveals a significant disparity in the susceptibility of different sensor types to various attack vectors. LiDAR systems emerge as the most vulnerable, being susceptible to nine distinct attack types, including blinding, jamming, and relay attacks. Radar sensors follow closely, with eight identified attack types, while ultrasonic sensors are vulnerable to six types of attacks. These findings underscore the critical importance of securing these sensors due to their central role in autonomous decision-making.
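These per-sensor counts follow mechanically from Table 1. The sketch below transcribes the table's Yes/No cells into attack-to-sensor sets and tallies how many attack types touch each sensor, reproducing the counts stated above.

```python
from collections import Counter

# Attack -> sensors affected, transcribed from the "Yes" cells of Table 1
TABLE1 = {
    "Blinding":               {"LiDAR", "Camera", "Ultrasonic"},
    "Jamming":                {"LiDAR", "Radar", "GPS", "Ultrasonic"},
    "Black-hole":             {"LiDAR", "Radar", "GPS"},
    "Timing":                 {"LiDAR", "Radar"},
    "Disruptive":             {"Magnetic Encoder"},
    "Replay":                 {"LiDAR", "Radar"},
    "Relay":                  {"LiDAR", "Radar"},
    "Eavesdropping":          {"TPMS"},
    "Sybil":                  {"LiDAR", "Radar"},
    "Blind Spot Exploitation": {"Ultrasonic"},
    "Sensor Interference":    {"Ultrasonic"},
    "Acoustic Cancellation":  {"Ultrasonic"},
    "Impersonation":          {"LiDAR", "Radar"},
    "Falsified-information":  {"LiDAR", "Radar", "GPS", "Magnetic Encoder", "TPMS"},
    "Cloaking":               {"Ultrasonic"},
}

# Count, for each sensor, how many distinct attack types affect it
exposure = Counter(s for sensors in TABLE1.values() for s in sensors)
print(exposure.most_common(3))
# [('LiDAR', 9), ('Radar', 8), ('Ultrasonic', 6)]
```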
Regarding attack types that are supported by only a few references within our current review, these findings remain significant. While several well-known attacks—such as jamming, blind spot exploitation, sensor interference, and cloaking—are discussed across multiple sources in the literature, certain other advanced or highly specialized vectors are still documented by only a limited number of studies. This does not imply they are unimportant; rather, it reflects either the emerging stage of development for these threats or their niche focus that has yet to attract broader research attention. For example, attacks like falsified-information may represent early-stage or proof-of-concept vulnerabilities that warrant greater attention from researchers and industry alike. Including these types of attacks, regardless of how many references currently address them, is crucial for presenting a forward-looking and comprehensive perspective on the evolving threat landscape. This ensures that potential future risks are not overlooked simply due to limited existing documentation. Our methodology therefore aims to identify both well-established threats—validated by multiple studies—and less widely studied or emerging threats, providing a holistic view of current and prospective attack surfaces in autonomous vehicle sensor systems.

4.2. Attack Vector Analysis from Expert Surveys

A survey of various attack vectors reveals several common patterns and distinctions in how these attacks are carried out and in their potential impact. This analysis examines the attack vector, complexity, requirements, privileges, user interaction, impact on security properties, and CVSS scores for fifteen different attack types. Overall, network-based attacks are the most common, but physical and adjacent attacks also pose significant threats. Most attacks require no user interaction, highlighting the challenge of relying on user awareness for mitigation. Integrity and availability are the security properties most often affected, while confidentiality is less frequently impacted. Table 2 provides a consolidated view of the attack types, highlighting their key characteristics.
Most attacks are rated as high complexity, requiring advanced techniques, although blinding, jamming, disruptive, eavesdropping, blind spot exploitation, acoustic, and sensor interference attacks also include a significant proportion of low-complexity cases. A large share of attacks require preconditions, indicating that attackers often need specific vulnerabilities or setups; those same seven attack types, however, can often be executed without specific preconditions. Most attacks require low or no privileges, making them accessible to a wide range of attackers, whereas replay, Sybil, impersonation, and falsified-information attacks often require high privileges. The majority of attacks require no user interaction, making them difficult to counter with user awareness training. Availability and integrity are the most frequently affected security properties, while confidentiality is most strongly affected by eavesdropping, timing, replay, relay, Sybil, impersonation, falsified-information, and cloaking attacks.
The average CVSS scores indicate that most of these attacks range from medium to high severity. Eavesdropping (7.19, Std. Dev. 1.3), Sybil (6.76, Std. Dev. 1.6), cloaking (6.66, Std. Dev. 1.5), replay (6.35, Std. Dev. 1.7), falsified-information (6.22, Std. Dev. 1.8), and black-hole (6.0, Std. Dev. 1.8) attacks have the highest average scores, and the standard deviations across attack types range from 1.1 to 1.9, indicating a moderate spread of severity ratings within each type. Physical attack vectors, such as blinding, disruptive, blind spot exploitation, and sensor interference attacks, often involve low complexity, frequently require no preconditions and no user interaction, and are therefore particularly challenging to defend against. The privileges required for these attacks are typically none or low, increasing the potential attacker base. In terms of impact, they primarily affect availability and integrity, with average CVSS scores ranging from 4.06 to 5.09, indicating a moderate level of severity.
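To show how the reported averages and spreads are obtained, the sketch below summarizes one attack's ten expert ratings with the mean and population standard deviation. The individual ratings are invented for illustration (the paper reports only the aggregates, and does not state whether a population or sample deviation was used).

```python
import statistics

def summarize(scores):
    """Average CVSS score and spread across the 10 expert ratings for
    one attack type (population std. dev. assumed)."""
    return round(statistics.mean(scores), 2), round(statistics.pstdev(scores), 2)

# Hypothetical ratings from ten experts for an eavesdropping scenario
eavesdropping = [7.1, 8.5, 6.2, 7.4, 9.0, 5.8, 7.0, 6.9, 8.1, 5.9]
print(summarize(eavesdropping))  # (7.19, 1.03)
```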

4.3. Attack Vector Analysis from AI Engine Surveys

This section presents a comparative analysis of attack vectors identified by four AI engines: ChatGPT GPT-4o, DeepSeek-R1, Copilot, and Gemini 1.5 Flash. Each engine conducted the same cybersecurity survey performed by the experts, and their findings are presented separately to highlight the nuances in their assessments. The analysis covers attack types, vectors, complexity, requirements, privileges, user interaction, impact on security properties (confidentiality, integrity, and availability), and CVSS scores. Following the individual analyses, a consolidated summary identifies the key similarities and differences across the AI engines’ perspectives.

4.3.1. ChatGPT Survey Results

Table 3 shows the results from the ChatGPT survey. ChatGPT’s survey highlights a mix of network and physical attacks. Notably, many attacks (jamming, disruptive, replay, and eavesdropping) are seen as low complexity and require no user interaction, increasing their potential impact. High scores are assigned to jamming, disruptive, replay and eavesdropping. Integrity and availability are the most affected security properties.

4.3.2. DeepSeek Survey Results

Table 4 shows the results from the DeepSeek survey. DeepSeek’s survey also emphasizes network-based attacks but with a focus on attacks requiring preconditions. High scores are assigned to replay, relay, Sybil, impersonation, and cloaking attacks. Integrity and availability are highly affected, with confidentiality also being a significant concern.

4.3.3. Copilot Survey Results

Table 5 shows the results from the Copilot survey. Copilot’s survey strongly emphasizes high-severity network-based attacks, particularly black hole, replay, falsified information, and cloaking attacks. It highlights their ability to disrupt network operations, manipulate data integrity, and evade detection. These attacks exploit network vectors, require no user interaction, and often demand no privileges, enabling stealthy, large-scale exploitation. Lower-severity physical and adjacent attacks, such as blinding and acoustic, persist as localized threats, relying on proximity or hardware tampering but lacking scalability. Key trends include a focus on integrity and availability compromise, the prevalence of low/no privilege requirements, and the dominance of high-complexity techniques for persistent threats.

4.3.4. Gemini Survey Results

Table 6 shows the results from the Gemini survey. Gemini’s survey underscores a cybersecurity environment dominated by high-severity network-based attacks, with blinding, jamming, replay, relay, and sensor interference attacks posing the most critical threats. These attacks predominantly exploit network vectors, require no user interaction, and often operate with low or no privileges, enabling large-scale exploitation through passive or automated means. Eavesdropping stands out as a confidentiality-focused risk, targeting unsecured data transmissions, while Sybil, timing, and falsified information attacks leverage high complexity to manipulate trust in distributed systems. Lower-severity local attacks, such as blind spot exploitation and acoustic, remain niche threats, relying on physical proximity or specialized execution.

4.3.5. Comparative Analysis Between AI Agents

A comprehensive examination of the AI-driven evaluations from ChatGPT, DeepSeek, Copilot, and Gemini reveals notable patterns and divergences in how these engines perceive and assess cybersecurity threats in autonomous vehicles. While all four AI agents largely converge on the notion that network-based attacks represent the most pressing threat, there are significant nuances in their individual assessments. For instance, replay attacks consistently received high severity scores across all engines, ranging from 8.7 to 9.2, underscoring their perceived potential to disrupt system integrity and availability. Similarly, falsified information and cloaking attacks were broadly rated as severe, indicating a shared recognition of the risks posed by data manipulation and evasion tactics.
However, substantial variability is evident in the severity scoring of certain attacks. Blinding attacks, for example, were rated as low-severity by DeepSeek (4.4) and Copilot (5.4), but Gemini assessed them much more critically at 8.3, suggesting divergent interpretations of the attack’s real-world impact. Disruptive attacks, too, displayed a similar divergence, with ChatGPT and DeepSeek rating them as highly critical (above 8.7), while Copilot considered them significantly less severe (4.4). These discrepancies may reflect differences in how each AI model contextualizes physical attacks or how much weight they assign to environmental versus intrinsic factors.
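One simple way to quantify this inter-engine disagreement is the max-min spread of each attack's severity scores. The sketch below applies it to the blinding-attack ratings quoted above (ChatGPT's rating is not given in the text, so only three engines appear).

```python
def score_spread(ratings: dict) -> float:
    """Max-min spread of one attack's severity scores across AI engines:
    small values indicate consensus, large values indicate divergence."""
    return round(max(ratings.values()) - min(ratings.values()), 1)

# Blinding-attack scores as reported in the text
blinding = {"DeepSeek": 4.4, "Copilot": 5.4, "Gemini": 8.3}
print(score_spread(blinding))  # 3.9
```

By the same measure, the replay-attack scores (8.7 to 9.2) give a spread of only 0.5, numerically contrasting the engines' consensus on replay with their divergence on blinding.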
The AI engines also varied in their treatment of attack complexity. ChatGPT and Copilot emphasized the threat of low-complexity attacks, highlighting their accessibility and repeatability, which is crucial in assessing how easily an adversary might exploit a vulnerability without requiring advanced techniques or knowledge. In contrast, Gemini and DeepSeek leaned more heavily toward classifying attacks as high complexity, perhaps signaling a more conservative or risk-averse interpretation based on system-level constraints or defensive assumptions.
Another area of divergence lies in privilege requirements. ChatGPT and Copilot consistently emphasized that most attacks demand little to no privileges, thereby broadening the potential attacker base and highlighting the need for systemic defenses rather than access control alone. While DeepSeek and Gemini generally concurred, they placed slightly more emphasis on scenarios requiring specific preconditions or contextual setups before an attack could be carried out.
With respect to the CIA triad, all AI agents agreed that integrity and availability are the most frequently impacted security properties. However, there was variance in the treatment of confidentiality. ChatGPT and DeepSeek recognized confidentiality loss in attacks like eavesdropping, impersonation, and Sybil, whereas Copilot treated it as a secondary concern in most cases. Gemini presented a more balanced view, identifying confidentiality as a concern primarily in attacks involving data leakage or unauthorized observation, such as eavesdropping and replay.
Ultimately, this comparative analysis demonstrates that while the AI engines broadly agree on key threat vectors and high-risk attack types, they also offer unique perspectives shaped by differing prioritizations of complexity, privilege requirements, and system impacts. These differences reinforce the importance of multi-engine evaluation in cybersecurity risk assessments, ensuring that diverse interpretive models are considered when establishing severity baselines and mitigation priorities.

4.4. Comparison Between AI Agents and Expert Survey

When comparing the AI-driven evaluations with the human-centric insights gathered from the expert user survey, several points of convergence and divergence emerge, revealing the complementary strengths of algorithmic and experiential approaches to cybersecurity risk assessment. Both the AI agents and the user survey strongly agree that network-based attacks pose the greatest threat to autonomous vehicles, with high-severity classifications assigned to attacks such as replay, Sybil, eavesdropping, falsified information, and cloaking. This shared emphasis on network vulnerabilities highlights the critical importance of securing vehicular communication systems, particularly as vehicles become increasingly interconnected through protocols like V2X.
The consensus also extends to the impact dimensions of these attacks. Both AI and user perspectives identify integrity and availability as the most commonly compromised security properties, reflecting the potentially catastrophic consequences of data manipulation and system outages in safety-critical automotive functions. Furthermore, there is a uniform recognition that most attacks do not require user interaction, thereby making them difficult to detect and prevent using traditional user-centric security mechanisms like alerts or permissions. This insight underscores the necessity for embedded and autonomous cybersecurity measures that can operate without user mediation.
Nevertheless, the comparison also uncovers key differences that highlight the unique vantage points of human experts versus AI systems. One notable divergence is in the assessment of attack complexity. While AI engines such as ChatGPT and Copilot tended to rate many attacks as low in complexity, suggesting a heightened concern about widespread, easily executed threats, the user survey leaned toward classifying a greater number of attacks as high complexity. This disparity may stem from human evaluators’ greater awareness of real-world implementation constraints and their understanding of layered vehicular architectures that can introduce additional barriers to exploitation.
Privilege requirements also reveal a point of divergence. The AI engines predominantly assessed attacks as requiring little to no privileges, possibly due to their assumption of idealized or default system configurations. In contrast, the user survey suggested that many of the more severe attacks, such as impersonation, Sybil, and falsified information attacks, often necessitate elevated access levels or pre-existing footholds within the system. This discrepancy may be due to the practitioners’ awareness of real-world defensive architectures that add layers of privilege enforcement not always visible in abstract threat modeling.
Another area of contrast lies in the perceived importance of confidentiality. AI agents such as Gemini and DeepSeek frequently rated attacks like eavesdropping, relay, and timing as serious confidentiality threats, while the user survey assigned higher weight to integrity and availability. This could indicate that practitioners are more focused on threats that disrupt system functionality and decision-making processes, whereas AI engines, trained on diverse datasets, may give equal or heightened consideration to data leakage risks.
Furthermore, the AI engines tended to cover a wider spectrum of attack vectors—including physical, local, and adjacent vectors—than the user survey, which focused predominantly on network-based attacks. This broader scope may reflect the AI tools’ systematic and exhaustive analytical frameworks, while human experts may prioritize more immediate and commonly encountered threats based on their domain experience. Taken together, these differences underscore the value of synthesizing AI-based and human-centered analyses in the cybersecurity risk evaluation process. AI engines offer breadth, consistency, and computational rigor, while expert users provide depth, context-awareness, and practical insight grounded in real-world experience. By integrating both perspectives, cybersecurity practitioners can arrive at more balanced and actionable risk assessments, ensuring that both theoretical vulnerabilities and operational realities are accounted for in the development of protective strategies.

5. Application of Generative AI Models for Context-Aware Vulnerability Scoring

5.1. Dataset Characteristics

The dataset utilized for this study comprises 117 records, each detailing various security vulnerability metrics. A significant majority of vulnerabilities exhibit a Network-based Attack Vector (104 instances), with fewer Adjacent (2) and Local (11) vectors. Attack Complexity is predominantly low (108 cases) compared to high (9). Most vulnerabilities maintain an Unchanged Scope (100 instances) rather than Changed (17), and User Interaction is largely unnecessary (97 instances) versus required (20). In terms of Privileges Required, 60 entries demand None, 48 require Low, and 9 need High privileges. The impact metrics show varying distributions: Confidentiality Impact is most frequently Low (57), followed by High (37) and None (23). Integrity Impact leans towards Low (63), with High (33) and None (21) being less common. The Availability Impact presents a near-equal split between Low (35), High (34), and None (48). The Base Score metrics indicate a broad severity range (Minimum: 2.4, Maximum: 10), with a mean of 6.6 and a median of 6.4, suggesting a prevalence of moderate to critical vulnerabilities within the dataset.
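As a sanity check, the per-attribute counts reported above can be reproduced in a few lines of Python; the label lists below are an illustrative reconstruction of the dataset’s distributions, not its actual schema.

```python
from collections import Counter

# Illustrative reconstruction of the 117-record dataset's class counts
# as stated in the text; label names are assumptions, not the real schema.
attack_vector = ["NETWORK"] * 104 + ["ADJACENT"] * 2 + ["LOCAL"] * 11
attack_complexity = ["LOW"] * 108 + ["HIGH"] * 9

print(Counter(attack_vector))  # NETWORK dominates: 104 of 117
print(len(attack_vector))      # 117 records in total

# Share of low-complexity vulnerabilities, matching the abstract's 92.3%:
share = Counter(attack_complexity)["LOW"] / len(attack_complexity)
print(round(share * 100, 1))   # 92.3
```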

5.2. Methodologies of CVSS Attribute Extraction

The following prompt was used to elicit how each model derived each attribute value: “can you explain to me how the results are generated or in other words how you think about the attributes generated”. The following are the responses from the three LLMs:
  • ChatGPT: Rule-based NLP heuristics. ChatGPT leveraged rule-based NLP heuristics to parse vulnerability descriptions. This approach involved applying predefined keyword-driven heuristics to infer CVSS attributes rather than accessing external CVE databases. The methodology aimed to mimic a human analyst’s logical deductions by recognizing phrases commonly associated with specific CVSS metrics. For instance, keywords like “remote” indicated a network attack vector, while “physical” or “local” suggested local or physical. Phrases such as “easily exploitable” implied low attack complexity, whereas “complex setup required” suggested high. Terms like “unauthenticated” indicated none for Privileges Required, while “admin” or “root” denoted high. User interaction was inferred from mentions like “user must open” or “social engineering”. References to broader system impact suggested a changed scope, while their absence implied unchanged. Impact metrics (confidentiality, integrity, and availability) were influenced by mentions of “data leakage”, “tampering”, and “denial of service”, respectively.
  • DeepSeek: Structured question-based approach. DeepSeek adopted a structured question-driven methodology, explicitly querying vulnerability descriptions against predefined conditions. Each CVSS attribute was determined through specific queries designed to guide the model in analyzing relevant phrases. For the attack vector, it identified remote attacks as network and local execution as local. Attack complexity was differentiated between low-complexity attacks (“trivial to exploit”) and high-complexity scenarios (“race condition”). The Privileges Required were categorized into none, low, or high based on authentication levels. User interaction involved evaluating whether explicit user action was required. Scope determined cross-system impact if stated. The impact metrics (confidentiality, integrity, and availability) assessed explicit mentions of data exposure, manipulation, or service disruption.
  • Gemini: Systematic text interpretation using the worst-case principle. Gemini employed a systematic attribute extraction approach that relied strictly on the vulnerability description itself, avoiding assumptions beyond explicit statements. Its methodology incorporated exclusive reliance on the provided text to prevent external data usage and ensure self-contained analysis. A key principle was the worst-case impact principle: if ambiguity existed, values were assigned based on the maximum potential risk. Attribute-by-attribute derivation iteratively evaluated each CVSS metric following strict definitional guidance.
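To make the first of these strategies concrete, the sketch below implements the kind of keyword heuristic described for two of the attributes; the keyword lists are illustrative, and the fallback to NETWORK for descriptions with no cue is our assumption, not part of the reported methodology.

```python
def infer_attack_vector(description: str) -> str:
    """Keyword heuristic of the kind described above (illustrative lists)."""
    text = description.lower()
    if "physical" in text:
        return "PHYSICAL"
    if "local" in text:
        return "LOCAL"
    if "remote" in text or "network" in text:
        return "NETWORK"
    return "NETWORK"  # assumed fallback: default to the most common vector

def infer_attack_complexity(description: str) -> str:
    """HIGH only when an explicit difficulty cue appears, else LOW."""
    text = description.lower()
    if "race condition" in text or "complex setup" in text:
        return "HIGH"
    return "LOW"

print(infer_attack_vector("A remote attacker can replay V2X messages"))  # NETWORK
print(infer_attack_complexity("Trivial to exploit over the CAN bus"))    # LOW
```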

5.3. Performance Evaluation of LLMs

The performance of each LLM in extracting CVSS attributes was evaluated based on the precision, recall, and F1 score for each class within each attribute.
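These per-class metrics can be computed directly from the predicted and annotated labels; the helper below is a minimal standard-library implementation (equivalent in spirit to scikit-learn’s classification_report), and the toy example illustrates the failure mode discussed in the following subsections, where a model that never predicts a minority class scores F1 = 0 on it.

```python
from collections import defaultdict

def per_class_metrics(y_true, y_pred):
    """Per-class precision, recall, F1, and support for multi-class labels."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but truth was t
            fn[t] += 1  # missed an instance of t
    report = {}
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        report[c] = {"precision": prec, "recall": rec,
                     "f1": f1, "support": tp[c] + fn[c]}
    return report

# Toy example: a classifier that always predicts the majority class LOW
# gets F1 = 0 on the minority class HIGH.
truth = ["LOW"] * 8 + ["HIGH"] * 2
preds = ["LOW"] * 10
print(per_class_metrics(truth, preds)["HIGH"]["f1"])  # 0.0
```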

5.3.1. ChatGPT Results

A comprehensive evaluation across eight key vulnerability metrics reveals substantial performance heterogeneity in the multiclass classification system. While the model achieves robust results for majority classes (e.g., Attack Complexity-LOW: F1 = 93.12%, Scope-UNCHANGED: F1 = 91.59%), it exhibits critical deficiencies in minority and high-impact categories. Most notably, the system completely fails to identify HIGH attack complexity (F1 = 0%), severely underperforms on CHANGED scope (Recall = 30.00%), and demonstrates near-chance prediction for HIGH availability impact (F1 = 34.78%). Class imbalance emerges as a predominant limitation, particularly affecting adjacent attack vectors (Support = 5, F1 = 33.33%) and high privilege requirements (Support = 14, F1 = 24.99%). Aggregate accuracy ranges from 51.61% (availability) to 87.12% (attack complexity), but these figures mask severe per-class deficiencies as evidenced by macro-F1 scores as low as 49.69% (availability). These results, summarized in Table 7, highlight the fundamental challenges in handling class imbalance and distinguishing high-severity vulnerability attributes within the current framework.

5.3.2. DeepSeek Results

Quantitative evaluation across eight critical vulnerability attributes demonstrates significant performance variations, with accuracy ranging from 65.15% (Availability) to 98.29% (Attack Vector). The model achieves exceptional performance for majority classes, notably NETWORK vectors (F1 = 99.07%, Recall = 100%) and UNCHANGED scope (Accuracy = 90.60%, F1 = 94.81%). However, severe deficiencies emerge in high-impact minority categories: complete failure in HIGH complexity detection (F1 = 0%, Accuracy = 92.31%), critically low recall for LOCAL vectors (81.82%) and CHANGED scope (38.46%), and alarmingly weak HIGH availability prediction (F1 = 54.05%, Accuracy = 65.15%). Performance degradation correlates strongly with class rarity as evidenced by 33.02% average accuracy drop for attributes with <20 support instances. These results, summarized in Table 8, highlight the fundamental limitations in handling class imbalance and high-severity categories, necessitating resampling strategies and cost-sensitive learning approaches for operational deployment.

5.3.3. Gemini Results

Evaluation of the multiclass vulnerability classification system reveals significant performance variations across security attributes, with accuracy ranging from 63.75% (Availability) to 95.73% (Attack Vector). While the model demonstrates strong capability in majority classes such as NETWORK vectors (F1 = 97.60%) and LOW complexity (F1 = 95.24%), it exhibits critical deficiencies in high-impact minority categories. Most alarmingly, HIGH attack complexity shows near-failure performance (F1 = 15.38%), LOCAL vectors suffer substantially reduced recall (76.92%), and CHANGED scope detection remains critically low (Recall = 40.00%). Performance degradation correlates strongly with class rarity, with HIGH-privilege requirements (Support = 14) and HIGH-availability impacts (Support = 23) showing 32.15% and 16.11% F1 reductions, respectively, compared to their NONE counterparts. These results, summarized in Table 9, coupled with macro-F1 scores as low as 63.30% (Availability), highlight the fundamental challenges in class imbalance handling and high-severity vulnerability recognition that must be addressed for operational deployment.

5.4. Comparative Analysis of LLM Results

A comparative analysis of ChatGPT, DeepSeek, and Gemini reveals both similarities and differences in their performance across CVSS attributes.

5.4.1. Similarities

  • Class Imbalance Impact: All three models consistently demonstrated superior performance on majority classes (e.g., NETWORK Attack Vector, LOW Attack Complexity, UNCHANGED Scope, and NONE User Interaction). Conversely, they all struggled significantly with minority classes (e.g., ADJACENT Attack Vector for ChatGPT, HIGH complexity across all models, CHANGED Scope, and REQUIRED User Interaction). This pattern is attributed to the inherent class imbalance in the dataset, where models tended to prioritize the more frequent categories.
  • Difficulty with “HIGH” Class Predictions: A universal challenge for all three models was the accurate prediction of the “HIGH” class across various attributes, particularly Attack Complexity, Privileges Required, Confidentiality Impact, Integrity Impact, and Availability Impact. For instance, ChatGPT completely failed to predict HIGH complexity (0% precision, recall, and F1), and DeepSeek also showed 0% for this class. Gemini, while not zero, still had an abysmal F1 score of 15.38% for HIGH complexity. This suggests a fundamental difficulty in recognizing nuanced indicators for severe impact or effort levels, possibly due to limited training examples for these critical but rarer scenarios.
  • Multi-Class Attribute Challenges: For attributes with three classes (Privileges Required, Confidentiality Impact, Integrity Impact, and Availability Impact), all models showed declining performance for the less frequent classes (LOW and HIGH) compared to the NONE class. This indicates general confusion or less robust discrimination between these closely related impact levels.

5.4.2. Differences

  • Attack Vector Handling: ChatGPT treated ADJACENT as a separate class, leading to poor performance (28.57% precision, 40.00% recall) due to its low representation. DeepSeek and Gemini merged ADJACENT into the NETWORK class, which significantly boosted their NETWORK class performance (DeepSeek: 98.15% precision, 100.00% recall for NETWORK; Gemini: 97.14% precision, 98.08% recall for NETWORK). This methodological difference directly impacted reported metrics for the attack vector attribute.
  • Overall Performance on Multi-Class Attributes: DeepSeek generally demonstrated stronger macro-average F1 scores for the three-class attributes (Privilege Required: 80.47%; Confidentiality: 74.42%; Integrity: 72.19%; Availability: 63.85%) compared to ChatGPT (Privilege Required: 57.44%; Confidentiality: 56.57%; Integrity: 56.33%; Availability: 49.69%) and Gemini (Privilege Required: 74.04%; Confidentiality: 76.31%; Integrity: 69.37%; Availability: 63.30%). DeepSeek’s structured question-based approach appears to yield more balanced performance across these classes, especially for “HIGH” values, compared to ChatGPT’s rule-based heuristics, which struggled significantly with minority classes. Gemini showed competitive performance with DeepSeek on Confidentiality and Availability but slightly lower on Privilege Required and Integrity.
  • Recall for “REQUIRED” User Interaction: DeepSeek exhibited a slightly better recall for the “REQUIRED” user interaction class (44.44%) compared to ChatGPT (40.00%) and Gemini (43.75%), although all models still showed a “critical gap” in detecting these interactions due to imbalance.
  • Approach to Ambiguity: Gemini’s “worst-case impact principle” is a distinct feature, aiming to assign values based on maximum potential risk in ambiguous cases. While the direct impact of this principle on specific metrics is not explicitly quantified as a separate variable in the provided data, it represents a notable difference in its underlying decision-making logic compared to the other models.

5.4.3. Discussion

In summary and as shown in Figure 5, while all three LLMs grappled with class imbalance and the accurate prediction of minority HIGH classes, DeepSeek generally demonstrated more robust performance across multi-class attributes. ChatGPT’s rule-based heuristics led to more pronounced failures in minority class detection, particularly for ADJACENT Attack Vector and HIGH complexity. Gemini, while competitive in some areas, also struggled with minority classes, although its worst-case principle offers a different philosophical approach to vulnerability assessment. These differences highlight the varying levels of contextual interpretation and robustness each model provides in vulnerability assessment.

6. Hybrid Vulnerability Scoring Model

To benchmark LLM performance against traditional approaches, we developed a dedicated machine learning classifier for attack complexity prediction. This model was trained exclusively on CVE textual descriptions to classify complexity as LOW or HIGH using two distinct methodologies as follows.

6.1. Model Architectures

  • Traditional ML Pipeline (TF-IDF + Logistic Regression)
    • Text Representation: TF-IDF vectorization (max 3000 features)
    • Classification: Logistic Regression with class weighting
    • Class Imbalance Handling: Inverse frequency weighting
  • Transformer-Based Model (BERT)
    • Base Architecture: bert-base-uncased
    • Fine-tuning: 3 epochs (batch size = 8)
    • Sequence Length: 512 tokens
    • Evaluation Metrics: Precision, Recall, F1 (per class)
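The inverse frequency weighting step of the traditional pipeline can be sketched as follows; this mirrors the n_samples / (n_classes × count) formula used by scikit-learn’s class_weight="balanced", which we assume is what the pipeline employed.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Inverse frequency class weights: rarer classes get larger weights.
    Same formula as sklearn's class_weight='balanced' (assumed here)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# With the 10.81:1 LOW/HIGH imbalance reported later (3350 vs. 310 samples),
# the HIGH class receives ~10.81x the weight of LOW:
weights = inverse_frequency_weights(["LOW"] * 3350 + ["HIGH"] * 310)
print(round(weights["HIGH"] / weights["LOW"], 2))  # 10.81
```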

6.2. Performance Comparison

This study demonstrates that class imbalance represents a fundamental limitation in automated vulnerability assessment, affecting both large language models (LLMs) and dedicated machine learning classifiers. Key findings, as detailed in Table 10 and illustrated in Figure 6, reveal the following:
  • Universal Challenge: All evaluated approaches—including ChatGPT, DeepSeek, Gemini, and specialized classifiers (TF-IDF/Logistic Regression, BERT)—exhibited critical failures in detecting minority classes. The complete inability of BERT to identify HIGH-complexity vulnerabilities (0% recall) underscores that architectural sophistication alone cannot overcome inherent data distribution limitations.
  • LLM Performance Hierarchy:
    • Attack Complexity (HIGH) emerged as the most challenging attribute across LLMs (F1 ≤ 15.38%), justifying its selection for targeted enhancement.
    • Increasing class granularity improved discrimination, with multi-class attributes (e.g., privilege required) yielding higher accuracy than binary classifications.
  • Traditional ML Advantage: In low-data regimes, classical methods (TF-IDF + Logistic Regression) demonstrated greater robustness to imbalance compared to transformers, achieving non-zero detection (40.0% F1) for HIGH-complexity cases where fine-tuned BERT failed completely.
Table 10. Performance of traditional vs. transformer-based models for attack complexity prediction.

| Model | Accuracy | Precision (LOW) | Precision (HIGH) | Recall (LOW) | Recall (HIGH) | Macro F1 | Processing Time |
|---|---|---|---|---|---|---|---|
| TF-IDF + Logistic Regression | 82.1% | 88.2% | 40.0% | 90.9% | 33.3% | 63.0% | 2 min |
| BERT (Fine-tuned) | 84.6% | 84.6% | 0.0% | 100.0% | 0.0% | 45.8% | 30 min |

Figure 6. Model performance comparison.

6.3. Key Observations

  • Class imbalance sensitivity: Both models exhibited bias toward the majority class (LOW), mirroring the LLMs’ performance limitations. BERT completely failed to identify HIGH-complexity cases.
  • Traditional vs. Transformer Tradeoffs:
    • Logistic regression showed limited but non-zero capability for HIGH-complexity detection.
    • BERT achieved superior accuracy but collapsed predictions to the majority class.
    • Both underperformed compared to the best LLMs (e.g., Gemini achieved 15.38% F1 on HIGH).
  • Data Requirements: The transformer model required substantially more data than available (193 samples), highlighting a key limitation for vulnerability classification tasks.

6.4. Proposed Model

This study proposes a novel hybrid Human–AI model for automated yet context-aware vulnerability scoring of sensor system threats, leveraging the strengths of both large language models (LLMs) and domain experts. The proposed model operates by dissecting the vulnerability scoring process into modular attribute-based components, following the CVSS v4.0 structure. Specifically, eight core attributes are used for scoring: attack vector, scope, user interaction, privilege required, confidentiality impact, integrity impact, availability impact, and attack complexity. Each of these attributes plays a fundamental role in defining the severity of a given cyber threat, yet presents varying levels of interpretability for AI models and human experts.
As shown in Figure 7, for seven of these attributes—excluding attack complexity—the model aggregates the output of three LLMs: ChatGPT, DeepSeek, and Gemini. Each model processes a textual vulnerability description and outputs predicted CVSS values. Due to variability in model behavior and heuristic approaches, each attribute prediction is normalized and averaged across the three models. This ensemble approach mitigates model-specific biases and improves the overall robustness and generalizability of predictions. For instance, while ChatGPT may excel in extracting structured patterns from loosely described attack scenarios, DeepSeek applies rule-based classification, and Gemini incorporates a worst-case principle to reduce the underestimation risks in ambiguous contexts. The aggregation of these three perspectives leads to a more balanced attribute assessment.
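A minimal sketch of this normalize-and-average step, under the assumption that each categorical prediction is mapped to a numeric weight, averaged across the three models, and mapped back to the nearest category; the weight values below are placeholders, not the official CVSS metric constants.

```python
# Placeholder weights for the Attack Vector attribute (illustrative only).
AV_WEIGHTS = {"NETWORK": 0.0, "ADJACENT": 0.1, "LOCAL": 0.2, "PHYSICAL": 0.3}

def aggregate_attribute(predictions, weights):
    """Average the models' predictions in weight space, then return the
    category whose weight is closest to the mean."""
    mean = sum(weights[p] for p in predictions) / len(predictions)
    return min(weights, key=lambda c: abs(weights[c] - mean))

# e.g. ChatGPT and DeepSeek predict NETWORK, Gemini predicts ADJACENT:
print(aggregate_attribute(["NETWORK", "NETWORK", "ADJACENT"], AV_WEIGHTS))
# NETWORK
```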
In contrast, the attribute of attack complexity poses a unique challenge for AI models due to its reliance on nuanced contextual interpretation, such as the preconditions required for a successful exploit. As such, the model integrates two non-AI inputs for this attribute: expert scoring from domain professionals (e.g., automotive security researchers) and a dedicated machine learning classifier trained specifically on annotated CVE datasets. The expert input introduces contextual granularity, while the classifier brings scale and repeatability. These two inputs are averaged to produce a reliable value for attack complexity, enhancing the credibility of the hybrid model.
Following attribute extraction, the hybrid model feeds the final averaged attribute values into a CVSS v4.0-based scoring algorithm, which calculates a composite base score for each identified vulnerability. This final hybrid score reflects a multi-dimensional assessment that is data-driven, context-aware, and validated by human expertise. The modular architecture also allows for future extension, such as the integration of additional AI models or context-specific weight tuning.
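As an illustration of the hand-off into the scoring stage, a hypothetical helper might serialize the averaged attribute values into a CVSS-style vector string; the abbreviations follow the paper’s eight attributes rather than the exact official v4.0 vector format, and the scoring function that consumes the vector is assumed to be provided elsewhere.

```python
def to_vector(attrs: dict) -> str:
    """Serialize the eight averaged attributes (hypothetical helper) into a
    CVSS-style vector string for the downstream scoring algorithm."""
    order = ["AV", "AC", "PR", "UI", "S", "C", "I", "A"]
    return "/".join(f"{m}:{attrs[m]}" for m in order)

print(to_vector({"AV": "N", "AC": "L", "PR": "N", "UI": "N",
                 "S": "U", "C": "L", "I": "L", "A": "L"}))
# AV:N/AC:L/PR:N/UI:N/S:U/C:L/I:L/A:L
```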
Overall, this proposed hybrid Human–AI model highlights key limitations of automated vulnerability scoring in domain-specific settings. It bridges the gap between general-purpose AI capabilities and the contextual awareness required in safety-critical applications like autonomous vehicles. By combining statistical inference with expert judgment, the model enhances accuracy, reduces bias, and supports scalable, real-time security evaluations across sensor types and attack vectors.
In the implemented scoring workflow, expert judgment took precedence across all attributes. A dedicated machine learning classifier was trained on annotated CVE datasets to assist in estimating attack complexity, but its output served only as a secondary reference: domain professionals (e.g., automotive security researchers) evaluated all eight CVSS attributes, and the final values for every attribute were determined by expert evaluation, with only attack complexity incorporating machine learning input. The expert input ensured contextual accuracy across the full attribute set, while the machine learning component contributed additional scale and automation for that single attribute. These expert-evaluated attributes were then input into the CVSS v4.0-based scoring algorithm to compute a composite base score for each identified vulnerability, yielding an assessment that is context-aware, domain-informed, and guided by human expertise. The model remains modular and extensible, allowing for the future inclusion of AI components or custom weighting schemes as needed. By grounding automated assistance in expert oversight, this approach enhances accuracy, preserves contextual relevance, and supports scalable, real-time security evaluation in safety-critical domains such as autonomous vehicles.
To validate the performance and practical utility of the attack complexity classifier within the hybrid model, a series of experiments were conducted comparing multiple machine learning models on both imbalanced and balanced datasets. Several classifiers were evaluated, including traditional approaches such as Logistic Regression and Support Vector Machines (SVM), the Universal Sentence Encoder (USE), and Deep Learning models like BERT. The goal was to assess each model’s predictive power, particularly in handling the contextual nuances required for classifying attack complexity, while also evaluating computational efficiency.
The initial dataset exhibited a significant class imbalance, with 3350 samples labeled as LOW and only 310 as HIGH, resulting in an imbalance ratio of approximately 10.81:1. The performance of each model on this imbalanced dataset is summarized in Table 11. As shown, BERT achieved the highest accuracy (0.846), marginally outperforming traditional models such as TF-IDF + Logistic Regression and SVM (both at 0.821). However, BERT’s training time (30 min) was substantially higher than that of traditional models, which completed training in about one second.
To explore the effects of class balance, a second experiment was conducted using a balanced dataset of 500 samples, equally split between the two classes. As summarized in Table 12, traditional models again performed strongly. Both SVM and TF-IDF + Logistic Regression achieved an accuracy of 0.87 and high F1 scores, while maintaining minimal training times. In contrast, the USE models demonstrated varied performance, and tuning efforts did not consistently yield improvements.
These findings underscore three key insights: first, that addressing class imbalance is essential to model fairness and performance; second, that traditional machine learning models remain strong contenders, particularly in resource-constrained environments; and third, that complex models like BERT, while powerful, may not always justify their computational demands in every application context.
Ultimately, this evaluation supports the use of traditional machine learning models for the Attack Complexity classifier in the hybrid Human–AI scoring model. When combined with expert input, these models provide a reliable, efficient, and context-sensitive mechanism for assessing cyber vulnerabilities—especially in safety-critical domains like autonomous vehicles.

6.5. Model Evaluation

The classification reports provide a detailed breakdown of how well each AI model’s survey results align with the actual attack characteristics, leveraging precision, recall, and F1 score to understand performance in multi-class classification, especially with imbalanced classes. “Support” indicates the number of actual occurrences for each class.
The Gemini survey results show mixed performance, generally lagging behind the other models in overall accuracy across attributes. Gemini struggles significantly with Attack Vector (AV), exhibiting 0% precision and recall for the Adjacent and Physical classes, and lower precision (53.85%) for Network despite 100% recall. The model completely misses ‘None’ for Attack Requirements (AT), ‘High’ for Privileges Required (PR), ‘Low’ for Confidentiality (C), and ‘Low’ and ‘None’ for Integrity (I), as indicated by 0% recall for these categories. This suggests that Gemini tends to be overly cautious or biased towards predicting certain common classes, leading to poor performance on less frequent, minority classes.
Copilot demonstrates a more balanced and generally stronger performance than Gemini, with robust overall accuracies. It shows strong predictive capability for Attack Vector (AV) and Availability (A), excelling in particular on Physical AV with 80% precision and 100% recall, and it achieves 90.91% precision for the ‘None’ class of User Interaction (UI). However, Copilot also exhibits weaknesses in specific areas: it struggles with Low AC (60% precision, 42.86% recall) and completely misses ‘High’ PR and ‘Required’ UI (0% recall). For Confidentiality (C), recall is good for ‘High’ and ‘Low’, but precision for ‘Low’ is very low (25.00%).
DeepSeek’s survey results stand out as the strongest among the evaluated models, consistently achieving the highest overall accuracies. DeepSeek demonstrates strong predictive capability for Attack Complexity (AC) and Confidentiality (C), particularly ‘High’ C (85.71% precision, 75% recall) and ‘None’ C (100.00% precision, 83.33% recall), and it identifies the ‘Network’ class of Attack Vector (AV) with 100% recall. Nevertheless, DeepSeek also faces challenges with minority classes such as ‘Adjacent’ AV and ‘High’ PR (0% precision and recall for both).
ChatGPT’s survey results are generally better than Gemini’s but not as strong or consistent as Copilot’s or DeepSeek’s. ChatGPT performs exceptionally well on the ‘None’ class of User Interaction (UI) (92.31% precision, 85.71% recall). However, it struggles significantly with Attack Complexity (AC), especially ‘High’ AC (50.00% precision, 12.50% recall), and with Confidentiality (C), particularly ‘High’ C (100.00% precision but only 25.00% recall). Like the other models, it shows 0% precision and recall for several low-support classes.
The results of the traditional ML models show varying performance, with overall accuracies ranging from approximately 46.7% to 60%. The GloVe + Logistic Regression model demonstrates the weakest performance, particularly on the ‘High’ class, where very low recall and F1 score suggest a strong bias towards predicting ‘Low’ instances. In contrast, the SVM (TF-IDF) model emerges as the most balanced and robust performer, achieving the highest macro F1 score and more comparable F1 scores for the ‘High’ and ‘Low’ classes, indicating a better overall ability to classify both categories. The TF-IDF + Logistic Regression and USE + Logistic Regression models, while achieving reasonable accuracy (60% and 53.3%, respectively), exhibit a clear tendency to favor the ‘High’ class: they show high recall for ‘High’ but markedly lower recall and F1 scores for ‘Low’, implying that they often misclassify ‘Low’ instances as ‘High’. In conclusion, while no model achieves exceptionally high accuracy, SVM (TF-IDF) provides the most reliable and equitable performance across both classes, whereas the other models show a clear imbalance in their predictive capabilities.
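For reference, a minimal sketch of the TF-IDF featurization underlying the SVM (TF-IDF) and TF-IDF + Logistic Regression models is shown below. The vulnerability descriptions are invented for illustration; a real pipeline would use a library implementation (e.g., scikit-learn) plus the downstream classifier.

```python
import math

def tfidf_vectors(corpus):
    """Toy TF-IDF: term frequency weighted by idf = ln(N / df)."""
    docs = [doc.lower().split() for doc in corpus]
    n = len(docs)
    vocab = sorted({w for doc in docs for w in doc})
    df = {w: sum(1 for doc in docs if w in doc) for w in vocab}
    return [
        {w: (doc.count(w) / len(doc)) * math.log(n / df[w]) for w in vocab}
        for doc in docs
    ]

# Hypothetical vulnerability descriptions (not from the 117-entry dataset):
corpus = [
    "attacker injects spoofed radar frames",
    "attacker replays captured network frames",
]
vecs = tfidf_vectors(corpus)
# Terms shared by every document ("attacker", "frames") get zero weight;
# distinctive terms ("spoofed", "replays") carry the signal the classifier uses.
```

This also hints at why these shallow models plateau around 60%: the features capture word occurrence, not the contextual cues that distinguish High from Low complexity.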
To assess the effectiveness of AI-assisted vulnerability scoring, a hybrid approach was applied using outputs from four large language models (LLMs)—ChatGPT, DeepSeek, Copilot, and Google Gemini—together with a traditional machine learning model for Attack Complexity. The hybrid model aggregates predictions by majority voting among the LLM outputs, except for Attack Complexity, where the ML model’s prediction is used. Evaluation was conducted against expert annotations for eight CVSS v4.0 base metrics across 15 attacks. The following analysis details the hybrid model’s performance per attribute.
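The LLM aggregation step can be sketched as follows. The tie-break policy (earliest-listed model wins) is an assumption for illustration, as the text does not specify one.

```python
from collections import Counter

def majority_vote(votes):
    """Pick the most common per-metric prediction among the LLM outputs.

    `votes` is an ordered list of model answers; on a tie, Counter's
    insertion order lets the earliest-listed model win (assumed policy).
    """
    return Counter(votes).most_common(1)[0][0]

# Attack Vector votes for the Relay Attack (values from Table 13),
# in the order ChatGPT, DeepSeek, Copilot, Gemini:
votes = ["Physical", "Network", "Adjacent", "Network"]
hybrid = majority_vote(votes)  # "Network", matching the Hybrid column
```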
  • Attack Vector: The hybrid model correctly predicted 10 out of 15 values, a 66.7% match with expert annotations. Errors were most common in distinguishing between Local, Adjacent, and Network, whereas Physical and Network vectors were identified more consistently. A summary of these results is provided in Table 13, which highlights the model’s strengths and its misclassification patterns across the different attack vectors.
    Table 13. Attack vector classification results: hybrid vs. expert labels.
    Attack | ChatGPT | DeepSeek | Copilot | Gemini | Hybrid | Expert | Match
    Blinding Attack | Physical | Physical | Physical | Network | Physical | Physical | True
    Jamming Attack | Network | Network | Adjacent | Network | Network | Adjacent | False
    Black Hole Attack | Adjacent | Network | Network | Network | Network | Network | True
    Timing Attack | Network | Network | Local | Network | Network | Network | True
    Disruptive Attack | Network | Network | Physical | Network | Network | Physical | False
    Replay Attack | Network | Network | Network | Network | Network | Network | True
    Relay Attack | Physical | Network | Adjacent | Network | Network | Adjacent | False
    Eavesdropping Attack | Network | Network | Adjacent | Network | Network | Network | True
    Sybil Attack | Network | Network | Network | Network | Network | Network | True
    Blind Spot Exploitation | Physical | Physical | Physical | Local | Physical | Physical | True
    Sensor Interference Attack | Network | Physical | Physical | Network | Network | Adjacent | False
    Acoustic Attack | Physical | Physical | Physical | Local | Physical | Physical | True
    Impersonation Attack | Network | Network | Network | Network | Network | Network | True
    Falsified Information Attack | Network | Network | Network | Network | Network | Network | True
    Cloaking Attack | Network | Network | Network | Network | Network | Local | False
  • Attack Complexity: Predicted via machine learning, the model achieved 9 out of 15 correct predictions (60.0%), in line with its internal validation performance. The model recognized High complexity more reliably than Low, particularly when contextual factors were less obvious. These findings are summarized in Table 14, which highlights the model’s accuracy and the influence of context on classification performance.
    Table 14. Attack complexity classification results: ML hybrid model vs. expert labels.
    Attack | Hybrid Prediction | Expert Label | Match
    Blinding Attack | LOW | Low | True
    Jamming Attack | LOW | Low | True
    Black Hole Attack | HIGH | Low | False
    Timing Attack | LOW | High | False
    Disruptive Attack | HIGH | Low | False
    Replay Attack | HIGH | High | True
    Relay Attack | LOW | High | False
    Eavesdropping Attack | LOW | Low | True
    Sybil Attack | HIGH | High | True
    Blind Spot Exploitation | HIGH | Low | False
    Sensor Interference Attack | HIGH | Low | False
    Acoustic Attack | HIGH | High | True
    Impersonation Attack | HIGH | High | True
    Falsified Information Attack | HIGH | High | True
    Cloaking Attack | HIGH | High | True
  • Attack Requirements: Also reaching a 60.0% match (9/15), the hybrid model performed moderately in identifying whether specific conditions or requirements were necessary before an attack could be successfully executed. This metric appeared sensitive to implicit contextual cues, which may not be fully captured by LLMs. A detailed breakdown of these results is presented in Table 15, illustrating the model’s strengths and contextual limitations.
    Table 15. Attack requirements classification results: LLMs vs. expert labels.
    Attack | ChatGPT | DeepSeek | Copilot | Gemini | Hybrid | Expert | Match
    Blinding Attack | None | Present | Present | Present | Present | None | False
    Jamming Attack | None | Present | Present | Present | Present | None | False
    Black Hole Attack | Present | Present | Present | Present | Present | None | False
    Timing Attack | Present | Present | Present | Present | Present | Present | True
    Disruptive Attack | None | None | None | Present | None | None | True
    Replay Attack | None | Present | Present | Present | Present | None | False
    Relay Attack | None | Present | Present | Present | Present | Present | True
    Eavesdropping Attack | None | Present | Present | None | None | None | True
    Sybil Attack | Present | Present | Present | Present | Present | Present | True
    Blind Spot Exploitation | Present | Present | Present | Present | Present | Present | True
    Sensor Interference Attack | Present | Present | Present | Present | Present | None | False
    Acoustic Attack | Present | Present | Present | Present | Present | None | False
    Impersonation Attack | None | Present | Present | Present | Present | Present | True
    Falsified Information Attack | None | Present | Present | Present | Present | Present | True
    Cloaking Attack | Present | Present | Present | Present | Present | Present | True
  • Privileges Required: This metric showed the weakest performance, with only 7 out of 15 correct predictions (46.7%). Confusion was prevalent between None, Low, and High privileges, indicating a need for deeper contextual comprehension or enriched training data. These results are summarized in Table 16, which details the model’s confusion across privilege levels.
    Table 16. Privileges Required classification results: LLMs vs. expert labels.
    Attack | ChatGPT | DeepSeek | Copilot | Gemini | Hybrid | Expert | Match
    Blinding Attack | None | None | None | None | None | None | True
    Jamming Attack | None | None | None | None | None | Low | False
    Black Hole Attack | Low | None | None | Low | Low | High | False
    Timing Attack | None | Low | High | Low | Low | Low | True
    Disruptive Attack | None | None | None | Low | None | None | True
    Replay Attack | None | None | None | None | None | High | False
    Relay Attack | None | None | None | None | None | High | False
    Eavesdropping Attack | None | None | None | None | None | None | True
    Sybil Attack | None | None | Low | Low | None | High | False
    Blind Spot Exploitation | None | None | None | None | None | None | True
    Sensor Interference Attack | None | None | None | None | None | None | True
    Acoustic Attack | None | None | None | None | None | None | True
    Impersonation Attack | Low | Low | Low | Low | Low | None | False
    Falsified Information Attack | None | Low | None | Low | Low | High | False
    Cloaking Attack | High | None | None | Low | None | High | False
  • User Interaction: Achieving the highest accuracy (14/15, or 93.3%), the hybrid model was very effective at determining whether human intervention was necessary. However, it is worth noting that 14 of the 15 expert annotations were “None”, which may skew the perception of model generalizability. The corresponding results are provided in Table 17, highlighting the impact of class distribution on performance.
    Table 17. User interaction classification results: LLMs vs. expert labels.
    Attack | ChatGPT | DeepSeek | Copilot | Gemini | Hybrid | Expert | Match
    Blinding Attack | None | None | None | None | None | None | True
    Jamming Attack | None | None | None | None | None | None | True
    Black Hole Attack | None | None | None | None | None | None | True
    Timing Attack | None | None | None | None | None | None | True
    Disruptive Attack | None | None | None | None | None | None | True
    Replay Attack | None | None | None | None | None | Required | False
    Relay Attack | None | None | None | None | None | None | True
    Eavesdropping Attack | None | None | None | None | None | None | True
    Sybil Attack | None | None | None | None | None | None | True
    Blind Spot Exploitation | None | None | None | None | None | None | True
    Sensor Interference Attack | None | None | None | None | None | None | True
    Acoustic Attack | None | None | None | None | None | None | True
    Impersonation Attack | None | None | None | None | None | None | True
    Falsified Information Attack | None | None | None | None | None | None | True
    Cloaking Attack | None | None | None | None | None | None | True
  • Confidentiality: The model accurately predicted confidentiality impacts in 12 out of 15 cases (80.0%). Misclassifications largely occurred between None and Low, while High confidentiality impacts were predicted reliably. These results are summarized in Table 18, which presents the model’s classification performance across confidentiality levels.
    Table 18. Confidentiality classification results: LLMs vs. expert labels.
    Attack | ChatGPT | DeepSeek | Copilot | Gemini | Hybrid | Expert | Match
    Blinding Attack | None | None | High | None | None | None | True
    Jamming Attack | None | Low | Low | None | None | Low | False
    Black Hole Attack | None | Low | High | None | None | High | False
    Timing Attack | High | Low | High | High | High | High | True
    Disruptive Attack | None | None | Low | None | None | None | True
    Replay Attack | None | High | High | None | High | High | True
    Relay Attack | None | High | High | Low | High | High | True
    Eavesdropping Attack | High | High | High | High | High | High | True
    Sybil Attack | None | High | High | High | High | High | True
    Blind Spot Exploitation | None | None | Low | None | None | None | True
    Sensor Interference Attack | None | None | Low | None | None | None | True
    Acoustic Attack | None | None | None | Low | None | None | True
    Impersonation Attack | None | High | High | High | High | High | True
    Falsified Information Attack | None | High | High | High | High | None | False
    Cloaking Attack | None | High | High | High | High | High | True
  • Integrity: With 10 matches out of 15 (66.7%), the hybrid model showed strong performance in recognizing High integrity impacts. However, predictions involving None and Low were more error-prone, similar to the confidentiality results. The detailed results are presented in Table 19, highlighting the model’s effectiveness in identifying high-impact cases.
    Table 19. Integrity classification results: LLMs vs. expert labels.
    Attack | ChatGPT | DeepSeek | Copilot | Gemini | Hybrid | Expert | Match
    Blinding Attack | High | High | High | High | High | Low | False
    Jamming Attack | None | Low | Low | High | Low | High | False
    Black Hole Attack | Low | High | High | High | High | High | True
    Timing Attack | None | High | High | High | High | High | True
    Disruptive Attack | None | Low | Low | High | Low | High | False
    Replay Attack | High | High | High | High | High | High | True
    Relay Attack | High | High | High | High | High | High | True
    Eavesdropping Attack | None | None | None | None | None | High | False
    Sybil Attack | High | High | High | High | High | High | True
    Blind Spot Exploitation | High | High | High | High | High | None | False
    Sensor Interference Attack | High | High | High | High | High | None | False
    Acoustic Attack | Low | High | None | High | High | High | True
    Impersonation Attack | High | High | High | High | High | High | True
    Falsified Information Attack | High | High | High | High | High | High | True
    Cloaking Attack | High | High | High | High | High | High | True
  • Availability: This metric showed strong performance with 13 out of 15 matches (86.7%), and errors were limited to cases where the expert labeled the impact as Low or None. High availability impact cases were robustly captured. These results are detailed in Table 20, illustrating the model’s accuracy in high-severity scenarios.
    Table 20. Availability classification results: LLMs vs. expert labels.
    Attack | ChatGPT | DeepSeek | Copilot | Gemini | Hybrid | Expert | Match
    Blinding Attack | Low | High | High | High | High | High | True
    Jamming Attack | High | High | High | High | High | High | True
    Black Hole Attack | High | High | High | High | High | Low | False
    Timing Attack | None | High | Low | High | High | Low | False
    Disruptive Attack | High | High | High | High | High | High | True
    Replay Attack | None | Low | High | High | High | High | True
    Relay Attack | None | Low | High | High | High | None | False
    Eavesdropping Attack | None | None | None | None | None | None | True
    Sybil Attack | High | Low | High | High | High | High | True
    Blind Spot Exploitation | Low | High | High | High | High | High | True
    Sensor Interference Attack | High | High | High | High | High | High | True
    Acoustic Attack | High | High | High | High | High | High | True
    Impersonation Attack | None | Low | High | High | High | High | True
    Falsified Information Attack | High | Low | High | High | High | High | True
    Cloaking Attack | None | Low | High | High | High | High | True
  • Discussion and Conclusion: As shown in Figure 8 and Table 21, the hybrid AI model demonstrates promising performance in automating CVSS scoring, particularly for impact-related metrics. It performs strongly on User Interaction (93.3%), Availability (86.7%), and Confidentiality (80.0%), suggesting a solid grasp of direct impacts on system behavior and user involvement. However, noticeable gaps emerge for context-dependent attributes such as Privileges Required (46.7%), Attack Complexity (60.0%), and Attack Requirements (60.0%), indicating difficulty in assessing the contextual, precondition-based factors that influence exploitability and limitations in the LLMs’ ability to interpret implicit assumptions and access conditions. These components likely require more nuanced data or refined logic in the prediction algorithm. While the overall average accuracy stands at a reasonable 70%, the variation across metrics highlights the need for targeted improvements, particularly in privilege assessment and attack path analysis, to enhance the consistency and reliability of automated CVSS scoring tools. The consistent performance across several metrics supports the feasibility of hybrid AI models in vulnerability triage; yet metrics that hinge on nuanced human interpretation still necessitate expert validation. Future improvements could focus on prompt engineering, fine-tuned model training with contextualized examples, and the integration of more domain-specific rules.
    Figure 8. Accuracy by CVSS metric.
    Table 21. Summary of hybrid model prediction accuracy per CVSS base metric.
    Metric | Correct Predictions | Total Cases | Accuracy (%)
    Attack Vector | 10 | 15 | 66.7%
    Attack Complexity | 9 | 15 | 60.0%
    Attack Requirements | 9 | 15 | 60.0%
    Privileges Required | 7 | 15 | 46.7%
    User Interaction | 14 | 15 | 93.3%
    Confidentiality | 12 | 15 | 80.0%
    Integrity | 10 | 15 | 66.7%
    Availability | 13 | 15 | 86.7%
    Overall Average | 84 | 120 | 70.0%
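The accuracy figures in Table 21 follow directly from the Match columns of Tables 13–20; as a sanity check, the Attack Vector row can be recomputed from Table 13's match flags:

```python
def metric_accuracy(matches):
    """Share of attacks where the hybrid label matched the expert label."""
    return 100.0 * sum(matches) / len(matches)

# Match column of Table 13 (Attack Vector), in row order:
attack_vector_matches = [True, False, True, True, False, True, False, True,
                         True, True, False, True, True, True, False]
accuracy = metric_accuracy(attack_vector_matches)  # 66.66..., reported as 66.7%
```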
To evaluate the overall accuracy and practical applicability of the hybrid model, a heatmap comparison was made between the CVSS base scores derived from the model and those determined by domain experts. The heatmap provides a comparative view of four key quantities—Hybrid CVSS, Expert CVSS, standard deviation, and residuals—across the 15 attack scenarios. As shown in Figure 9, the Hybrid CVSS values demonstrated strong alignment with expert assessments across the 15 evaluated attacks.
The average CVSS score assigned by the hybrid model was found to be consistent with the expert mean scores in most cases, despite minor discrepancies. Notably, in several high-impact attacks—such as those scoring 9.2 and 8.7 from the hybrid model—the expert evaluations were slightly lower but still within the expected range when considering the standard deviation (ranging from 1.1 to 1.9). For instance, an attack scored at 9.2 by the hybrid model corresponded to an expert mean of 6.35 with a standard deviation of 1.7, placing the hybrid result within approximately 1.7 standard deviations of expert consensus, which is statistically acceptable.
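The consistency check described above can be made explicit by expressing the hybrid-expert residual in units of the expert standard deviation, which reproduces the roughly 1.7 SD figure quoted:

```python
def residual_in_sd_units(hybrid_score, expert_mean, expert_sd):
    """Distance of the hybrid CVSS score from the expert mean, in SD units."""
    return (hybrid_score - expert_mean) / expert_sd

# Values quoted in the text for one high-impact attack:
z = residual_in_sd_units(9.2, 6.35, 1.7)
# z ≈ 1.68, i.e. within roughly 1.7 expert standard deviations.
```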
Furthermore, the overall trend indicates that the hybrid model tends to slightly overestimate the severity compared to expert evaluations. However, considering the expert score variability (standard deviations averaging around 1.5), this bias remains within a tolerable and explainable range. The standard deviation also highlights the subjective nature of manual scoring, reinforcing the hybrid model’s consistency and reproducibility as a strength.
In conclusion, the heatmap reveals a general trend of overestimation by the Hybrid CVSS in several cases when compared to expert judgment. While there is moderate consistency for low-severity attacks, attention should be given to the higher residuals, which may indicate areas for model calibration or genuine expert disagreement. At the same time, the hybrid model exhibits strong potential in estimating CVSS base scores with a high degree of correlation to expert opinion, making it a valuable decision-support tool for vulnerability assessment. The integration of domain knowledge with AI modeling appears to bridge semantic gaps while maintaining accuracy and consistency within human variability margins.

6.6. Limitations

The analysis faced dataset limitations stemming from LLM capabilities: the maximum token length per prompt restricted the amount of data processed simultaneously, forcing compromises in dataset size or granularity. Additionally, the inherent structure of the CVSS data made it challenging to obtain normally distributed values for each attribute individually, as attribute combinations influenced their distributions. Finally, Copilot was excluded from the comparative analysis due to excessive processing times and repeated failures to generate meaningful classifications.

7. Conclusions and Future Work

This study introduces a novel hybrid model that integrates expert-driven analysis with generative AI capabilities to enhance vulnerability scoring for sensor systems in autonomous vehicles. Our literature review and expert validation identified 15 key attack types, with the LiDAR sensor being the most targeted (9 attacks). CVSS v4.0 scores from 10 experts revealed that the most severe threats—eavesdropping (7.19), Sybil (6.76), and replay (6.35)—are predominantly network-based, of low complexity (92.3%), and require no privileges (51.3%) or user interaction (82.9%). AI models, when applied to a 117-entry vulnerability dataset, performed well on common classes: DeepSeek achieved 99.07% F1 for network vectors, ChatGPT led in low-complexity extraction, and Gemini applied a worst-case impact principle that favored cautious scoring in ambiguous cases. However, all models failed to accurately capture rare high-impact classes, with F1 scores dropping to 0%–15% for attributes like High Complexity and Changed Scope. These results underscore the effectiveness of our hybrid approach in achieving better contextualized and semi-automated scoring, while also highlighting critical gaps in current AI model capabilities when applied to specialized domains like automotive cybersecurity. Future research will focus on four primary directions to enhance the proposed model:
  • Domain-Specific Fine-Tuning of AI Models: Current generative AI models lack training on automotive cybersecurity datasets. Future work should involve fine-tuning LLMs using domain-specific corpora, including CVEs, threat intelligence feeds, and technical whitepapers related to vehicular sensor systems.
  • Integration of Real-Time Threat Intelligence: Enhancing the model with dynamic threat feeds and anomaly detection tools could improve responsiveness to emerging attack vectors and zero-day vulnerabilities.
  • Multi-Modal Vulnerability Analysis: Future iterations should consider incorporating additional data modalities (e.g., sensor logs and CAN bus traffic) for deeper behavioral analysis and richer context in vulnerability assessment.
  • Improved Scoring Mechanisms for Physical Attacks: Since physical-layer threats are currently underweighted by CVSS and misclassified by AI models, research should aim to augment or revise scoring criteria to better capture their practical implications in vehicular environments.
By pursuing these avenues, the proposed model can evolve into a robust, scalable, and adaptive solution for next-generation automotive cybersecurity risk assessment.

Author Contributions

Conceptualization, H.K.A. and I.T.A.H.; methodology, H.K.A. and I.T.A.H.; software, M.S.F.; validation, M.S.F., H.K.A. and I.T.A.H.; formal analysis, I.T.A.H. and M.S.F.; investigation, H.K.A. and M.S.F.; resources, M.S.F., H.K.A. and I.T.A.H.; data curation, M.S.F., H.K.A. and I.T.A.H.; writing—original draft preparation, M.S.F.; writing—review and editing, M.S.F., H.K.A. and I.T.A.H.; visualization, M.S.F.; supervision, H.K.A. and I.T.A.H.; project administration, H.K.A. and I.T.A.H.; funding acquisition, M.S.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. CVSS model.
Figure 2. In-vehicle network.
Figure 3. V2X communication.
Figure 4. Vehicle diagnostics.
Figure 5. F1 score comparison by CVSS attribute and model.
Figure 7. The proposed hybrid Human–AI model.
Figure 9. Residual analysis between Hybrid and Expert CVSS scores.
Table 2. Survey results.
Legend: AV = Attack Vector (P = Physical, N = Network, A = Adjacent, L = Local); AC = Attack Complexity (L = Low, H = High); AT = Attack Requirements (N = None, P = Present); PR = Privileges Required and C/I/A = Confidentiality/Integrity/Availability impact (N = None, L = Low, H = High); UI = User Interaction (N = None, R = Required). Percentages are the share of the 10 surveyed experts selecting each value.

| Attack Type | AV | AC | AT | PR | UI | C | I | A | Avg. CVSS Score (Std. Dev.) |
|---|---|---|---|---|---|---|---|---|---|
| Blinding Attack | P 50% / N 20% / A 20% / L 10% | L 70% / H 30% | N 80% / P 20% | N 80% / L 10% / H 10% | N 70% / R 30% | N 50% / L 40% / H 10% | N 20% / L 60% / H 20% | N 10% / L 30% / H 60% | 5.09 (1.5) |
| Jamming Attack | P 30% / N 20% / A 50% / L 10% | L 70% / H 30% | N 70% / P 30% | N 60% / L 30% / H 10% | N 90% / R 10% | N 90% / L 10% / H 0% | N 30% / L 50% / H 20% | N 0% / L 40% / H 60% | 5.06 (1.2) |
| Black Hole Attack | P 0% / N 90% / A 10% / L 0% | L 80% / H 20% | N 70% / P 30% | N 20% / L 50% / H 30% | N 60% / R 40% | N 30% / L 50% / H 20% | N 30% / L 20% / H 50% | N 0% / L 40% / H 60% | 6.0 (1.8) |
| Timing Attack | P 10% / N 60% / A 10% / L 20% | L 20% / H 80% | N 30% / P 70% | N 10% / L 60% / H 30% | N 70% / R 30% | N 10% / L 40% / H 50% | N 10% / L 40% / H 50% | N 10% / L 50% / H 40% | 5.72 (1.6) |
| Disruptive Attack | P 70% / N 10% / A 10% / L 10% | L 60% / H 40% | N 60% / P 40% | N 40% / L 30% / H 30% | N 60% / R 40% | N 50% / L 20% / H 30% | N 30% / L 10% / H 60% | N 0% / L 10% / H 90% | 4.66 (1.9) |
| Replay Attack | P 0% / N 70% / A 30% / L 0% | L 20% / H 80% | N 80% / P 20% | N 20% / L 10% / H 70% | N 30% / R 70% | N 10% / L 10% / H 80% | N 0% / L 10% / H 90% | N 30% / L 0% / H 70% | 6.35 (1.7) |
| Relay Attack | P 0% / N 30% / A 60% / L 10% | L 30% / H 70% | N 20% / P 80% | N 20% / L 30% / H 50% | N 60% / R 40% | N 0% / L 40% / H 60% | N 20% / L 20% / H 60% | N 60% / L 20% / H 20% | 5.43 (1.4) |
| Eavesdropping Attack | P 0% / N 60% / A 10% / L 30% | L 90% / H 10% | N 60% / P 40% | N 50% / L 30% / H 20% | N 80% / R 20% | N 0% / L 20% / H 80% | N 40% / L 10% / H 50% | N 60% / L 0% / H 40% | 7.19 (1.3) |
| Sybil Attack | P 0% / N 70% / A 30% / L 0% | L 10% / H 90% | N 10% / P 90% | N 40% / L 0% / H 60% | N 60% / R 40% | N 20% / L 20% / H 60% | N 0% / L 20% / H 80% | N 10% / L 20% / H 70% | 6.76 (1.6) |
| Blind Spot Exploitation | P 50% / N 10% / A 40% / L 0% | L 70% / H 30% | N 60% / P 40% | N 70% / L 10% / H 20% | N 80% / R 20% | N 80% / L 10% / H 10% | N 50% / L 20% / H 30% | N 10% / L 40% / H 50% | 4.4 (1.1) |
| Acoustic Attack | P 20% / N 10% / A 70% / L 0% | L 80% / H 20% | N 70% / P 30% | N 90% / L 10% / H 0% | N 90% / R 10% | N 70% / L 30% / H 0% | N 20% / L 60% / H 20% | N 20% / L 30% / H 50% | 5.03 (1.5) |
| Sensor Interference Attack | P 70% / N 0% / A 20% / L 10% | L 20% / H 80% | N 60% / P 40% | N 60% / L 20% / H 20% | N 80% / R 20% | N 50% / L 30% / H 20% | N 10% / L 30% / H 60% | N 30% / L 30% / H 40% | 4.06 (1.7) |
| Impersonation Attack | P 10% / N 40% / A 30% / L 20% | L 20% / H 80% | N 10% / P 90% | N 60% / L 10% / H 30% | N 60% / R 40% | N 20% / L 30% / H 50% | N 10% / L 30% / H 60% | N 30% / L 30% / H 40% | 5.41 (1.4) |
| Falsified Information Attack | P 10% / N 50% / A 40% / L 0% | L 20% / H 80% | N 30% / P 70% | N 40% / L 10% / H 50% | N 60% / R 40% | N 50% / L 10% / H 40% | N 10% / L 10% / H 80% | N 20% / L 30% / H 50% | 6.22 (1.8) |
| Cloaking Attack | P 10% / N 30% / A 20% / L 40% | L 10% / H 90% | N 10% / P 90% | N 20% / L 30% / H 50% | N 80% / R 20% | N 30% / L 0% / H 70% | N 0% / L 10% / H 90% | N 20% / L 30% / H 50% | 6.66 (1.5) |
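One simple way to turn a vote distribution like the rows above into a single consensus rating is to take the modal (most-voted) value per metric and assemble a CVSS-style vector string. The sketch below does this for the Eavesdropping Attack row; folding the table's C/I/A columns onto the CVSS v4.0 VC/VI/VA metrics, and keeping the v3.x-style "R = Required" for UI, are simplifying assumptions on my part, not the authors' exact aggregation procedure.

```python
# Expert vote shares for the Eavesdropping Attack row of Table 2.
# Keys follow the table's metric abbreviations; C/I/A are mapped onto the
# CVSS v4.0 vulnerable-system impact metrics VC/VI/VA (an assumption).
eavesdropping_votes = {
    "AV": {"P": 0.0, "N": 0.6, "A": 0.1, "L": 0.3},
    "AC": {"L": 0.9, "H": 0.1},
    "AT": {"N": 0.6, "P": 0.4},
    "PR": {"N": 0.5, "L": 0.3, "H": 0.2},
    "UI": {"N": 0.8, "R": 0.2},  # v3.x-style None/Required, kept as in the table
    "VC": {"N": 0.0, "L": 0.2, "H": 0.8},
    "VI": {"N": 0.4, "L": 0.1, "H": 0.5},
    "VA": {"N": 0.6, "L": 0.0, "H": 0.4},
}

def consensus_vector(votes: dict) -> str:
    """Pick the modal (most-voted) value per metric and format a vector string."""
    parts = [f"{metric}:{max(dist, key=dist.get)}" for metric, dist in votes.items()]
    return "CVSS:4.0/" + "/".join(parts)

print(consensus_vector(eavesdropping_votes))
# -> CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:H/VI:H/VA:N
```

The modal vector matches the qualitative picture in the table: a network-reachable, low-complexity attack with high confidentiality impact, consistent with eavesdropping receiving the highest expert score (7.19).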
Table 3. ChatGPT survey result.
| Attack Type | AV | AC | AT | PR | UI | C | I | A | Avg. CVSS Score |
|---|---|---|---|---|---|---|---|---|---|
| Blinding Attack | Physical | Low | None | None | None | None | High | Low | 5.2 |
| Jamming Attack | Network | Low | None | None | None | None | None | High | 8.7 |
| Black Hole Attack | Adjacent | Low | Present | Low | None | None | Low | High | 5.9 |
| Timing Attack | Network | High | Present | None | None | High | None | None | 8.2 |
| Disruptive Attack | Network | Low | None | None | None | None | None | High | 8.7 |
| Replay Attack | Network | Low | None | None | None | None | High | None | 8.7 |
| Relay Attack | Physical | Low | None | None | None | None | High | None | 5.1 |
| Eavesdropping Attack | Network | Low | None | None | None | High | None | None | 8.7 |
| Sybil Attack | Network | Low | Present | None | None | None | High | High | 8.3 |
| Blind Spot Exploitation | Physical | Low | Present | None | None | None | High | Low | 4.3 |
| Sensor Interference Attack | Network | Low | Present | None | None | None | High | High | 8.3 |
| Acoustic Attack | Physical | High | Present | None | None | None | Low | High | 4.3 |
| Impersonation Attack | Network | Low | None | Low | None | None | High | None | 7.1 |
| Falsified Information Attack | Network | Low | None | None | None | None | High | High | 8.8 |
| Cloaking Attack | Network | High | Present | High | None | None | High | None | 5.9 |
Table 4. DeepSeek survey result.
| Attack Type | AV | AC | AT | PR | UI | C | I | A | Avg. CVSS Score |
|---|---|---|---|---|---|---|---|---|---|
| Blinding Attack | Physical | Low | Present | None | None | None | High | High | 4.4 |
| Jamming Attack | Network | Low | Present | None | None | Low | Low | High | 8.3 |
| Black Hole Attack | Network | Low | Present | None | None | Low | High | High | 8.4 |
| Timing Attack | Network | High | Present | Low | None | Low | High | High | 6.1 |
| Disruptive Attack | Network | Low | None | None | None | None | Low | High | 8.8 |
| Replay Attack | Network | Low | Present | None | None | High | High | Low | 9.2 |
| Relay Attack | Network | Low | Present | None | None | High | High | Low | 9.2 |
| Eavesdropping Attack | Network | Low | Present | None | None | High | None | None | 8.2 |
| Sybil Attack | Network | High | Present | None | None | High | High | Low | 9.2 |
| Blind Spot Exploitation | Physical | Low | Present | None | None | None | High | High | 4.4 |
| Sensor Interference Attack | Physical | Low | Present | None | None | None | High | High | 4.4 |
| Acoustic Attack | Physical | High | Present | None | None | None | High | High | 4.4 |
| Impersonation Attack | Network | High | Present | Low | None | High | High | Low | 9.2 |
| Falsified Information Attack | Network | High | Present | Low | None | High | High | Low | 7.6 |
| Cloaking Attack | Network | High | Present | None | None | High | High | Low | 9.2 |
Table 5. Copilot survey result.
| Attack Type | AV | AC | AT | PR | UI | C | I | A | Avg. CVSS Score |
|---|---|---|---|---|---|---|---|---|---|
| Blinding Attack | Physical | High | Present | None | None | High | High | High | 5.4 |
| Jamming Attack | Adjacent | Low | Present | None | None | Low | Low | High | 6.1 |
| Black Hole Attack | Network | Low | Present | None | None | High | High | High | 9.2 |
| Timing Attack | Local | High | Present | High | None | High | High | Low | 7.1 |
| Disruptive Attack | Physical | High | Present | None | None | Low | Low | High | 4.4 |
| Replay Attack | Network | Low | Present | None | None | High | High | High | 9.2 |
| Relay Attack | Adjacent | Low | Present | None | None | High | High | High | 7.7 |
| Eavesdropping Attack | Adjacent | Low | None | None | None | High | None | None | 7.1 |
| Sybil Attack | Network | High | Present | Low | None | High | High | High | 7.7 |
| Blind Spot Exploitation | Physical | High | Present | None | None | Low | High | High | 4.6 |
| Sensor Interference Attack | Physical | High | Present | None | None | Low | High | High | 4.6 |
| Acoustic Attack | Physical | High | Present | None | None | None | None | High | 4.1 |
| Impersonation Attack | Network | High | Present | Low | None | High | High | High | 7.7 |
| Falsified Information Attack | Network | High | Present | None | None | High | High | High | 9.2 |
| Cloaking Attack | Network | High | Present | None | None | High | High | High | 9.2 |
Table 6. Gemini survey result.
| Attack Type | AV | AC | AT | PR | UI | C | I | A | Avg. CVSS Score |
|---|---|---|---|---|---|---|---|---|---|
| Blinding Attack | Network | High | Present | None | None | None | High | High | 8.3 |
| Jamming Attack | Network | Low | Present | None | None | None | High | High | 8.3 |
| Black Hole Attack | Network | High | Present | Low | None | None | High | High | 6.1 |
| Timing Attack | Network | High | Present | Low | None | High | High | High | 7.7 |
| Disruptive Attack | Network | High | Present | Low | None | None | High | High | 6.1 |
| Replay Attack | Network | High | Present | None | None | None | High | High | 8.3 |
| Relay Attack | Network | High | Present | None | None | Low | High | High | 8.4 |
| Eavesdropping Attack | Network | Low | Present | None | None | High | None | None | 8.2 |
| Sybil Attack | Network | High | Present | Low | None | High | High | High | 7.7 |
| Blind Spot Exploitation | Local | Low | Present | None | None | None | High | High | 5.9 |
| Sensor Interference Attack | Network | High | Present | None | None | None | High | High | 8.3 |
| Acoustic Attack | Local | High | Present | None | None | Low | High | High | 6.0 |
| Impersonation Attack | Network | High | Present | Low | None | High | High | High | 7.7 |
| Falsified Information Attack | Network | High | Present | Low | None | High | High | High | 7.7 |
| Cloaking Attack | Network | High | Present | Low | None | High | High | High | 7.7 |
Table 7. ChatGPT evaluation.
| Metric | Class | Precision | Recall | F1 Score | Support | Accuracy |
|---|---|---|---|---|---|---|
| Attack Vector | NETWORK | 89.13% | 84.54% | 86.77% | 97 | 80.62% |
|  | LOCAL | 66.67% | 74.07% | 70.18% | 27 |  |
|  | ADJACENT | 28.57% | 40.00% | 33.33% | 5 |  |
| Attack Complexity | LOW | 90.55% | 95.83% | 93.12% | 120 | 87.12% |
|  | HIGH | 0% | 0% | 0% | 12 |  |
| Scope | CHANGED | 60.00% | 30.00% | 40.00% | 20 | 85.25% |
|  | UNCHANGED | 87.50% | 96.08% | 91.59% | 102 |  |
| User Interaction | REQUIRED | 66.67% | 40.00% | 50.00% | 30 | 81.82% |
|  | NONE | 84.21% | 94.12% | 88.89% | 102 |  |
| Privilege Required | NONE | 78.13% | 83.33% | 80.65% | 60 | 69.75% |
|  | LOW | 66.67% | 66.67% | 66.67% | 45 |  |
|  | HIGH | 30.00% | 21.43% | 24.99% | 14 |  |
| Confidentiality | NONE | 72.92% | 70.00% | 71.43% | 50 | 59.83% |
|  | LOW | 53.19% | 62.50% | 57.47% | 40 |  |
|  | HIGH | 45.45% | 37.04% | 40.82% | 27 |  |
| Integrity | NONE | 69.77% | 66.67% | 68.18% | 45 | 59.28% |
|  | LOW | 56.00% | 62.22% | 58.95% | 45 |  |
|  | HIGH | 45.00% | 39.13% | 41.86% | 23 |  |
| Availability | NONE | 62.86% | 62.86% | 62.86% | 35 | 51.61% |
|  | LOW | 51.43% | 51.43% | 51.43% | 35 |  |
|  | HIGH | 34.78% | 34.78% | 34.78% | 23 |  |
Table 8. DeepSeek evaluation.
| Metric | Class | Precision | Recall | F1 Score | Accuracy |
|---|---|---|---|---|---|
| Attack Vector | NETWORK | 98.15% | 100.00% | 99.07% | 98.29% |
|  | LOCAL | 100.00% | 81.82% | 89.92% |  |
| Attack Complexity | LOW | 93.91% | 98.18% | 96.00% | 92.31% |
|  | HIGH | 0% | 0% | 0% |  |
| Scope | CHANGED | 62.50% | 38.46% | 47.62% | 90.60% |
|  | UNCHANGED | 92.66% | 97.12% | 94.81% |  |
| User Interaction | REQUIRED | 75.00% | 44.44% | 55.81% | 83.76% |
|  | NONE | 85.15% | 95.56% | 90.00% |  |
| Privilege Required | NONE | 90.00% | 93.75% | 91.84% | 84.21% |
|  | LOW | 80.65% | 75.76% | 78.13% |  |
|  | HIGH | 71.43% | 71.43% | 71.43% |  |
| Confidentiality | NONE | 83.33% | 85.71% | 84.51% | 75.58% |
|  | LOW | 68.97% | 68.97% | 68.97% |  |
|  | HIGH | 71.43% | 68.18% | 69.77% |  |
| Integrity | NONE | 83.33% | 83.33% | 83.33% | 73.17% |
|  | LOW | 66.67% | 73.33% | 69.84% |  |
|  | HIGH | 68.42% | 59.09% | 63.41% |  |
| Availability | NONE | 72.00% | 78.26% | 75.00% | 65.15% |
|  | LOW | 65.22% | 60.00% | 62.50% |  |
|  | HIGH | 52.63% | 55.56% | 54.05% |  |
Table 9. Gemini evaluation.
| Metric | Class | Precision | Recall | F1 Score | Accuracy |
|---|---|---|---|---|---|
| Attack Vector | NETWORK | 97.14% | 98.08% | 97.60% | 95.73% |
|  | LOCAL | 83.33% | 76.92% | 80.00% |  |
| Attack Complexity | LOW | 94.83% | 95.65% | 95.24% | 90.98% |
|  | HIGH | 16.67% | 14.29% | 15.38% |  |
| Scope | CHANGED | 61.54% | 40.00% | 48.48% | 86.07% |
|  | UNCHANGED | 89.00% | 95.10% | 91.95% |  |
| User Interaction | REQUIRED | 82.35% | 43.75% | 57.14% | 82.50% |
|  | NONE | 82.52% | 96.59% | 89.01% |  |
| Privilege Required | NONE | 87.72% | 90.91% | 89.29% | 80.37% |
|  | LOW | 77.78% | 73.68% | 75.68% |  |
|  | HIGH | 57.14% | 57.14% | 57.14% |  |
| Confidentiality | NONE | 83.33% | 87.50% | 85.37% | 77.55% |
|  | LOW | 69.44% | 75.76% | 72.46% |  |
|  | HIGH | 80.00% | 64.00% | 71.11% |  |
| Integrity | NONE | 80.00% | 80.00% | 80.00% | 71.43% |
|  | LOW | 69.77% | 75.00% | 72.29% |  |
|  | HIGH | 60.00% | 52.17% | 55.81% |  |
| Availability | NONE | 68.97% | 74.07% | 71.43% | 63.75% |
|  | LOW | 66.67% | 60.00% | 63.16% |  |
|  | HIGH | 54.17% | 56.52% | 55.32% |  |
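The per-class scores in the evaluation tables follow the standard one-vs-rest definitions of precision, recall, and F1. A minimal self-contained sketch on toy labels (not the paper's 117-vulnerability dataset) shows the computation:

```python
def per_class_prf(y_true, y_pred, positive):
    """One-vs-rest precision, recall, and F1 for a single class label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy Attack Vector annotations: 4 NETWORK and 2 LOCAL ground-truth labels.
y_true = ["NETWORK", "NETWORK", "NETWORK", "NETWORK", "LOCAL", "LOCAL"]
y_pred = ["NETWORK", "NETWORK", "NETWORK", "LOCAL", "LOCAL", "NETWORK"]

print(per_class_prf(y_true, y_pred, "NETWORK"))  # -> (0.75, 0.75, 0.75)
```

These definitions also explain the 0% F1 entries for the HIGH complexity class: with only a handful of HIGH examples in the dataset, a model that never predicts HIGH gets zero true positives, and precision, recall, and F1 all collapse to zero.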
Table 11. Model performance on imbalanced dataset.
| Model | Accuracy | Precision (L) | Precision (H) | Recall (L) | Recall (H) | F1 Score (L) | F1 Score (H) | Macro F1 |
|---|---|---|---|---|---|---|---|---|
| TF-IDF + Logistic Regression | 0.821 | 0.882 | 0.400 | 0.909 | 0.333 | 0.896 | 0.364 | 0.630 |
| BERT | 0.846 | 0.865 | 0.500 | 0.970 | 0.167 | 0.914 | 0.250 | 0.582 |
| USE | 0.795 | 0.879 | 0.333 | 0.879 | 0.333 | 0.879 | 0.333 | 0.606 |
| SVM | 0.821 | 0.882 | 0.400 | 0.909 | 0.333 | 0.896 | 0.364 | 0.630 |
| BERT_Enhanced | 0.821 | 0.882 | 0.400 | 0.909 | 0.333 | 0.896 | 0.364 | 0.630 |
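The Macro F1 column is the unweighted mean of the two per-class F1 scores, which is why the weak High-class performance drags it well below accuracy, e.g. (0.896 + 0.364) / 2 = 0.630 for the first row:

```python
def macro_f1(f1_low: float, f1_high: float) -> float:
    """Unweighted mean of per-class F1 scores for the two severity classes."""
    return (f1_low + f1_high) / 2

print(round(macro_f1(0.896, 0.364), 3))  # TF-IDF + Logistic Regression row -> 0.63
print(round(macro_f1(0.914, 0.250), 3))  # BERT row -> 0.582
```

Because macro averaging gives the minority High class equal weight, it is the more honest summary metric for this imbalanced dataset.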
Table 12. Model performance on balanced dataset.
| Model | Accuracy | Precision (L) | Recall (L) | F1 Score (L) | Precision (H) | Recall (H) | F1 Score (H) | Time |
|---|---|---|---|---|---|---|---|---|
| TF-IDF + Logistic Regression | 0.87 | 0.95 | 0.78 | 0.86 | 0.81 | 0.96 | 0.88 | 1 s |
| USE | 0.70 | 0.72 | 0.66 | 0.69 | 0.69 | 0.74 | 0.71 | 211 s |
| SVM (TF-IDF) | 0.87 | 0.93 | 0.80 | 0.86 | 0.82 | 0.94 | 0.88 | 1 s |
| Best Tuned USE (Threshold = 0.5) | 0.63 | 0.69 | 0.48 | 0.56 | 0.60 | 0.78 | 0.68 | 51 s |
| Tuned USE (Custom Threshold) | 0.64 | 1.00 | 0.28 | 0.44 | 0.58 | 1.00 | 0.74 | 1 s |
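The TF-IDF + Logistic Regression baseline in Tables 11 and 12 can be sketched as a scikit-learn pipeline. This is an illustrative reconstruction, not the authors' exact setup: the vulnerability descriptions and L/H severity labels below are hypothetical placeholders for the annotated dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical vulnerability descriptions labeled L (low) or H (high severity).
texts = [
    "remote attacker floods the CAN bus over the network",
    "network replay of captured V2X messages alters vehicle state",
    "physical access required to blind the LiDAR sensor briefly",
    "local ultrasonic interference causes minor parking-assist noise",
]
labels = ["H", "H", "L", "L"]

# TF-IDF features feeding a logistic-regression classifier, as in Table 11.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

print(clf.predict(["attacker injects forged messages over the network"]))
```

The roughly 1 s training time reported for this baseline reflects how lightweight the pipeline is compared with the USE embedding models, which is part of why it matches or beats them here despite its simplicity.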