In this paper, we propose a vulnerability severity assessment model integrating clustering feature analysis and the Transformer method, which comprises four hierarchical layers, depicted in
Figure 2. First, the word-embedding layer processes vulnerability information text through data cleaning and lexical preprocessing, converting tokens into node vectors. This layer incorporates bidirectional text encoding and token masking mechanisms to enhance the positional awareness of Common Vulnerability Scoring System vector strings [
25]. Second, the TextCNN-based [26] feature extraction layer captures local textual patterns while integrating clustered features. Third, the Transformer-TextCNN feature fusion layer amalgamates vulnerability characteristics extracted via clustering, further strengthening the model’s positional sensitivity to vector strings. Finally, the fully connected output classification layer synthesizes the distributed features, generates one-hot probability distributions, and produces the predicted classification.
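To make the four-layer structure concrete, the following PyTorch-style skeleton sketches one way the layers could be wired together; the module choices, kernel sizes, and dimensions are illustrative assumptions rather than the exact implementation described above.

```python
import torch
import torch.nn as nn

class SeverityAssessmentModel(nn.Module):
    """Illustrative four-layer skeleton: embedding -> TextCNN -> Transformer fusion -> FC output."""
    def __init__(self, vocab_size, embed_dim=768, num_classes=4):
        super().__init__()
        # 1) Word-embedding layer (stands in for the bidirectional, token-masked text encoder).
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # 2) TextCNN feature extraction layer capturing local textual patterns.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, 128, kernel_size=k) for k in (2, 3, 4, 5)]
        )
        # 3) Transformer-TextCNN feature fusion layer (self-attention, residuals, layer norm).
        self.fusion = nn.TransformerEncoderLayer(d_model=128 * 4, nhead=8, batch_first=True)
        # 4) Fully connected output classification layer producing one-hot probabilities.
        self.classifier = nn.Linear(128 * 4, num_classes)

    def forward(self, token_ids):
        x = self.embedding(token_ids).transpose(1, 2)                 # (batch, embed_dim, seq_len)
        feats = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        fused = self.fusion(torch.cat(feats, dim=1).unsqueeze(1)).squeeze(1)
        return self.classifier(fused)                                  # logits per severity class
```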
3.1. Clustering Feature Analysis
This paper takes clustering-based data mining analysis as its foundational motivation, uncovering a novel method for predicting CVSS scores: the results of the clustering feature analysis serve as feature weights during model training.
Given the complexity of raw textual data, dimensionality reduction becomes essential for processing the extracted features from Common Vulnerabilities and Exposures vulnerability text. Principal Component Analysis (PCA) [
27] serves as the dimensionality reduction technique, with empirical findings indicating optimal performance at two dimensions. This approach sufficiently preserves feature integrity while significantly streamlining data complexity [
28]. Subsequent standardization reveals latent vulnerability behavioral patterns within the data. The methodology analyzes the textual information in depth through the CVSS framework to examine inter-component relationships across vulnerability elements. K-Means clustering then uncovers underlying threat patterns from this analysis. In parallel, a Gaussian Mixture Model is fitted to investigate the Gaussian distributions across the vulnerability datasets, simulating vulnerability propagation dynamics throughout their lifecycle evolution [29].
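A minimal sketch of this preprocessing pipeline follows, assuming TF-IDF features extracted from the CVE description text; the vectorizer, parameter values, and the order of standardization before PCA are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def cluster_vulnerability_text(descriptions, n_clusters=4):
    """Reduce CVE description features to 2 dimensions, then cluster with K-Means and a GMM."""
    # Extract bag-of-words style features from the raw vulnerability descriptions.
    features = TfidfVectorizer(max_features=2000).fit_transform(descriptions).toarray()
    # Standardize, then reduce to two principal components (empirically sufficient here).
    reduced = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(features))
    # K-Means uncovers hard threat-pattern clusters.
    kmeans_labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(reduced)
    # A Gaussian Mixture Model captures soft, Gaussian-shaped cluster structure.
    gmm = GaussianMixture(n_components=n_clusters, random_state=0).fit(reduced)
    return reduced, kmeans_labels, gmm.predict_proba(reduced)
```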
The desired outcome for each cluster is maximized intra-cluster similarity and minimized inter-cluster similarity. To quantify cluster separability, this methodology introduces the Minkowski distance
$$d(x, y) = \left( \sum_{i=1}^{n} \lvert x_i - y_i \rvert^{p} \right)^{1/p}$$
and the cosine similarity
$$\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} y_i^{2}}}$$
as dual quantitative metrics. In the formulas above, $d(x, y)$ denotes the separation distance between clusters, $n$ is the defined dimensionality (here $n = 2$ after PCA), $p$ is the order of the distance, and $x_i$ and $y_i$ represent the coordinates of the two compared points along dimension $i$, i.e., their horizontal and vertical separation in the reduced space.
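Both metrics can be computed directly; the small NumPy sketch below illustrates this for two cluster centroids in the 2-D PCA space (the variable names and example values are illustrative).

```python
import numpy as np

def minkowski_distance(x, y, p=2):
    """Order-p Minkowski distance between two points (p=1 Manhattan, p=2 Euclidean)."""
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p))

def cosine_similarity(x, y):
    """Cosine of the angle between two feature vectors; 1 means identical direction."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Example: separation between two hypothetical cluster centroids in the 2-D PCA space.
c1, c2 = np.array([0.8, -1.2]), np.array([-0.5, 0.9])
print(minkowski_distance(c1, c2, p=2), cosine_similarity(c1, c2))
```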
By integrating the following three comparative methods and applying the elbow criterion, it is possible to determine an effective combination of dimensionality reduction parameters and cluster count [
30]:
(1) Sum of Squared Errors Criterion: The sum of squared errors (SSE) measures how well the model fits the data by summing the squared differences between predicted and actual values; a smaller SSE indicates better clustering performance:
$$SSE = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^{2}$$
where $y_i$ denotes the observed values and $\hat{y}_i$ the predicted values, summed over all samples. This formula quantifies the discrepancy between predicted and observed values.
(2) Silhouette Coefficient: This metric evaluates the impact of different algorithms or algorithmic configurations on clustering results using identical raw data and quantifies how well each data point fits into its assigned cluster relative to other clusters. Normalized within [−1, 1], values closer to 1 signify superior intra-cluster cohesion and inter-cluster separation.
The silhouette coefficient of a sample $i$ is defined as
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad a(i) = \frac{1}{\lvert C_I \rvert - 1} \sum_{j \in C_I,\, j \neq i} d(i, j), \qquad b(i) = \min_{J \neq I} \frac{1}{\lvert C_J \rvert} \sum_{j \in C_J} d(i, j)$$
Specifically, $a(i)$ measures the cohesion of a sample point, where $j$ denotes the other samples within the same cluster $C_I$ as sample $i$, and $d(i, j)$ computes the distance between $i$ and $j$. $b(i)$ is computed similarly to $a(i)$, but requires iterating through the other clusters $C_J$ to obtain multiple values, with the minimum selected as the final result. A smaller $a(i)$ indicates tighter cluster cohesion.
(3) Calinski–Harabasz Index: This index quantifies the compactness and separation of clusters as the ratio of between-cluster dispersion to within-cluster dispersion; higher values denote better clustering outcomes:
$$CH = \frac{\operatorname{tr}(B_k)}{\operatorname{tr}(W_k)} \times \frac{n - k}{k - 1}$$
where $CH$ represents the Calinski–Harabasz Index, $B_k$ denotes the between-cluster scatter matrix (reflecting dispersion among cluster centroids), $W_k$ is the within-cluster scatter matrix (indicating dispersion among samples within the same cluster), $n$ indicates the total number of vulnerability data points, and $k$ represents the number of clusters.
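All three criteria are available in scikit-learn, so candidate cluster counts can be compared in a few lines; the sketch below assumes `reduced` is the 2-D PCA output from the earlier preprocessing step and the range of candidate $k$ values is illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

def evaluate_cluster_counts(reduced, k_range=range(2, 9)):
    """Compute SSE (elbow), silhouette, and Calinski-Harabasz scores for each candidate k."""
    results = {}
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(reduced)
        labels = km.labels_
        results[k] = {
            "sse": km.inertia_,                                    # within-cluster sum of squared errors
            "silhouette": silhouette_score(reduced, labels),       # cohesion vs. separation, in [-1, 1]
            "calinski_harabasz": calinski_harabasz_score(reduced, labels),
        }
    return results
```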
However, considering practical constraints, excessively low dimensionality reduction may lead to an insufficient data volume in some clusters. Therefore, we select a configuration of 2-dimensional reduction with 4 clusters, which significantly mitigates the sample distribution imbalance issue [
31]. The resulting clustering outcomes are visually presented in
Figure 3, where the
x-axis represents the first principal component, and the
y-axis represents the second principal component. Principal components are linear combinations of all original features. The first principal component (PC1) indicates the direction that maximizes data variance, while the second principal component (PC2) is orthogonal to PC1 and explains the maximum remaining variance.
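A scatter plot of the two principal components, colored by cluster assignment, reproduces the kind of view shown in Figure 3; this matplotlib sketch assumes the `reduced` array and `labels` from the clustering sketch above.

```python
import matplotlib.pyplot as plt

def plot_clusters(reduced, labels):
    """Visualize the four clusters in the 2-D principal-component space (PC1 vs. PC2)."""
    scatter = plt.scatter(reduced[:, 0], reduced[:, 1], c=labels, cmap="viridis", s=10)
    plt.xlabel("PC1 (direction of maximum variance)")
    plt.ylabel("PC2 (orthogonal, maximum remaining variance)")
    plt.legend(*scatter.legend_elements(), title="Cluster")
    plt.tight_layout()
    plt.show()
```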
We employ heatmap visualization to analyze the clustered data, examining the relationships between individual metrics in CVSS vector strings and the multidimensional CVSS scores, as well as the magnitude of their influence. The heatmap illustrating the clustering analysis is depicted in
Figure 4.
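One way to produce such a heatmap is to correlate numerically encoded CVSS metrics with the scores within each cluster; in the seaborn/pandas sketch below, the data-frame column names and layout are assumptions about how the clustered records are organized.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_metric_heatmap(df, cluster_id):
    """Correlation heatmap between CVSS vector-string metrics and scores for one cluster.

    `df` is assumed to hold numerically encoded metric columns (AV, AC, PR, UI, S, C, I, A),
    a `base_score` column, and a `cluster` label column.
    """
    subset = df[df["cluster"] == cluster_id]
    corr = subset[["AV", "AC", "PR", "UI", "S", "C", "I", "A", "base_score"]].corr()
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
    plt.title(f"Cluster {cluster_id}: metric vs. score relationships")
    plt.tight_layout()
    plt.show()
```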
Consequently, the vulnerability severity assessment method proposed in this paper subdivides CVSS score prediction into eight distinct subproblems, each corresponding to an individual metric within the CVSS vector string [32]. By predicting the outcome for each metric, the approach synthesizes the predicted CVSS vector string, from which the final CVSS scores are calculated. This research designs a BERT-based deep learning model for CVSS score prediction. The model separately trains and predicts each metric within the CVSS vector string, aggregates the results, and consequently computes the CVSS scores for a given vulnerability instance.
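To illustrate the final aggregation step, the sketch below reassembles eight predicted metric labels into a CVSS v3.1 vector string and computes the base score with the published v3.1 equations; the dictionary-based input format is an assumed interface, not the authors' own.

```python
import math

# Official CVSS v3.1 base metric weights.
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.20},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {"U": {"N": 0.85, "L": 0.62, "H": 0.27},   # Scope Unchanged
           "C": {"N": 0.85, "L": 0.68, "H": 0.50}},  # Scope Changed
    "UI": {"N": 0.85, "R": 0.62},
    "CIA": {"H": 0.56, "L": 0.22, "N": 0.0},
}

def roundup(x):
    """Approximates the CVSS v3.1 Roundup: smallest value with one decimal place >= x."""
    return math.ceil(x * 10) / 10

def cvss31_base_score(m):
    """m maps the eight base metrics, e.g. {'AV':'N','AC':'L','PR':'N','UI':'N','S':'U','C':'H','I':'H','A':'H'}."""
    iss = 1 - (1 - WEIGHTS["CIA"][m["C"]]) * (1 - WEIGHTS["CIA"][m["I"]]) * (1 - WEIGHTS["CIA"][m["A"]])
    if m["S"] == "U":
        impact = 6.42 * iss
    else:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    exploitability = (8.22 * WEIGHTS["AV"][m["AV"]] * WEIGHTS["AC"][m["AC"]]
                      * WEIGHTS["PR"][m["S"]][m["PR"]] * WEIGHTS["UI"][m["UI"]])
    if impact <= 0:
        return 0.0
    if m["S"] == "U":
        return roundup(min(impact + exploitability, 10))
    return roundup(min(1.08 * (impact + exploitability), 10))

def assemble_vector(m):
    """Concatenate the eight predicted metric labels into a CVSS v3.1 vector string."""
    return "CVSS:3.1/" + "/".join(f"{k}:{m[k]}" for k in ("AV", "AC", "PR", "UI", "S", "C", "I", "A"))

predicted = {"AV": "N", "AC": "L", "PR": "N", "UI": "N", "S": "U", "C": "H", "I": "H", "A": "H"}
print(assemble_vector(predicted), cvss31_base_score(predicted))  # -> ... 9.8
```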
3.3. Transformer Adaptive Latent Space Feature Fusion
We propose LSNet, which integrates the self-attention mechanism of the Transformer method to capture global feature relationships within sequences, subsequently incorporating residual networks, layer normalization, and feed-forward networks to enhance model performance and training efficiency. For the transformer-encoded vulnerability text data, the hidden states of the input sequence $H \in \mathbb{R}^{L \times d}$ (where $L$ represents the sequence length and $d$ denotes the hidden layer dimension) undergo linear transformations to generate query ($Q$), key ($K$), and value ($V$) matrices as follows:
$$Q_i = H W_i^{Q}, \qquad K_i = H W_i^{K}, \qquad V_i = H W_i^{V}$$
where $Q_i$, $K_i$, and $V_i$, respectively, represent the query, key, and value matrices for the $i$-th head, and $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ denote learnable weight matrices. We then use the following equation:
$$\text{head}_i = \operatorname{Attention}(Q_i, K_i, V_i) = \operatorname{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i$$
where $d_k$ signifies the dimension of the key vector. Subsequently, we concatenate the attention scores $\text{head}_1, \ldots, \text{head}_h$ obtained from the $h$ attention heads, and then multiply this concatenated result by the output projection weight matrix $W^{O}$ to obtain the matrix $Z$ [21]:
$$Z = \operatorname{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^{O}$$
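A compact PyTorch rendering of this multi-head self-attention step is sketched below; dimensions and head count are assumptions, and it is not claimed to match LSNet's exact implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Computes Z = Concat(head_1..head_h) W^O over hidden states H of shape (batch, L, d)."""
    def __init__(self, d_model=768, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.w_q = nn.Linear(d_model, d_model)   # W^Q
        self.w_k = nn.Linear(d_model, d_model)   # W^K
        self.w_v = nn.Linear(d_model, d_model)   # W^V
        self.w_o = nn.Linear(d_model, d_model)   # output projection W^O

    def forward(self, h):
        batch, seq_len, _ = h.shape
        # Linear projections, then split into heads: (batch, heads, L, d_k).
        q = self.w_q(h).view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        k = self.w_k(h).view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        v = self.w_v(h).view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention per head.
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        heads = torch.matmul(F.softmax(scores, dim=-1), v)
        # Concatenate heads and apply the output projection to obtain Z.
        z = heads.transpose(1, 2).contiguous().view(batch, seq_len, -1)
        return self.w_o(z)
```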
Then, TextCNN is used for feature extraction, where the textual description of a vulnerability is first passed through the convolutional layer to filter individual word vectors. The resulting vectors are then utilized as inputs to TextCNN. Using four adaptively sized convolutional kernels, the model extracts feature maps by performing convolution operations on the text sequence [26]. The resulting feature map sequences, denoted as $C$, incorporate the clustering vector association weights $b_c$ obtained from Section 3.1 as the convolutional bias, as illustrated below:
$$C = \bigl[c_1;\, c_2;\, \ldots;\, c_{L-h+1}\bigr], \qquad c_j = f\bigl(W_{conv} * Z_{j:j+h-1} + b_c\bigr)$$
where $[\,\cdot\,;\,\cdot\,]$ denotes vertical stacking, $*$ represents the convolution operation, and $W_{conv}$ is the learnable convolution kernel, determined by the optimizer. Leveraging TextCNN, we propose multi-channel, multi-kernel convolution operations on the feature representation vectors generated by the Transformer encoder. This process extracts local commonality features across varying granularities of textual words and syntactic structures. Subsequently, max pooling is applied to achieve feature dimensionality reduction.
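The corresponding TextCNN stage could be sketched as follows; the four kernel sizes and the way the clustering association weight enters as a fixed convolutional bias are illustrative assumptions consistent with the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClusterBiasedTextCNN(nn.Module):
    """Multi-kernel 1-D convolutions over Transformer outputs, biased by clustering weights."""
    def __init__(self, d_model=768, num_filters=128, kernel_sizes=(2, 3, 4, 5), cluster_weight=None):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(d_model, num_filters, kernel_size=k) for k in kernel_sizes]
        )
        if cluster_weight is not None:
            # Inject the Section 3.1 clustering association weight (a scalar here) as the bias.
            for conv in self.convs:
                conv.bias.data.fill_(float(cluster_weight))

    def forward(self, z):
        # z: Transformer-encoded sequence, shape (batch, L, d_model) -> (batch, d_model, L).
        z = z.transpose(1, 2)
        # Convolve with each kernel size, then max-pool over the sequence dimension.
        pooled = [F.relu(conv(z)).max(dim=2).values for conv in self.convs]
        # Vertically stack (concatenate) the pooled feature maps into one vector per sample.
        return torch.cat(pooled, dim=1)   # shape (batch, num_filters * len(kernel_sizes))
```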
The pooled feature vectors are fed into the representation integration layer for classification operations and undergo a nonlinear transformation via an adaptive activation function. The concatenated feature vector $z$ is combined with the fully connected layer's weight matrix $W_d$ and bias term $b_d$:
$$F = \sigma\bigl(W_d\, z + b_d\bigr)$$
where $\sigma(\cdot)$ is the adaptive activation function, $W_d$ denotes the weight matrix of the representation integration layer, and $b_d$ represents the bias term that dynamically adjusts via adaptive attention mechanisms. The resulting $F$ constitutes the nonlinearly transformed string feature vector.
The feature distribution $F$, after max pooling in the feature fusion layer, is input into a fully connected layer for classification, transforming it into a feature probability vector $P$. This vector is normalized via the softmax function, enabling multi-class classification of each metric within the CVSS vector string to obtain the corresponding one-hot probabilities:
$$P_i = \operatorname{softmax}(o_i) = \frac{e^{o_i}}{\sum_{j=1}^{K} e^{o_j}}$$
where $o_i$ is the $i$-th component of the fully connected layer's output and $K$ is the number of classes for the metric.
In this process, LSNet outputs the label value for each CVSS vector string. It applies the softmax function to determine the class with the highest probability, assigns a value of 1 to that class, and sets all other classes to 0, thereby obtaining the definitive label value for each CVSS vector string.
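A per-metric classification head matching this description might look like the sketch below; treating each of the eight CVSS metrics as its own small softmax classifier follows the subproblem decomposition of Section 3.1, while the feature dimension and class count are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetricClassificationHead(nn.Module):
    """Fully connected layer + softmax producing a one-hot label for a single CVSS metric."""
    def __init__(self, feature_dim=512, num_classes=4):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_classes)     # weight matrix W_d and bias b_d

    def forward(self, fused_features):
        logits = self.fc(fused_features)                   # unnormalized feature vector
        probs = F.softmax(logits, dim=-1)                  # normalized class probabilities
        # Assign 1 to the most probable class and 0 to the rest (definitive label value).
        one_hot = F.one_hot(probs.argmax(dim=-1), num_classes=logits.size(-1)).float()
        return probs, one_hot
```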
During experimental training, the model uses cross entropy as the loss function to quantify the discrepancy between predictions and ground truth labels. The Adam optimizer utilizes backpropagation to iteratively refine training parameters. This process progressively reduces classification loss while enhancing the model’s classification accuracy. The specific calculation for the cross entropy loss function is as follows:
$$\mathcal{L} = -\sum_{i=1}^{N} y_i \log \hat{p}_i$$
where $y_i$ represents the ground truth (GT) label, $N$ denotes the total number of entries in the CVSS vector string, and $\hat{p}_i$ represents the model-predicted probabilities.
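A minimal training loop implementing this objective is sketched below; the batch shapes, learning rate, and data loader are assumptions, and PyTorch's `CrossEntropyLoss` folds the softmax into the loss, so the model is fed raw logits here.

```python
import torch
import torch.nn as nn

def train_metric_classifier(model, train_loader, num_epochs=10, lr=1e-4, device="cpu"):
    """Train one per-metric classifier with cross-entropy loss and the Adam optimizer."""
    model.to(device)
    criterion = nn.CrossEntropyLoss()                        # discrepancy between predictions and GT labels
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(num_epochs):
        total_loss = 0.0
        for features, labels in train_loader:                # labels: class index of the metric value
            features, labels = features.to(device), labels.to(device)
            optimizer.zero_grad()
            logits = model(features)                         # raw scores; softmax handled by the loss
            loss = criterion(logits, labels)
            loss.backward()                                  # backpropagation
            optimizer.step()                                 # Adam parameter update
            total_loss += loss.item()
        print(f"epoch {epoch + 1}: mean loss = {total_loss / len(train_loader):.4f}")
```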