3.1. Task Description
Given a sentence in the field of cybersecurity containing n words, it can be expressed as $X = \{x_1, x_2, \ldots, x_k, \ldots, x_n\}$, as detailed in Equation (2), where $x_k$ represents any word token in the sentence X and n is the length of the sentence X, 1 ≤ k ≤ n.
In this paper, the named entity recognition (NER) task is formulated as a sequence labeling problem, where each token in the input sentence is assigned a specific label indicating its role in forming a named entity. We adopt the widely used BIO (Begin, Inside, Outside) tagging scheme to encode entity boundaries and types. Specifically, the tag B-X denotes the beginning token of an entity of type X, I-X indicates a token that is inside the same entity, and O marks a token that does not belong to any named entity. Each word token in the sentence X is assigned a label, and the elements of the label sequence Y correspond to the labels of the words at the same positions. A set of contiguous meaningful tokens constitutes an entity, also referred to as a continuous multi-token entity. Therefore, the goal of this paper is to learn a parameterized mapping from input words to output labels.
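For illustration, the following minimal sketch shows how a cybersecurity sentence and its BIO label sequence line up under this formulation; the sentence and the entity type names (ORG, MAL) are assumed for the example and are not taken from the paper's label set.

```python
# Hypothetical cybersecurity sentence with BIO labels for sequence labeling.
# Entity type names (ORG, MAL) are illustrative only.
X = ["Lazarus", "Group", "deployed", "the", "WannaCry", "ransomware", "yesterday"]
Y = ["B-ORG",   "I-ORG",  "O",        "O",   "B-MAL",    "I-MAL",      "O"]

# "Lazarus Group" and "WannaCry ransomware" are continuous multi-token entities:
# each starts with a B-* tag and continues with I-* tags of the same type.
for token, label in zip(X, Y):
    print(f"{token:12s} {label}")
```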
3.3. Segment Masking Enhancement Mechanism (SME)
The SME mechanism is designed to improve the ability of the SSNER to extract multi-token entities, as described in Section 2.3.1. This method involves three main steps: generating random contiguous segments, applying MASK intervention, and SecureBERT_SME model training. The specific details are as follows.
- (1) Generate random continuous segments
Before intervening with the MASK mechanism, the SSNER first generates random continuous segments to be used as input for BERT pre-training. The description is as follows.
Step 1: For any dataset D, the occurrence probabilities for entities of varying lengths in the texts are calculated using the statistical function $S(D) = \{e_1, e_2, \ldots, e_n\}$, as detailed in Equation (3), where S(·) represents the operation of calculating the distribution probability of all entity lengths, n is the length of the longest entity in dataset D, and $e_k$ is the probability of an entity of length k appearing, 1 ≤ k ≤ n.
Step 2: Following the generation of an activation threshold u in the range [0, 1), the SSNER employs the Inverse Transform Sampling (ITS) method to calculate the cumulative distribution function $F(k) = \sum_{j=1}^{k} e_j$ as defined in Equation (4). When u satisfies the condition $F(masklen - 1) \le u < F(masklen)$ given in Equation (5), the value of masklen can be determined accordingly, where F(k) represents the cumulative probability of each data point and u represents a random number in the range [0, 1), which is used in Equation (5) to generate the random number masklen that satisfies the probability distribution of the entity lengths. The masklen is the length of the segmentation and F(0) = 0.
Step 3: The input set X is recursively partitioned into segments of length masklen to obtain the target set Z, as defined in Equation (6). Then the segments in Z are placed into the set Mask List using Equation (7), where $x_i$ is each token of the input text (1 ≤ i ≤ n), $z_i$ represents the segment after the input text is partitioned (1 ≤ i ≤ n − masklen + 1), and n is the number of input text tokens.
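As an illustration of Steps 1–3, the following minimal Python sketch samples masklen by inverse transform sampling over the empirical entity-length distribution (Equations (3)–(5)) and then enumerates the contiguous candidate segments (Equations (6) and (7)); the function names and the toy distribution are assumptions rather than the paper's implementation.

```python
import random
from itertools import accumulate

def sample_masklen(length_probs):
    """Inverse transform sampling: draw u in [0, 1) and return the smallest k
    with F(k - 1) <= u < F(k), where F is the cumulative distribution."""
    cdf = list(accumulate(length_probs))      # F(1), F(2), ..., F(n)
    u = random.random()                       # activation threshold u in [0, 1)
    for k, f_k in enumerate(cdf, start=1):
        if u < f_k:
            return k
    return len(length_probs)                  # guard against floating-point rounding

def generate_segments(tokens, masklen):
    """All contiguous candidate segments of length masklen:
    z_i = (x_i, ..., x_{i+masklen-1}), for 1 <= i <= n - masklen + 1."""
    n = len(tokens)
    return [tokens[i:i + masklen] for i in range(n - masklen + 1)]

# Toy entity-length distribution e_1..e_4 (assumed values for illustration).
length_probs = [0.45, 0.30, 0.15, 0.10]
tokens = "R* Technology software company suffered APT attack in September".split()
masklen = sample_masklen(length_probs)
mask_list = generate_segments(tokens, masklen)
print(masklen, mask_list)
```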
- (2) MASK intervention
On the basis of the generated random segments, this paper masks the random segments with the intervention algorithm. The intervention algorithm flow is shown in Figure 5. In Figure 5, if the Mask List contains a single character, the masked token in the Mask List is replaced directly with the [MASK] symbol. If the Mask List contains a sequence of contiguous segments, the position of the first token in the segment is used as the starting position of this segment, and the masked segment in the Mask List is replaced directly with the sequential [MASK] … [MASK] symbols. Finally, the algorithm retains the same ratio of masked words as in BERT.
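A minimal sketch of this intervention, assuming whitespace-split tokens: single-token entries of the Mask List are masked directly, multi-token segments are masked as a whole starting from their first position, and an overall budget close to BERT's 15% masking ratio is respected. The selection policy and helper names are illustrative assumptions.

```python
import random

MASK_RATIO = 0.15  # keep the same overall masking ratio as BERT

def mask_intervention(tokens, mask_list):
    """Replace Mask List entries with [MASK] symbols.

    Single tokens are masked directly; contiguous segments are masked as a
    whole, starting at the position of their first token, while keeping the
    number of masked tokens close to MASK_RATIO * len(tokens)."""
    tokens = list(tokens)
    budget = max(1, int(len(tokens) * MASK_RATIO))
    masked = 0
    for entry in random.sample(mask_list, k=len(mask_list)):
        seg = [entry] if isinstance(entry, str) else list(entry)
        if masked and masked + len(seg) > budget:   # always mask at least one entry
            continue
        # Locate the starting position of the segment's first token.
        for start in range(len(tokens) - len(seg) + 1):
            if tokens[start:start + len(seg)] == seg:
                tokens[start:start + len(seg)] = ["[MASK]"] * len(seg)
                masked += len(seg)
                break
    return tokens

tokens = "R* Technology software company suffered APT attack in September".split()
print(mask_intervention(tokens, [["R*", "Technology", "software", "company"]]))
```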
- (3) SecureBERT_SME model training
After the MASK intervention, this paper pre-trains BERT on the CTI corpus with the modified mask mechanism to obtain the SecureBERT_SME model. The description is as follows.
Step 1: Data collection
A large amount of online text data related to cybersecurity is collected, including books, blogs, news articles, security reports, journals, and survey reports. A web-crawling tool is employed to create a corpus containing 10 million words. The corpus encompasses various forms of cybersecurity texts, ranging from basic information, news articles, Wikipedia entries, and tutorials to more advanced materials such as malware analysis, intrusion detection, and vulnerability assessment, as shown in Figure 6.
Step 2: Data preprocessing
By utilizing Equations (8) and (9), irrelevant information is removed, such as unrelated HTML tags and specially formatted text. The text data is tokenized using the WordPiece tokenizer in Equation (10), and different forms of words are normalized to ensure that the model comprehends various expressions of the same meaning.
where sub(·) stands for removing irrelevant elements in the text and wordpiece_tokenizer(·) represents the operation of splitting text into words or subwords.
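The cleaning and tokenization described here can be sketched as follows, assuming a regular-expression substitution as a stand-in for sub(·) in Equations (8) and (9) and the Hugging Face bert-base-uncased WordPiece tokenizer as a stand-in for wordpiece_tokenizer(·) in Equation (10); the exact patterns removed are assumptions.

```python
import re
from transformers import BertTokenizer  # WordPiece tokenizer

def clean_text(raw_html: str) -> str:
    """Rough analogue of sub(.): strip HTML tags, collapse whitespace, lowercase."""
    text = re.sub(r"<[^>]+>", " ", raw_html)   # remove unrelated HTML tags
    text = re.sub(r"\s+", " ", text).strip()   # drop leftover formatting residue
    return text.lower()                        # normalize different word forms

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

raw = "<p>The APT29 group exploited CVE-2021-44228 via spear-phishing.</p>"
tokens = tokenizer.tokenize(clean_text(raw))   # wordpiece_tokenizer(.)
print(tokens)
```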
Step 3: Fine-tuning
The traditional BERT model is used as the foundation for transfer learning. Specifically, NER is chosen as the target task, and BERT is fine-tuned based on the SME mechanism. During the training process, we adopt the Adam (Adaptive Moment Estimation) optimizer to update the model parameters [26]. Adam adaptively adjusts the learning rate for each parameter using estimates of the first and second moments of the gradients, which facilitates efficient convergence and stable optimization. Furthermore, to tackle the class imbalance in cybersecurity datasets, we adopt focal loss, which modulates the cross-entropy loss to focus on hard, misclassified examples, improving minority-class learning and generalization [27]. By jointly leveraging the Adam optimizer and focal loss and continuously monitoring the loss function and classification accuracy, the training process is guided toward the effective development of the SecureBERT_SME model.
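A minimal PyTorch sketch of a token-level focal loss, assuming logits of shape (N, C), integer class labels, γ = 2, and an ignore index for padding; these settings are common defaults rather than the paper's reported hyperparameters.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, labels, gamma=2.0, ignore_index=-100):
    """Focal loss: down-weights easy examples so training focuses on hard,
    misclassified (often minority-class) tokens.

    logits: (N, C) unnormalized scores; labels: (N,) class indices."""
    mask = labels != ignore_index
    logits, labels = logits[mask], labels[mask]
    ce = F.cross_entropy(logits, labels, reduction="none")  # -log p_t per token
    p_t = torch.exp(-ce)                                     # confidence in the true class
    return ((1.0 - p_t) ** gamma * ce).mean()

# Toy usage: 4 tokens, 3 label classes; the last token is padding.
logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, -100])
print(focal_loss(logits, labels))
```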
Based on the three steps described above, we have developed a comprehensive algorithmic framework that elucidates the logical relationships among these steps, as shown in the following Algorithm 1.
Algorithm 1: Segment Masking Enhancement Mechanism
Input: Pre-trained BERT model, annotated datasets
Output: SecureBERT_SME model for cybersecurity NER
1  Initialization: Split annotated dataset into train, validation, and test sets;
2  Initialize optimizer (Adam) and loss function (focal loss);
3  Set hyperparameters (learning rate, batch size, epochs, etc.);
4  for each epoch in epochs do
5      for each batch in train set do
6          Tokenize batch text into subwords;
7          Generate input IDs, attention masks, and token-type IDs;
8          Feed input data into BERT;
9          By utilizing Equations (3)–(5),
10             generate random continuous segments;
11         Apply the intervention algorithm to perform MASK intervention on the generated segments;
12         Compute predictions and calculate focal loss;
13         Backpropagate loss and update model weights;
14     end
15     Evaluate model on validation set using precision, recall, and F1-score;
16     Save best-performing model checkpoint;
17 end
18 Return retrained model
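The training loop in Algorithm 1 can be condensed into the following PyTorch sketch, assuming the SME masking of Equations (3)–(7) is already applied when batches are built and that each batch provides input_ids, attention_mask, token_type_ids, and labels; plain cross-entropy stands in for the focal loss so the sketch stays self-contained, and validation uses loss rather than precision/recall/F1.

```python
import torch
import torch.nn.functional as F
from torch.optim import Adam

def train_secure_bert_sme(model, train_loader, val_loader, epochs=3, lr=2e-5):
    """Condensed sketch of Algorithm 1 (not the paper's exact training code)."""
    optimizer = Adam(model.parameters(), lr=lr)
    best_score, best_state = float("-inf"), None

    def batch_loss(batch):
        logits = model(batch["input_ids"], batch["attention_mask"],
                       batch["token_type_ids"])               # (B, T, num_labels)
        return F.cross_entropy(logits.flatten(0, 1), batch["labels"].flatten(),
                               ignore_index=-100)

    for _ in range(epochs):
        model.train()
        for batch in train_loader:                            # SME-masked batches
            loss = batch_loss(batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        model.eval()                                          # validation pass
        with torch.no_grad():
            val_loss = sum(batch_loss(b).item() for b in val_loader)
        if -val_loss > best_score:                            # keep best checkpoint
            best_score = -val_loss
            best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
    if best_state is not None:
        model.load_state_dict(best_state)
    return model
```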
After BERT is retrained on the CTI corpus, the SecureBERT_SME model can generate segment-level word embeddings with CTI knowledge to solve the issue of the incomplete extraction of multi-token entities. In Figure 7, this paper gives an example to illustrate the effectiveness of the SME mechanism for the extraction of multi-token entities. X = “R* Technology software company suffered APT attack in September” is the input text and the positional embeddings of the sentence are [0, 1, 2, 3, 4, 5, 6, 7, 8]. The goal is to completely extract the entity “R* Technology software company”. By utilizing the SME mechanism, the SSNER generates contiguous segments with a mask length of masklen to extract segment-level word embeddings, so that the ability of the SSNER to recognize multi-token entities is enhanced. Furthermore, we systematically analyze the impact of different “masklen” values on the performance of the SSNER model (see Appendix A). As a result, by using the SME mechanism, the SSNER model can capture more segment information in the sentence, enhancing its ability to extract multi-token entities in intelligence texts.
3.4. Semantic Collaborative Embedding Mechanism (SCE)
The SCE mechanism is designed to address the issue of the low classification accuracy of polysemy entities, as described in Section 2.3.2. SCE involves two main steps: filtering similar word embeddings and constructing a similar semantic space. The specific details are as follows.
- (1) Filtering similar word embeddings
Before constructing the similar semantic space, SSNER first filters out the similar word embeddings for the keywords in threat intelligence text using the SSF algorithm.
Step 1: Use word2vec to train a word-embedding model on the threat intelligence datasets, generating word-embedding vectors as defined in Equation (11).
Step 2: By using the contextual semantic information in the Word2Vec model, SSNER filters out the first batch of similar words, namely the set M-SET of size m (m < n).
Step 3: While maintaining the accuracy of the contextual semantics, a fuzzy string-matching algorithm is used to compare the internal structure of the text and filter out the final similar-word set S-SET of size k (k < m) and the corresponding word-vector set, as shown in Equations (13) and (14).
Through the above steps, SSNER obtains similar words that retain both contextual similarity and structural similarity. The SSF algorithm flow is presented in Figure 8.
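The SSF filtering can be sketched as follows, assuming gensim's Word2Vec for the contextual-similarity step and difflib's SequenceMatcher as the fuzzy string matcher; the comparison criterion (surface-form similarity to the target word), the sizes m and k, and the ratio threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher
from gensim.models import Word2Vec

def ssf_filter(corpus_sentences, target_word, m=20, k=5, fuzz_threshold=0.3):
    """Two-stage similar-word filtering (SSF sketch).

    corpus_sentences: list of tokenized sentences from the threat intelligence text."""
    # Step 1: train a word-embedding model on the threat intelligence datasets.
    w2v = Word2Vec(sentences=corpus_sentences, vector_size=100,
                   window=5, min_count=1, workers=1, seed=0)
    # Step 2: first batch of m contextually similar words (M-SET).
    m_set = [w for w, _ in w2v.wv.most_similar(target_word, topn=m)]
    # Step 3: fuzzy string matching on the internal structure keeps the
    # final similar-word set S-SET of size at most k, plus its word vectors.
    scored = [(w, SequenceMatcher(None, target_word.lower(), w.lower()).ratio())
              for w in m_set]
    s_set = [w for w, r in sorted(scored, key=lambda x: -x[1])
             if r >= fuzz_threshold][:k]
    vectors = {w: w2v.wv[w] for w in s_set}
    return s_set, vectors
```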
- (2) Similar semantic space construction
Based on the filtered similar word embeddings, SSNER uses the self-attention mechanism to construct a similar semantic space for each input word. The description is as follows.
Firstly, by utilizing Equation (15), the semantic relevance weight between the target word and each similar word in the S-SET set is calculated. Then, the semantic relevance weights are normalized by Equation (16). Finally, the normalized results are accumulated by Equation (17), resulting in the similar semantic space embedding, where the mapping embedding of the target word and the mapping embedding of the jth similar word are used to compute the weights and k represents the number of final similar words.
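Equations (15)–(17) can be illustrated with the following NumPy sketch, which treats the target word embedding as the query and the embeddings of the filtered similar words as keys and values; the scaled dot product and softmax shown here are assumptions about the exact form of the relevance and normalization functions.

```python
import numpy as np

def similar_semantic_embedding(target_vec, similar_vecs):
    """Build the similar semantic space embedding for one target word.

    target_vec: (d,) embedding of the target word.
    similar_vecs: (k, d) embeddings of the k final similar words (S-SET)."""
    # Equation (15): semantic relevance weight between the target and each similar word.
    scores = similar_vecs @ target_vec / np.sqrt(target_vec.shape[0])
    # Equation (16): normalize the weights (softmax).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Equation (17): accumulate the weighted similar-word embeddings.
    return weights @ similar_vecs

# Toy usage with random vectors standing in for word2vec embeddings.
rng = np.random.default_rng(0)
print(similar_semantic_embedding(rng.normal(size=8), rng.normal(size=(4, 8))))
```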
After the similar semantic space construction is complete, the SSNER will construct similar semantic spaces for the words in the intelligence texts to solve the issue of the low classification accuracy for polysemy entities. In Figure 9, this paper gives an example to illustrate the effectiveness of the SCE mechanism for the classification of polysemy entities. X = “The group used a phishing technique to obtain Apple’s backend ID” is the input text and “Apple” is a polysemy entity that can refer to a device, an event, or a company. The goal is to accurately categorize the “Apple” entity in the “device (Dev)” category. The SSNER model first trains the word2vec model and filters out M similar words with similar meanings, which include “Microsoft, iPads, Nokia, designs, softether …”. Next, a fuzzy string-matching algorithm is used to filter out the final K similar words, which are “Nokia, iPads, Microsoft, Baitshop …”. Additionally, this paper conducts experimental evaluations on the value of K, as discussed in Section 5.1. Finally, the self-attention mechanism is employed to calculate the semantic relevance weight of each similar word using Equation (15). Then, normalization and accumulation are performed by Equations (16) and (17) to obtain the final similar semantic space embedding. The spatial structure of this embedding captures both semantic and structural similarity rather than a single semantic expression. In Figure 9, the similar semantic space embedding contains semantic information of both the “Dev” category and the “Org” category. Therefore, compared to the traditional single semantics, the similar semantic space greatly improves the probability of a polysemy entity being predicted correctly.
3.5. Integrated Semantic Construction
Based on the SME mechanism (Section 3.3) and the SCE mechanism (Section 3.4), this section introduces how the SSNER combines segment-level embedding and similar semantic embedding to generate the integrated semantic embedding. The specific details are as follows.
- (1) Mixed feature embedding construction
Before integration, the segment-level word embedding and the character-level word embedding are combined to generate a mixed feature embedding, and time sequence information is obtained through BiLSTM. The description is as follows.
Step 1: SSNER obtains the segment-level word embedding by using Equation (18), in which the SecureBERT_SME model calculates the segment-level word embedding from the input and d represents the dimension of the segment-level embedding vector.
Step 2: Due to the lack of index records for unknown words in the vocabulary of the BERT model, it becomes challenging for SSNER to generate the relevant word vectors. Therefore, by using a Character-Based CNN (CharCNN) in Equation (19), SSNER obtains the character-level word embedding, where CharCNN(·) represents the operation of calculating the character-level word embedding for the input token.
Step 3: By using Equation (20), the word-level embedding and the character-level embedding are concatenated, resulting in the word embedding. Then, by using Equation (21), the word embedding enters the BiLSTM to generate the mixed word embedding with time sequence information, where concat(·) is the vector concatenation operation and the two directional hidden states represent the state at the next moment and the state at the previous moment, respectively.
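A compact PyTorch sketch of Equations (19)–(21), assuming a small character CNN for the character-level embedding, concatenation with the segment-level embedding, and a BiLSTM for the time sequence information; all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class MixedFeatureEncoder(nn.Module):
    """CharCNN word embedding + concat with segment-level embedding + BiLSTM."""

    def __init__(self, n_chars=100, char_dim=32, char_out=64, seg_dim=768, hidden=256):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.char_cnn = nn.Conv1d(char_dim, char_out, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(seg_dim + char_out, hidden,
                              batch_first=True, bidirectional=True)

    def forward(self, seg_emb, char_ids):
        # seg_emb: (B, T, seg_dim) segment-level embeddings from SecureBERT_SME.
        # char_ids: (B, T, L) character ids for each token.
        B, T, L = char_ids.shape
        c = self.char_emb(char_ids).view(B * T, L, -1).transpose(1, 2)      # (B*T, char_dim, L)
        c = torch.relu(self.char_cnn(c)).max(dim=2).values.view(B, T, -1)   # Equation (19)
        mixed = torch.cat([seg_emb, c], dim=-1)                             # Equation (20)
        out, _ = self.bilstm(mixed)                                         # Equation (21)
        return out  # (B, T, 2*hidden) mixed word embedding with time sequence info

# Toy usage: batch of 2 sentences, 7 tokens, up to 12 characters per token.
enc = MixedFeatureEncoder()
print(enc(torch.randn(2, 7, 768), torch.randint(0, 100, (2, 7, 12))).shape)
```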
- (2) Integrated semantic construction
After the mixed feature embedding is constructed, the multi-head attention mechanism is used to construct the integrated semantics.
By using a multi-head attention mechanism, the similarity is computed between the mixed feature word embedding and the similar semantic word embedding. Then, the SSNER model learns multidimensional features from different representation subspaces and concatenates the results obtained from the different attention heads to yield an integrated semantic representation, as shown in Equation (23); the result finally enters the CRF layer to predict entity categories. Here, Q, K, and V are in vector form, and $\sqrt{d_k}$ is the scaling factor for adjusting the inner product of Q and K to prevent it from becoming too large. The similar semantic word embedding is defined by Equation (17).
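A minimal PyTorch sketch of this integration step, assuming torch.nn.MultiheadAttention with the mixed feature embedding as the query and the similar semantic embedding as the key and value; the embedding dimension, the number of heads, and the projection of both inputs to a shared dimension are assumptions, and the CRF layer is only indicated in a comment.

```python
import torch
import torch.nn as nn

class IntegratedSemantic(nn.Module):
    """Multi-head attention between the mixed feature embedding (query) and the
    similar semantic embedding (key/value), yielding the integrated representation."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, mixed_emb, sim_emb):
        # mixed_emb: (B, T, dim) from the BiLSTM; sim_emb: (B, T, dim) from Equation (17).
        # Scaled dot-product attention divides QK^T by sqrt(d_k) internally,
        # preventing the inner product from becoming too large.
        integrated, _ = self.mha(query=mixed_emb, key=sim_emb, value=sim_emb)
        return integrated  # fed to the CRF layer to predict entity categories

# Toy usage.
layer = IntegratedSemantic()
print(layer(torch.randn(2, 7, 512), torch.randn(2, 7, 512)).shape)
```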