1. Introduction
With the rapid development of internet technology, cybersecurity threats have become increasingly complex and diverse. Advanced Persistent Threats (APTs), ransomware, distributed denial-of-service (DDoS) attacks, and other emerging attack types pose serious risks to the critical infrastructures of enterprises and nations [
1,
2]. In this context, cyber threat intelligence (CTI) has been recognized as a key component of cyber defense and capability enhancement [
3]. CTI is intended to provide timely and actionable strategies by collecting, analyzing, and sharing dynamic threat information. Nevertheless, the accurate mining and analysis of high-value intelligence from large volumes of CTI reports remain significant challenges for cybersecurity professionals [
4].
The core of CTI knowledge mining lies in the accurate extraction and analysis of extensive information. This information includes attacker behaviors, vulnerabilities and exploits, defender actions, attack tools, and the organizational relationships behind adversaries. The effective management of such intelligence is essential for informed defense decision-making [
5]. Knowledge graphs describe concepts, entities, and inter-entity relations in a structured form that aligns more closely with human cognition [
6]. By leveraging knowledge graph-based information extraction techniques, large volumes of threat intelligence can be organized and interpreted more effectively. The primary tasks of information extraction are Named Entity Recognition (NER) and relation extraction (RE), whose objective is to identify entities and the relations expressed among them in text.
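As a minimal illustration of how the two tasks compose (the sentence, spans, and labels below are hypothetical examples, not from the paper's corpus), NER first identifies typed entity mentions and RE then labels the relation between an entity pair:

```python
# Minimal illustration of NER + RE composing into triple extraction.
# The sentence, entity mentions, and relation label are hypothetical.
sentence = "the lazarus group uses the wannacry ransomware"

# NER: identify typed entity mentions in the sentence
entities = [
    {"text": "lazarus group", "type": "THR"},   # threat actor
    {"text": "wannacry", "type": "MAL"},        # malware
]

# RE: classify the relation holding between a recognized entity pair
def extract_triples(entities, relation):
    """Pair the recognized entities with the predicted relation label."""
    head, tail = entities[0], entities[1]
    return [(head["text"], relation, tail["text"])]

triples = extract_triples(entities, "Uses")
print(triples)  # [('lazarus group', 'Uses', 'wannacry')]
```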
In the early stage of development, entity and relation extraction were primarily based on expert-crafted pattern-matching rules [
7] and statistical machine learning methods [
8]. However, these approaches depended on hand-engineered rules and features, which limited domain adaptability. With advances in artificial intelligence, rapid progress in Natural Language Processing (NLP) and deep learning has substantially improved entity and relation extraction. Pipeline architectures were adopted, in which deep models combining Recurrent Neural Networks (RNNs) with Conditional Random Fields (CRFs) were first proposed for NER [
9]. These models captured contextual dependencies more effectively and improved NER accuracy.
After NER, attention-based Graph Neural Network (GNN) models were introduced for relation extraction [
10]. The full dependency tree was used as input. Soft pruning was applied to automatically learn and retain informative substructures relevant to the relation extraction task. However, pipeline extraction suffers from error propagation: mistakes in upstream NER directly degrade downstream relation extraction accuracy.
To mitigate error propagation, information extraction has recently been modeled as a sequence-to-sequence (Seq2Seq) joint entity–relation extraction problem [
11,
12]. End-to-end attention mechanisms were introduced to capture word-level dependencies. Outputs were generated as sequences of tokens. However, performance degraded on complex sentences and long documents, and triples with overlapping entities were handled poorly.
To address the shortcomings of prior methods, a joint entity–relation extraction approach for cyber threat intelligence is proposed. The task is reformulated as an ensemble-prediction problem within a parallel framework. A joint network that combines BERT (Bidirectional Encoder Representations from Transformers) [
13] and BiGRU (Bidirectional Gated Recurrent Unit) [
14] is constructed to capture deep contextual features in sentences. An ensemble prediction module and an entity–relation triple representation are designed for joint extraction. A non-autoregressive decoder is employed to generate sets of relation triples in parallel, thereby avoiding unnecessary ordering constraints during sequence generation.
The main contributions of this work are summarized below:
A joint entity–relation extraction model for CTI is proposed. CTI triples are generated in parallel to mitigate the sequential-dependency issue inherent to sequence-labeling methods.
A cost-effective CTI dataset, SecCti, is constructed. ChatGPT’s small-sample capability is leveraged via Q&A prompts for data labeling and augmentation, and all annotations are verified by security professionals, substantially reducing manual labeling costs.
Improved robustness to overlapping-entity triples is demonstrated. An F1 score of 89.1% is achieved in the corresponding evaluations.
2. Related Work
Information extraction from CTI reports primarily involves Named Entity Recognition (NER) and relation extraction (RE). Deep learning–based approaches are typically categorized into pipeline extraction and joint extraction.
In pipeline extraction, NER is performed first, followed by RE based on the recognized entities. The performance of RE is therefore highly dependent on NER quality. Gao et al. [
15] proposed an NER model that combines a data-driven deep learning approach with a knowledge-driven dictionary method to construct dictionary features, improving the recognition of rare entities and complex words. Wang et al. [
16] developed a neural unit, GARU, to integrate features from Graph Neural Networks (GNNs) and Recurrent Neural Networks (RNNs), alleviating fuzzy interactions between heterogeneous networks. To address data sparsity in CTI, Liu et al. [
17] proposed a semantic enhancement method that encodes and aggregates compositional, morphological, and lexical features of input tokens to retrieve semantically similar words from a cybersecurity corpus. Sarhan et al. [
18] devised a CTI knowledge-graph framework in which NER models are first trained to recognize entities, and relation triples generated by Open Information Extraction (OIE) are subsequently labeled. Although these methods perform well on standalone NER, RE performance remains constrained by NER quality, leading to error propagation.
Joint extraction aims to identify both entities and relations directly from CTI text, improving accuracy and efficiency in cyber threat monitoring and response. Because relation triples within a sentence may share overlapping entities, several joint models have been proposed. Yuan et al. [
19] introduced a Relation-Specific Attention Network (RSAN) to jointly extract relations in sentences. RSAN combines relation-based attention with a gating mechanism over a Bidirectional Long Short-Term Memory (BiLSTM) network to produce relation-guided sentence representations, reducing redundant operations observed in pipeline methods. Building on relation-based attention, Guo et al. [
11] formulated CTI entity–relation joint extraction as a multi-sequence labeling problem. Text features were extracted with a Bidirectional Gated Recurrent Unit (BiGRU), and decoding was performed using BiGRU and a Conditional Random Field (CRF). Zuo et al. [
20] developed an end-to-end sequential labeling model based on BERT-att-BiLSTM-CRF for joint extraction, and knowledge triples were finally obtained using entity–relation matching rules. Ahmed et al. [
12] improved upon Zuo et al. [
20] by employing an attention-based RoBERTa-BiGRU-CRF model for sequential labeling, mitigating limitations of classical pipeline techniques. Despite these advances, traditional sequence-labeling approaches still handle overlapping entities poorly because of the ordering constraints imposed by sequential decoding.
Recently, the non-autoregressive decoder (NAD) has been widely explored for natural language tasks [
21]. Sui et al. [
22] formulated joint entity–relation extraction as a set-prediction problem, relieving the model of the sequential burden of predicting multiple triples. Yu et al. [
23] applied an NAD to the Aspect Sentiment Triplet Extraction (ASTE) task. The encoding layer used MPNet (proposed by Microsoft), and dependencies among predicted tokens were modeled via Permutation Language Modeling (PLM). Combined with the non-autoregressive decoder, this design achieved state-of-the-art results on ASTE. Peng et al. [
24] proposed an end-to-end model with a dual ensemble-prediction network to decode triples (termed “ternary groups” in their work). Entity pairs and relations were decoded sequentially, and interactions were strengthened through parameter sharing across the two ensemble-prediction networks. Additionally, entity filters were designed to tighten subject–object connections and to discard triples with low subject/object confidence.
3. Materials and Methods
In this study, unstructured cybersecurity threat intelligence in PDF format was collected from relevant sources. The data were cleaned and preprocessed to obtain sentence-level text. Relation triples present in the sentences were then manually annotated to serve as inputs to the model. The joint entity–relation extraction model consists of two core modules. First, sentences are encoded with a joint BERT–BiGRU encoder to capture sequence dependencies and to provide rich contextual information for the non-autoregressive decoder (NAD). Second, the NAD is used to generate predicted triples in parallel. The model is optimized with a bipartite matching loss, ensuring the best alignment between predicted triples and gold triples.
The overall architecture is shown in Figure 1. A sentence is denoted by $S = \{w_1, w_2, \dots, w_l\}$ with length $l$ (including [CLS] and [SEP]). The set of relation triples is defined as $T = \{(h_i, r_i, t_i)\}_{i=1}^{m}$, where $m$ is the number of triples, and $E$ and $R$ denote the entity and relation sets, respectively ($h_i, t_i \in E$, $r_i \in R$). Examples are provided in Table 1. Two cases are considered. In the ordinary case, no entity overlap occurs within a triple. In the overlapped case, the sentence is more complex and contains multiple overlapping entities. For the encoder, $X = \{x_1, x_2, \dots, x_l\}$ is the BERT embedding of $S$. The vector $H = \{h_1, h_2, \dots, h_l\}$ is the contextual representation after a BiGRU layer, and $H_e$ is the encoder output. Learnable embeddings $Q \in \mathbb{R}^{N \times d}$ are used to initialize the decoder, where $N$ is the number of learnable embeddings and $d$ is the BERT embedding dimension. The decoder output is $H_d$. The final prediction of relation triples is denoted by $\hat{T}$. The objective of the joint entity–relation extraction model is to recover the set $T$ of relation triples present in $S$.
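The input–output contract described above can be sketched as a small data model (names and the example sentence are illustrative, not the authors' code; APT28 and X-Agent are used as a well-known actor–tool pair):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    """One relation triple (head entity, relation, tail entity)."""
    head: str
    relation: str
    tail: str

# A sentence S and its gold triple set T (overlapped case: "apt28"
# appears in two triples, so prediction must allow shared entities).
S = "apt28 uses x-agent and targets government networks"
T = {
    Triple("apt28", "Uses", "x-agent"),
    Triple("apt28", "Target", "government networks"),
}

# The model's goal: predict a set T_hat matching T regardless of order.
T_hat = {
    Triple("apt28", "Target", "government networks"),
    Triple("apt28", "Uses", "x-agent"),
}
assert T_hat == T  # set equality ignores generation order
```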
3.1. Entity Relationship Category Design
A relatively mature system of core concepts for cyber threat intelligence (CTI) and security elements has been established [
25,
26,
27,
28]. Based on this body of knowledge and the CTI corpus collected in this study, eleven entity categories were defined: operating system (OS), malware (MAL), tool (TOO), threat actor (THR), campaign (CAM), vulnerability (VUL), mitigation (MIT), attack technique (ATT), group (GRO), consequence (CON), and organization (ORG), as explained in
Table 2. The following relations were considered: Target, Attributed_to, Mitigates, HasVulnerability, Uses, Includes, Exploits, Associates, Causes, and Alias_of. In total, thirty-two refined entity–relation categories were obtained by enumerating valid entity-pair types, as shown in
Table 3.
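For reference, the category inventory above can be encoded directly (a sketch of the label sets only; the thirty-two refined entity-pair enumerations of Table 3 are omitted):

```python
# Entity and relation categories as defined in Section 3.1.
ENTITY_TYPES = {
    "OS": "operating system", "MAL": "malware", "TOO": "tool",
    "THR": "threat actor", "CAM": "campaign", "VUL": "vulnerability",
    "MIT": "mitigation", "ATT": "attack technique", "GRO": "group",
    "CON": "consequence", "ORG": "organization",
}

RELATION_TYPES = [
    "Target", "Attributed_to", "Mitigates", "HasVulnerability", "Uses",
    "Includes", "Exploits", "Associates", "Causes", "Alias_of",
]

assert len(ENTITY_TYPES) == 11   # eleven entity categories
assert len(RELATION_TYPES) == 10  # ten relation categories
```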
3.2. Data Enhancement
The scarcity of annotated data, compounded by the large number of relation categories, poses a distinct challenge. To address it, the few-shot learning capabilities of large language models were exploited: ChatGPT (GPT-4o) [
29] was used to implement a structured data augmentation process [
30,
31,
32].
Initial Dataset: We started with 1051 basic samples.
Data augmentation Process: During data augmentation, an instruction-based rewriting process was constructed around each base sample to increase corpus diversity and robustness. For each sentence, a refined prompt was injected: “Original sentence. Please provide six additional sentences on the same theme that remain similar to the original. The generated sentences must be in English. You may expand content and internal relations using Wikipedia and verified facts, and add new content related to the theme. Entities and relations in the rewritten sentences should not change substantially. All generated sentences must be in lowercase.”
To balance efficiency and quality, an OpenAI-compatible API with the DeepSeek model was initially employed for batch generation to achieve high throughput. However, the outputs did not meet downstream requirements. Considering the cost–quality trade-off, a switch was made to the web-based GPT-4o for manual, interactive rewriting and quality control. Instructions were entered sequentially. Low-level specifications (all-lowercase text and syntactic fluency) were verified, consistency of entities and relations was validated, and newly introduced facts were fact-checked to remove semantic drift and hallucinated samples.
This procedure markedly improved semantic fidelity and trainability while maintaining controllable generation speed. As a result, enhanced corpora with a higher signal-to-noise ratio were produced for subsequent model training. Representative generated sentences are shown in
Table 4.
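The rewriting loop can be sketched as follows. The `generate` function is a placeholder for whichever chat model is called (the paper used GPT-4o interactively), and the prompt wording abbreviates the template above:

```python
# Hypothetical sketch of the prompt-based augmentation loop.
PROMPT_TEMPLATE = (
    "{sentence} Please provide six additional sentences on the same theme "
    "that remain similar to the original. The generated sentences must be "
    "in English. Entities and relations in the rewritten sentences should "
    "not change substantially. All generated sentences must be in lowercase."
)

def generate(prompt: str) -> list[str]:
    """Placeholder for a chat-model call; returns candidate rewrites."""
    # A real implementation would call the model API here.
    return ["hypothetical rewrite of: " + prompt[:40]]

def augment(base_samples: list[str]) -> list[str]:
    """Build one prompt per base sentence and collect candidate rewrites."""
    augmented = []
    for sentence in base_samples:
        prompt = PROMPT_TEMPLATE.format(sentence=sentence)
        augmented.extend(generate(prompt))
    return augmented

candidates = augment(["apt28 uses x-agent."])
assert len(candidates) == 1
```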
Quality Assurance: For quality assurance, a semi-automated quality-control scheme was adopted under resource constraints and in the absence of external security reviewers. The scheme was centered on manual verification by researchers with cybersecurity expertise. Each generated sample was comprehensively reviewed and corrected to ensure syntactic fluency, clarity of meaning, and factual accuracy.
Because the dataset primarily consists of text and entity–relation structures, traditional semantic-similarity measures (e.g., cosine similarity) were deemed inapplicable. An automated screening strategy based on “entity–relation preservation” was therefore developed. A predefined, manually annotated entity–relation dataset was used as the baseline. Normalized definitions of relations were provided to the model, and entity–relation annotations were applied to newly generated sentences. The annotated relations of generated sentences were then compared one-to-one with those of the original samples.
If core entities and their relations remained unchanged, a sample was deemed semantically consistent and retained. If relation drift or entity mismatch was detected, the sample was flagged as non-compliant and excluded. Structural consistency served as the primary quality metric. Manual review supplemented this metric to correct potential model-annotation errors and factual issues, thereby ensuring semantic fidelity and factual reliability without relying on traditional similarity metrics.
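The "entity–relation preservation" screen described above reduces to a set comparison between the triples annotated on the original sentence and those annotated on each rewrite. A sketch with hypothetical annotation tuples:

```python
def preserves_relations(original_triples: set, generated_triples: set) -> bool:
    """Keep a generated sample only if the core triples survive the rewrite.

    Both arguments are sets of (head, relation, tail) tuples produced by
    the annotation step described above.
    """
    # Relation drift or entity mismatch -> the rewrite is discarded.
    return original_triples <= generated_triples

gold = {("apt28", "Uses", "x-agent")}
# Retained: core triple preserved, extra content allowed.
kept = preserves_relations(gold, {("apt28", "Uses", "x-agent"),
                                  ("apt28", "Target", "embassies")})
# Discarded: relation drifted from Uses to Exploits.
dropped = preserves_relations(gold, {("apt28", "Exploits", "x-agent")})
assert kept and not dropped
```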
Final Dataset: After enhancement and filtering, we obtained a total of 4741 samples, including the original 1051 samples and 3690 generated samples.
3.3. Coding Layer Design for Joint BERT and BiGRU
The primary objective of this layer is to extract context-aware information for each token in a sentence. A Transformer-based bidirectional encoder, BERT, is employed to produce word embeddings [
13]. To incorporate threat-intelligence-specific domain knowledge, the embeddings from the BERT layer are further processed by a BiGRU to obtain deep contextual features of the sentence [
14].
Given an input sentence $S = \{w_1, w_2, \dots, w_l\}$, where $w_t$ is the t-th word, the pre-trained BERT model is loaded to generate context-aware word embeddings $X = \{x_1, x_2, \dots, x_l\}$, $X \in \mathbb{R}^{l \times d}$, where $l$ is the length of the sentence (including [CLS] and [SEP]) and $d$ is the number of hidden units of the BERT model. The specific formula is shown in (1) below:

$$X = \mathrm{BERT}(S) \quad (1)$$
To improve model generalization, a BiGRU was employed to extract deep context-aware features for each token. BiGRU extends GRU and operates as a bidirectional Recurrent Neural Network. It is designed to mitigate the long-range dependency limitations of traditional RNNs.
A BiGRU comprises two GRU directions. The forward GRU processes the input from left to right, and the backward GRU processes the input from right to left. Using a BiGRU rather than a unidirectional GRU enables the simultaneous capture of past and future context at each time step, thereby improving sequence feature modeling efficiency. Compared with LSTM, GRU uses two gates—the update gate and the reset gate. This design reduces the number of parameters and improves computational efficiency.
The word embedding matrix $X$ output from the BERT layer is used as the input to the BiGRU network, and the semantic features of the sentence are further extracted to obtain new sequence vectors $H = \{h_1, h_2, \dots, h_l\}$, where the output of the BiGRU is the concatenation of the hidden states of the forward and backward GRUs, capturing the contextual information of the sequence more comprehensively. The formulas of the BiGRU are shown in (2)–(4) below:

$$\overrightarrow{h_t} = \mathrm{GRU}\big(x_t, \overrightarrow{h}_{t-1}\big) \quad (2)$$
$$\overleftarrow{h_t} = \mathrm{GRU}\big(x_t, \overleftarrow{h}_{t+1}\big) \quad (3)$$
$$h_t = \big[\overrightarrow{h_t}; \overleftarrow{h_t}\big] \quad (4)$$

where $x_t$ is the feature representation of the t-th input token, $\overrightarrow{h}_{t-1}$ is the hidden state of the (t − 1)-th input token, $\overleftarrow{h}_{t+1}$ is the hidden state of the (t + 1)-th input token, $\overrightarrow{h_t}$ represents the hidden state of the forward GRU, $\overleftarrow{h_t}$ represents the hidden state of the backward GRU, and $h_t$ represents the final representation of the current token after one layer of the BiGRU network.
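As a concrete toy instance of this recurrence, a minimal NumPy GRU cell (update and reset gates, as described above) and its bidirectional wrapper might look like this; weights are randomly initialized and dimensions are illustrative, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell: update gate z, reset gate r, candidate state."""
    def __init__(self, d_in, d_hid):
        scale = 0.1
        self.Wz = rng.normal(0, scale, (d_in + d_hid, d_hid))
        self.Wr = rng.normal(0, scale, (d_in + d_hid, d_hid))
        self.Wh = rng.normal(0, scale, (d_in + d_hid, d_hid))

    def step(self, x_t, h_prev):
        xh = np.concatenate([x_t, h_prev])
        z = sigmoid(xh @ self.Wz)                 # update gate
        r = sigmoid(xh @ self.Wr)                 # reset gate
        xh_r = np.concatenate([x_t, r * h_prev])
        h_tilde = np.tanh(xh_r @ self.Wh)         # candidate state
        return (1 - z) * h_prev + z * h_tilde

def bigru(X, d_hid=4):
    """Run forward and backward GRUs and concatenate per-step states."""
    l, d_in = X.shape
    fwd, bwd = GRUCell(d_in, d_hid), GRUCell(d_in, d_hid)
    h_f, forward = np.zeros(d_hid), []
    for t in range(l):                            # left-to-right pass
        h_f = fwd.step(X[t], h_f)
        forward.append(h_f)
    h_b, backward = np.zeros(d_hid), [None] * l
    for t in reversed(range(l)):                  # right-to-left pass
        h_b = bwd.step(X[t], h_b)
        backward[t] = h_b
    return np.stack([np.concatenate([f, b])
                     for f, b in zip(forward, backward)])

H = bigru(rng.normal(size=(5, 3)))
assert H.shape == (5, 8)  # l tokens, 2 * d_hid features each
```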
3.4. Parallel Decoding Layer Design
Following [
22], a non-autoregressive decoder (NAD) is employed, which formulates joint entity–relation extraction as an ensemble prediction problem. Unlike autoregressive decoders that condition on previously generated outputs, the NAD uses maskless self-attention to access information at all sequence positions simultaneously. Triples are therefore generated in parallel, which markedly improves efficiency. For each sentence, the decoder outputs a fixed-size set of N predicted triples, where N is set greater than the maximum number of triples present in any sentence. The query vector $Q \in \mathbb{R}^{N \times d}$ is initialized with N shared, learnable embeddings and is computed by Equation (5).
The core of the non-autoregressive decoder consists of M Transformer blocks. Each block comprises three sublayers: multi-head self-attention, multi-head cross-attention, and a feed-forward network. The multi-head self-attention sublayer models intra-triple dependencies. Its output projection is used as the query (
Q) for the multi-head cross-attention sublayer. The encoder outputs provide the keys (
K) and values (
V) for that cross-attention, thereby fusing sentence-level information. The computations of the multi-head cross-attention sublayer are given by Equations (6)–(8):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \quad (6)$$
$$\mathrm{head}_i = \mathrm{Attention}\big(QW_i^{Q}, KW_i^{K}, VW_i^{V}\big) \quad (7)$$
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^{O} \quad (8)$$
where $Q$ is obtained from $H_{sa}$, the output of the previous multi-head self-attention sublayer, and $K$ and $V$ are obtained from $H_e$, the output of the encoder. $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are trainable parameters, $h$ is the number of attention heads, and $d_k$ is the dimension of the key vectors. The attention scores are scaled by dividing by $\sqrt{d_k}$, which helps stabilize the gradients. The multi-head self-attention sublayer uses the same Formulas (6)–(8); unlike the cross-attention sublayer, its $Q$, $K$, and $V$ are all derived from the learnable embeddings. With the non-autoregressive decoder, the N triple queries are converted into N output embeddings, denoted as $H_d$.
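Under this standard scaled dot-product formulation, a single-head cross-attention step can be sketched in NumPy (toy dimensions, random weights; the real model uses multiple heads and trained parameters):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Hq, He, d_k=8):
    """Single-head cross-attention: queries from the decoder side (Hq),
    keys and values from the encoder output (He)."""
    d = Hq.shape[1]
    Wq = rng.normal(0, 0.1, (d, d_k))
    Wk = rng.normal(0, 0.1, (d, d_k))
    Wv = rng.normal(0, 0.1, (d, d_k))
    Q, K, V = Hq @ Wq, He @ Wk, He @ Wv
    # scores are scaled by sqrt(d_k) for gradient stability
    scores = softmax(Q @ K.T / np.sqrt(d_k))
    return scores @ V

N, l, d = 6, 10, 16           # N triple queries, l tokens, model width d
out = cross_attention(rng.normal(size=(N, d)), rng.normal(size=(l, d)))
assert out.shape == (6, 8)    # one d_k-dim output per triple query
```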
Finally, the output embedding $H_d$ is fed into an MLP network that decodes the entities and relations independently. Via softmax classifiers, the MLP predicts the relation label $r$ as well as the start and end positions of the head and tail entities, $s^{head}$, $e^{head}$, $s^{tail}$, and $e^{tail}$, as shown in Equations (9)–(13), where $W$ and $b$ are learnable parameters and $H$ is the output of the BERT-based BiGRU encoder. The final predicted triples are obtained as $\hat{T}$.
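The independent decoding heads can be sketched as follows. This is an illustrative single-layer version with random weights; the paper's exact layer equations (9)–(13) and dimensions are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_triple(h_d, H, num_relations=10):
    """Decode one triple query h_d against encoder output H (l x d):
    one relation classifier plus four span-position classifiers."""
    d = h_d.shape[0]
    W_rel = rng.normal(0, 0.1, (d, num_relations))
    relation = int(np.argmax(softmax(h_d @ W_rel)))  # relation label

    positions = {}
    for name in ("head_start", "head_end", "tail_start", "tail_end"):
        W = rng.normal(0, 0.1, (d, d))
        # score every token position conditioned on the query embedding
        scores = softmax(H @ W @ h_d)
        positions[name] = int(np.argmax(scores))
    return relation, positions

l, d = 10, 16
rel, pos = decode_triple(rng.normal(size=(d,)), rng.normal(size=(l, d)))
assert 0 <= rel < 10 and all(0 <= p < l for p in pos.values())
```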
3.5. Loss Function
For model optimization, the Bipartite Matching Loss (BMLoss) was employed [
22]. Optimal alignments between predicted and gold triples were obtained with the Hungarian algorithm. The loss comprises relation classification and entity position prediction, both computed via negative log-likelihood.