1. Introduction
In recent years, the cyber-threat landscape has grown markedly more complex; data breaches, ransomware incursions, and worm outbreaks have increased in both frequency and severity, imperiling the information security of organizations and society at large. In response to these evolving threats, security professionals have developed the discipline of cyber threat intelligence (CTI), which entails the systematic collection and analysis of voluminous data drawn from public advisories, internal system logs, and community-sourced reports, with the ultimate aim of mitigating associated risks [1,2]. CTI serves as a proactive defense mechanism; by extracting and correlating insights about adversary tactics, techniques, and procedures, organizations can anticipate risks more effectively and deploy appropriate protective measures [3]. Through the systematic collection and processing of unstructured and heterogeneous threat-related information, CTI enables security teams to reveal emerging attack patterns, prioritize vulnerabilities, and enhance overall resilience [4].
Despite these advantages, manual analysis methods are increasingly unable to handle the volume and diversity of modern CTI data. Such approaches typically depend on extensive expert intervention, thereby limiting scalability and impeding timely response. Consequently, there is a growing imperative for automated, high-precision techniques that expedite and enhance CTI-driven decision-making—most notably through the efficient extraction of structured information from textual sources. Within this context, Named Entity Recognition (NER) has emerged as a foundational technique, enabling the identification and categorization of key entities such as malware family names, software vulnerabilities, threat actor designations, and attack vectors.
NER systems have already demonstrated outstanding performance in diverse domains, including legal document analysis [5,6], social media monitoring [7], chemical literature parsing [8], and biographical text processing [9].
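To make the task concrete, the short sketch below shows the kind of BIO-tagged output a CTI-oriented NER system is expected to produce; the sentence, tokenization, and label inventory are purely illustrative and are not drawn from the datasets used later in this paper.

```python
# Illustrative NER input/output for a CTI sentence (BIO scheme).
# The sentence, tokenization, and label names are hypothetical examples.
tokens = ["APT28", "exploited", "CVE-2017-0144", "to", "deploy",
          "the", "X-Agent", "backdoor", "."]
labels = ["B-ThreatActor", "O", "B-Vulnerability", "O", "O",
          "O", "B-Malware", "O", "O"]

for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```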
In the cybersecurity context, recent developments in NER trace an evolution from classic sequence models to modern pretrained architectures. Early approaches typically employed long short-term memory (LSTM) networks combined with conditional random fields (CRFs) for sequence labeling, capturing local context and ensuring label consistency; subsequently, bidirectional LSTM (BiLSTM) models became popular by incorporating both forward and backward context, which improved entity recognition accuracy in CTI reports [10]. With the rise of Transformer-based models, pretrained language models such as BERT have achieved strong results across many NLP tasks [11,12], and BERT-CRF, which integrates contextual embeddings with structured prediction, has set new benchmarks in CTI entity extraction [13]. In addition to deep learning techniques, recent work has explored ontology-driven approaches for information extraction. For instance, Zaman et al. [14] proposed an ontological framework combining fuzzy rule-based systems and word sense disambiguation to extract structured metadata from scientific publications.
Despite these advances, several common challenges remain in CTI-NER. First, RNN-based models can struggle with long documents or cross-sentence dependencies, whereas CTI reports often contain lengthy descriptions and complex relations among entities. Second, BERT's masked language modeling pretraining can disrupt natural token dependencies and thus limit its ability to model global context across sentences [15]. For instance, in the DNRTI dataset, the sentence "CrowdStrike attributes the PUTTER PANDA group to PLA Unit 61486 within Shanghai, China" contains a nested organizational structure and distant dependencies between related tokens, both of which RNN- and BERT-based models often fail to capture coherently. Similarly, in the CTI-Reports dataset, the sentence "We decided to have an even closer look at Agent.BTZ and ComRAT and therefore analyzed the evolution of this RAT, covering seven years of development" presents multiple backdoor malware entities (e.g., Agent.BTZ, ComRAT, RAT) that are semantically related but appear across different positions. Such non-contiguous and semantically entangled mentions challenge models that rely heavily on local context or sequential processing. Third, the CTI domain features dense domain-specific terminology, fuzzy entity boundaries, and strong distance and directional dependencies, yet general pretrained models may not adapt fully to these characteristics. These limitations indicate a clear research gap: the need for a robust, context-aware NER architecture capable of modeling global semantics and maintaining label coherence in the presence of complex entity structures.
To address these limitations, we propose an XLNet-CRF framework specifically optimized for CTI-NER. Unlike masked-language models, XLNet uses permutation-based autoregressive pretraining that preserves natural token dependency patterns and avoids explicit masking. Its two-stream self-attention mechanism further enhances the modeling of long-range and cross-sentence context, which is crucial for capturing complex entity relations in CTI texts [15]. By combining XLNet's contextual representations with a CRF layer for structured label prediction, XLNet-CRF maintains sequence consistency while leveraging stronger global dependency modeling. In this work, we focus on effective fine-tuning and adaptation strategies in the CTI domain so that XLNet-CRF outperforms BERT-CRF and similar baselines in terms of entity extraction accuracy and robustness, while also addressing efficiency considerations for near-real-time CTI analysis. To facilitate transparency and reproducibility, we release our source code and experimental configurations at:
https://github.com/wth89/Efficient-Named-Entity-Recognition-for-Cyber-Threat-Intelli-gence-with-Permutation-Language-Modeling (accessed on 29 May 2025).
The main contributions of this paper are summarized as follows:
We propose XLNet-CRF, which integrates XLNet’s permutation-based language modeling with a CRF layer for structured prediction. This design effectively models contextual dependencies across sequences and enhances the accuracy and consistency of named entity labeling in cybersecurity texts.
We address the challenges of long-range dependencies and complex entity relationships in CTI reports by leveraging XLNet’s autoregressive pretraining, which outperforms traditional masked language models in cybersecurity-related NLP tasks.
Comprehensive evaluations on three benchmark cybersecurity corpora characterize the strengths and limitations of our XLNet-CRF framework; on CTI-Reports, it achieves an F1-score of 97.43% with a precision of 97.41%, and on MalwareTextDB, it attains an F1-score of 88.65% with a precision of 85.33%, significantly outperforming BERT-based baselines on both.
The remainder of this paper is organized as follows.
Section 2 reviews related work on CTI-focused NER.
Section 3 describes our proposed XLNet-CRF methodology.
Section 4 presents experimental results, evaluating our approach against existing baselines.
Section 5 provides a discussion of key findings and limitations. Finally,
Section 6 concludes the study and outlines directions for future work.
2. Related Work
2.1. Traditional Machine Learning Approaches
Mulwad et al. [16] employed a support vector machine (SVM) within a supervised learning framework to perform sequence labeling for entity extraction, optimizing hyperplane parameters and kernel functions directly on annotated training data. Concurrently, Chen et al. [17] explored the application of Maximum Entropy (ME) models, while Lafferty et al. [18] introduced CRFs as a probabilistic framework for sequence modeling tasks. These traditional approaches rely heavily on manual feature engineering and typically achieve strong performance only in domains with a limited set of entity categories and relatively stable terminology. In contrast, CTI corpora are characterized by continually evolving vocabulary, complex cross-sentence dependencies, and long-range contextual relations; under these conditions, classical models often exhibit degraded generalization, slow adaptation to emerging terms, and reduced accuracy for rare or novel entities.
2.2. RNN-Based Deep Learning Approaches
Recent developments in deep learning have driven the adoption of recurrent neural network (RNN) architectures for entity extraction tasks. Ma et al. [19] employed a bidirectional long short-term memory network with a conditional random field (BiLSTM-CRF) decoder to identify key entities in CTI reports. Building on this, Srivastava et al. [20] replaced the LSTM units with bidirectional gated recurrent units (BiGRUs) and integrated convolutional neural networks (CNNs) for local feature learning, coupling these representations with a CRF layer for structured label inference. Xiang et al. [21] further enhanced the BiGRU-CRF framework by incorporating self-attention mechanisms to better capture long-range dependencies. Although these models significantly outperform traditional methods by learning contextual embeddings in moderate-length texts, their inherently sequential processing impedes the modeling of global context in lengthy CTI documents and incurs substantial computational overhead. To overcome these limitations, attention-based architectures such as the Transformer have been proposed, enabling parallelized modeling of global dependencies and improved scalability for long and complex texts.
2.3. Transformer-Based Pretrained Models
Transformer-based pretrained language models have substantially advanced NER, including applications within the CTI domain. In particular, BERT-CRF and its variants have become strong baselines for CTI entity extraction tasks. Xie et al. [22] adopted a BERT-BiLSTM-CRF architecture for Chinese entity recognition, while Sanh et al. [23] proposed DistilBERT to reduce parameters and speed up inference. More recently, Wang et al. [24] introduced a hybrid BERT-IDCNN-BiGRU-CRF model that fuses contextual embeddings from BERT with multi-scale feature representations learned via an Iterated Dilated CNN (IDCNN) and BiGRU, thereby enhancing adaptability to diverse entity granularities.
Despite their empirical success, BERT-based models exhibit notable limitations in CTI settings. BERT's pretraining is based on a masked language modeling (MLM) objective, which introduces artificial [MASK] tokens that do not appear in downstream tasks. This results in a pretrain-finetune discrepancy, potentially harming generalization [15]. More fundamentally, BERT assumes independence among the masked tokens during reconstruction, preventing it from modeling interdependencies among predicted entities, which is particularly problematic in CTI documents that often contain nested or interrelated entities across sentence boundaries. In contrast, XLNet employs a generalized autoregressive pretraining strategy based on permutation language modeling, which enables it to capture bidirectional contextual dependencies without relying on corrupted input. By modeling the expected likelihood over all possible factorization orders of a sequence, XLNet preserves natural token co-occurrence and effectively learns long-range, high-order dependencies, thus avoiding BERT's independence assumption. Furthermore, XLNet incorporates the segment recurrence mechanism and relative positional encoding from Transformer-XL, rendering it particularly effective for modeling long-form CTI documents with complex cross-sentence dependencies.
3. Methodology
3.1. CRF
NER is commonly formulated as a sequence labeling task and typically adopts the BIO tagging scheme. Although XLNet is capable of generating powerful contextualized embeddings through permutation-based language modeling, it does not inherently support structured prediction. A simple linear classification layer stacked on top of XLNet can only make token-level decisions independently, failing to capture inter-label dependencies. To address this limitation, we integrate a CRF layer on top of XLNet. The CRF layer replaces the linear classifier and explicitly models the dependencies between adjacent output labels, enabling globally optimal label sequence inference. Formally, given an input token sequence $x = (x_1, x_2, \ldots, x_n)$ and its corresponding label sequence $y = (y_1, y_2, \ldots, y_n)$, the CRF defines the conditional probability as Equation (1):

$$P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{i=1}^{n} \sum_{k} \lambda_k f_k(y_{i-1}, y_i, x, i) \right) \quad (1)$$

Here, $Z(x)$ is a normalization term summing over all candidate label sequences, $\lambda_k$ is a learned weight, and $f_k(y_{i-1}, y_i, x, i)$ represents a feature function that captures certain properties of the input and output sequences at position $i$. For example, in the CTI sentence "The dll that connect to 'marina-info.net' may be the last stage-malware…", the tokens "dll" and "stage-malware" are both labeled as B-malware.backdoor, while "marina-info.net" is labeled as B-url.unknown. These entity types are often correlated and embedded in irregular sentence patterns. In such cases, a linear classifier would treat each token independently, potentially missing the structural relationships between adjacent labels. By contrast, a CRF layer explicitly models the dependencies between entity labels, making it a more suitable choice for structured prediction in CTI texts. For instance, by enforcing valid BIO transitions, the CRF layer helps maintain consistent labeling even when semantically related entities are scattered across complex sentence structures.
3.2. XLNet Model
3.2.1. Permutation Language Modeling
CTI documents often exhibit long-range dependencies and nested or non-contiguous entity mentions. For instance, the token “dll” labeled as malware.backdoor may occur several words before a URL tagged url.unknown, yet both are semantically linked. Conventional left-to-right language models can only condition on preceding context, and masked-language models introduce [MASK] tokens—thereby disrupting surface semantics and necessitating additional adaptation for downstream tasks.
XLNet addresses these issues through its Permutation Language Modeling (PLM) approach. During pretraining, it samples random permutations of the factorization order, realized through attention masks rather than by physically reordering the input, and maximizes the likelihood of each token conditioned only on the tokens that precede it in the sampled permutation. This method allows XLNet to create genuinely bidirectional representations without the need for inserting artificial tokens. Consequently, XLNet can capture both left and right context.
Figure 1 illustrates this process on a CTI sentence describing a malware–URL interaction. Given the input “The dll that connects to ‘marina-info.net’ may be the last stage-malware triggered under specific conditions”, PLM might sample the permutation “connect may dll ‘marina-info.net’ under stage-malware last the that specific conditions be triggered to”. It then masks out the final two tokens (“triggered” and “to”) and performs autoregressive predictions, first reconstructing “triggered” by attending only to its preceding tokens in the permuted order and then predicting “to” by conditioning on all earlier tokens plus the newly generated “triggered”. Repeating this over many permutations forces the encoder to integrate dispersed contextual cues—such as the relationship between “dll” and “stage-malware” or between “‘marina-info.net’” and its surrounding context—without ever corrupting the input with mask tokens. This bidirectional, permutation-based training thus equips XLNet to learn robust representations of non-contiguous and overlapping threat indicators in CTI text.
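The following toy sketch, which is illustrative rather than the actual XLNet pretraining code, shows how a sampled factorization order induces the attention constraint described above: each position may only attend to positions that occur no later than itself in the permuted order, and each token is predicted from its permutation-order predecessors.

```python
import torch

tokens = ["The", "dll", "that", "connects", "to", "marina-info.net"]
n = len(tokens)

perm = torch.randperm(n)                 # a sampled factorization order z
step = torch.empty(n, dtype=torch.long)
step[perm] = torch.arange(n)             # step[i]: when token i is predicted

# Content-stream-style mask: position i may attend to position j
# iff j occurs no later than i in the permuted order.
can_attend = step.unsqueeze(1) >= step.unsqueeze(0)   # (n, n) boolean mask

# Autoregressive factorization over the permuted order: the token at
# step t is predicted from the tokens at steps 0..t-1 only.
for t in range(n):
    context = [tokens[j] for j in perm[:t].tolist()]
    target = tokens[perm[t].item()]
    print(f"predict {target!r} from {context}")
```

Averaging this objective over many sampled permutations is what lets every token eventually condition on both its left and right neighbors without any [MASK] corruption.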
3.2.2. Two-Stream Self-Attention
Despite the ability of Permutation Language Modeling (PLM) to learn fully bidirectional contextual representations, allowing a token to attend to its own embedding during prediction can lead to information leakage. To mitigate this issue, XLNet-CRF incorporates a two-stream self-attention mechanism that explicitly decouples context representation from target token prediction. This is achieved through the content stream and query stream, each serving distinct roles in the modeling process.
In the content stream, the hidden state
at position
in layer
is computed using standard self-attention [
25], where the token’s own embedding serves as the query, and the contextual embeddings of all tokens preceding or at position
serve as keys and values. The detailed formulation is presented in Equation (2). As shown in
Figure 2a, this design enables the content stream to capture both the token’s intrinsic semantic representation and its surrounding contextual dependencies without contaminating the prediction target.
In contrast, the query stream computes the hidden state $g_{z_t}^{(m)}$ at position $z_t$ using positional information about $z_t$, but not its content, as the query, while treating the content stream's contextual representations of strictly preceding tokens as both keys and values. This formulation, presented in Equation (3), explicitly excludes the token's own content from the attention mechanism, thereby ensuring that prediction is based solely on surrounding context rather than the target token's embedding:

$$g_{z_t}^{(m)} = \mathrm{Attention}\big(Q = g_{z_t}^{(m-1)},\; KV = h_{z_{<t}}^{(m-1)};\; \theta\big) \quad (3)$$

As shown in Figure 2b, this two-stream design enforces a clear separation between context encoding and prediction, effectively mitigating the risk of information leakage.
In summary, the content stream integrates both contextual dependencies and the intrinsic content of the current position, while the query stream encodes positional and contextual information, excluding the content of the target token itself. During inference, predictions are generated from the query stream, enabling the model to leverage contextual cues for accurate sequence labeling while preserving the integrity of the predictive process. This mechanism ensures that the prediction is not influenced by the token’s own representation, thus avoiding information leakage and improving sequence labeling reliability. The overall XLNet-CRF architecture is depicted in
Figure 3.
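To tie the pieces together, the sketch below outlines one plausible implementation of the XLNet-CRF tagger, assuming the Hugging Face transformers implementation of XLNet and the pytorch-crf package; the checkpoint name, dropout rate, and layer names are illustrative rather than the paper's exact configuration.

```python
import torch.nn as nn
from transformers import XLNetModel   # Hugging Face implementation (assumed)
from torchcrf import CRF              # pip install pytorch-crf (assumed)

class XLNetCRF(nn.Module):
    """Sketch of the tagger: XLNet encoder -> linear emissions -> CRF decoder."""

    def __init__(self, num_tags, checkpoint="xlnet-base-cased", dropout=0.1):
        super().__init__()
        self.encoder = XLNetModel.from_pretrained(checkpoint)
        self.dropout = nn.Dropout(dropout)
        self.emit = nn.Linear(self.encoder.config.d_model, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        scores = self.emit(self.dropout(hidden))
        mask = attention_mask.bool()   # assumes right-padded batches
        if labels is not None:
            # Training: negative log-likelihood of the gold label sequence
            return -self.crf(scores, labels, mask=mask, reduction="mean")
        # Inference: globally optimal (Viterbi) label paths
        return self.crf.decode(scores, mask=mask)
```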
4. Experiments and Results
4.1. Dataset Description
To evaluate the performance of the XLNet-CRF model for named entity recognition in cyber threat intelligence, this paper utilizes three publicly available datasets for experimentation. These datasets encompass a diverse array of threat intelligence content, including various types of named entities such as domain names, malware names, and attack techniques, thereby providing a multi-dimensional reference for model evaluation. To ensure the scientific validity and robustness of the experimental results, each dataset is partitioned into training (70%), validation (10%), and testing (20%) sets. Furthermore, all data adhere to a unified BIO annotation format, which facilitates the model's accurate identification of entity boundaries and categories during both training and evaluation, thereby enhancing the reliability and comparability of the results. A sketch of this preprocessing appears below; the subsequent sections introduce the three datasets.
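The following sketch shows one way to read a CoNLL-style BIO file and apply the 70/10/20 partition; the file name and shuffling seed are hypothetical, and the exact preprocessing used in our experiments is available in the released repository.

```python
import random

def read_bio(path):
    """Parse a CoNLL-style BIO file: one 'token tag' pair per line,
    with blank lines separating sentences."""
    sentences, toks, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if toks:
                    sentences.append((toks, tags))
                    toks, tags = [], []
                continue
            parts = line.split()
            toks.append(parts[0])
            tags.append(parts[-1])
    if toks:
        sentences.append((toks, tags))
    return sentences

sentences = read_bio("dnrti.bio")        # hypothetical file name
random.Random(42).shuffle(sentences)     # illustrative seed
n = len(sentences)
train = sentences[: int(0.7 * n)]              # 70%
valid = sentences[int(0.7 * n): int(0.8 * n)]  # 10%
test  = sentences[int(0.8 * n):]               # 20%
```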
4.1.1. DNRTI Dataset
The DNRTI dataset [26] encompasses 13 distinct categories of information annotations, including hacker organization, attack, sample file, security team, tool, time, purpose, area, industry, organization, way, loophole, and features. The corresponding labels of these 13 categories in the dataset are HackOrg, OffAct, SamFile, SecTeam, Tool, Time, Purp, Area, Idus, Org, Way, Exp, and Features, respectively. The detailed distribution of the labels is shown in
Table 1.
4.1.2. CTI-Reports Dataset
The CTI-Reports dataset [27], released by nlpai-lab on GitHub, consists of 310,406 records of threat intelligence reports. These records cover four main types of indicators: malware, IP addresses, URLs, and hash values, annotated with ten label types in total: five malware subcategories (malware.backdoor, malware.infosteal, malware.ransom, malware.unknown, and malware.drop), three URL subcategories (url.normal, url.unknown, and url.cncsvr), ip.unknown, and hash. The quantities and proportions of each label are illustrated in
Table 2.
4.1.3. MalwareTextDB Dataset
MalwareTextDB [28] is the inaugural dataset specifically designed for the text annotation of malicious software reports. Its primary purpose is to provide a structured and fully annotated corpus in the field of cybersecurity, facilitating the automatic extraction of behaviors, attributes, and related information of malicious software through natural language processing technology. This dataset encompasses three types of entities: Action, Modifier, and Entity. The Entity label annotates noun phrases related to malware actions, including the initiator (Subject) or recipient (Object) of an action, as well as phrases providing additional context. The Action label refers to an event, such as "registers", "provides", and "is written". The Modifier label refers to tokens that link to other phrases elaborating on the Action, such as "as" and "to". The quantities and proportions of each label are shown in
Table 3.
4.2. Experimental Setup and Evaluation Metrics
In this study, we utilized an NVIDIA RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA) with 24 GB of memory, in conjunction with a 14-vCPU Intel Xeon Platinum 8362 processor (Intel Corporation, Santa Clara, CA, USA) operating at 2.80 GHz. The system was configured with Ubuntu 22.04.3, and the primary development framework consisted of Python 3.8 and PyTorch 2.0.0, while CUDA 12.1 was employed to fully leverage GPU acceleration. The training configurations for XLNet-CRF across different datasets are summarized in
Table 4. Additionally, the proposed XLNet-CRF model is evaluated using widely adopted metrics in the field of NER: precision, recall, and F1-score.
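Entity-level precision, recall, and F1-score can be computed with the seqeval library, a common choice for BIO-tagged NER evaluation (the paper does not name its scorer); the tag sequences below are illustrative.

```python
from seqeval.metrics import precision_score, recall_score, f1_score

# Entity-level evaluation over BIO sequences: a prediction counts as correct
# only if both the span boundaries and the entity type match exactly.
y_true = [["B-HackOrg", "I-HackOrg", "O", "B-Area", "I-Area"]]
y_pred = [["B-HackOrg", "I-HackOrg", "O", "B-Area", "O"]]

print(precision_score(y_true, y_pred))  # 0.5: one of two predicted spans is correct
print(recall_score(y_true, y_pred))     # 0.5: one of two gold spans is recovered
print(f1_score(y_true, y_pred))         # 0.5
```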
4.3. Experimental Results and Analysis
Table 5 summarizes the evaluation results of five models: BERT-CRF, secBERT-CRF, BERT-BiLSTM-CRF, secBERT-BiLSTM-CRF, and the proposed XLNet-CRF.
4.3.1. Result on CTI-Reports
As shown in
Table 5, the CTI-Reports dataset reveals notable variations in model performance across precision, recall, and F1-score. The BERT-CRF model achieves an F1-score of 77.29%, with exceptionally high precision (98.37%) but relatively low recall (74.10%). This suggests that while BERT-CRF is effective in identifying named entities with high confidence, it fails to detect a substantial portion of relevant instances, thereby limiting its recall.
The secBERT-CRF model, a security-domain-enhanced variant, performs somewhat worse overall, yielding an F1-score of 72.52%, primarily because its recall falls to 66.69% despite comparable precision (97.42%). The incorporation of a BiLSTM layer leads to mixed results; the BERT-BiLSTM-CRF model achieves an F1-score of 74.39%, with a recall of 66.27%, indicating marginal improvement in sequence modeling but limited gains in recall. Notably, the secBERT-BiLSTM-CRF model records a recall of 80.31%, the highest among the baselines, but its overall F1-score drops to 68.05% (precision 97.31%). This pattern implies that while BiLSTM enhances the model's ability to capture sequential dependencies, the gains in recall do not translate into consistent overall performance. As shown in
Figure 4, this effect can be observed more intuitively through the comparative analysis of prediction outcomes.
In contrast, the proposed XLNet-CRF model consistently outperforms these baselines by leveraging superior contextual understanding, positional flexibility, and robust subtoken integration. For instance, in the sentence “Some victims of Operation Manul have expressed a desire to preserve their anonymity, which we respect”, BERT-CRF fails to identify “Operation Manul” as a malware.ransom entity, while XLNet-CRF correctly labels the two-token span. This can be attributed to XLNet’s permutation-based pretraining, which enables each token to attend to all surrounding positions, thereby capturing cohesive representations of multi-word expressions. Moreover, its use of relative positional encoding allows for dynamic modeling of inter-token relationships, even when tokens are non-contiguous—something absolute positional encodings often fail to achieve.
Although XLNet-CRF demonstrates robust aggregate performance, its effectiveness may be uneven across entity types due to several underlying factors.
First, the CTI-Reports dataset is heavily skewed toward dominant categories such as hash and malware.backdoor, while minority classes like malware.ransom and url.normal are sparsely represented. This imbalance causes the model to learn more stable decision boundaries for frequent classes, while underperforming on rare categories.
Second, entity types differ in structural and contextual cues. Structured identifiers such as IPv4 addresses (e.g., 192.168.100.45) consist of digits and punctuation with minimal semantic context, making them difficult to identify. In contrast, malware entities often appear alongside technical terms (e.g., “C2 node”, “proxy infrastructure”), which offer stronger contextual signals.
Third, the tokenization process may fragment certain structured entities into multiple subtokens. For example, a tokenizer might split 192.168.100.45 into tokens such as “192”, “.”, “168”, etc., dispersing its semantic content and potentially hindering coherent representation and recognition.
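This fragmentation effect can be checked directly with the XLNet tokenizer from Hugging Face transformers; the exact subword pieces depend on the SentencePiece vocabulary, so the output shown in the comment is indicative only.

```python
from transformers import XLNetTokenizer

tok = XLNetTokenizer.from_pretrained("xlnet-base-cased")

# A structured indicator is shattered into many subtokens, dispersing the
# signal that the tagger must reassemble into a single entity span.
print(tok.tokenize("192.168.100.45"))
# indicative output: ['▁19', '2', '.', '168', '.', '100', '.', '45']
```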
These factors may contribute to observed performance differences across entity types, even when aggregate metrics remain high. In particular, the model may perform more consistently on frequent and semantically rich entities, while exhibiting limitations when dealing with infrequent, structurally complex, or token-fragmented instances.
4.3.2. Result on MalwareTextDB
As shown in
Table 5, model performance on the MalwareTextDB dataset varies significantly, with XLNet-CRF achieving the most balanced results—precision of 85.33%, recall of 92.24%, and an F1-score of 88.65%—demonstrating its effectiveness in capturing long-range and bidirectional context. In contrast, BERT-CRF, despite attaining high precision (87.76%), exhibits extremely low recall (47.39%), resulting in a substantially reduced F1-score (58.57%). This disparity highlights the limitations of masked-language modeling in scenarios where complex entity spans must be captured under limited annotation coverage.
A distinctive feature of MalwareTextDB is its small label set—comprising only three entity types (Entity, Action, and Modifier)—paired with dense and overlapping annotations. For example, in the annotated span “... they are (B-Action) infected (I-Action) with (B-Modifier) a rare APT Trojan (B/I/I/I-Entity) posing (B-Action) as (B-Modifier) any one of several major software releases (B/I/I/I/I/I-Entity)”, the model must accurately capture long, nested, and semantically entangled spans. XLNet-CRF’s permutation-based pretraining enables it to attend to bidirectional context and encode multi-token dependencies, thereby supporting coherent span boundary detection for complex constructions such as modifier–action–entity triplets. In contrast, BERT-based models—particularly those without domain-specific adaptation—tend to produce fragmented or incomplete predictions due to their limited global contextual reasoning.
Integrating BiLSTM into BERT (i.e., BERT-BiLSTM-CRF) maintains comparable precision (85.59%) but further reduces recall to 38.92%, as shown in
Figure 5. This decline may result from overfitting on limited training samples or interference between the sequential encoder (BiLSTM) and contextual embeddings. The domain-adapted secBERT-CRF slightly improves recall (57.16%) and F1-score (62.53%) over BERT-CRF, reflecting the advantage of exposure to security-specific terminology. However, secBERT-BiLSTM-CRF exhibits inconsistent performance; while its recall peaks at 69.72%, the F1-score drops to 47.07%, suggesting that the combination of domain adaptation and architectural complexity may hinder overall model stability.
4.3.3. Result on DNRTI
As shown in
Table 5, XLNet-CRF underperforms on the DNRTI dataset (precision 82.86%, recall 81.92%, F1 82.39%) relative to its BERT-based counterparts. In particular, BERT-CRF achieves the highest F1-score (90.02%), and secBERT-CRF strikes a strong balance between precision (96.00%) and recall (88.80%), reflecting the benefits of domain-specific pretraining.
A closer examination suggests that XLNet-CRF's permutation-based objective, although effective for capturing long-range dependencies, is less suited to DNRTI's concise, high-density entity types, such as email-style identifiers and short multi-word expressions. For instance, in the sentence "The admin@338 started targeting Hong Kong media companies, probably in response to political and economic challenges in Hong Kong and China", BERT-CRF correctly labels "admin@338" (HackOrg), "Hong Kong" (Area), and "media companies" (Org), whereas XLNet-CRF tends to fragment or misclassify these spans. This discrepancy can be attributed to several factors: BERT's WordPiece tokenizer, especially when domain-adapted, handles email-like tokens more effectively, and its masked language modeling objective promotes contiguous span prediction. In contrast, XLNet's permutation-based modeling may assign less weight to short, information-dense sequences, and its general-purpose SentencePiece tokenizer can split coherent tokens such as "admin@338" into several subword pieces, diluting semantic cues and hindering accurate CRF decoding. These findings suggest that further adaptation may be necessary for XLNet-CRF to handle short, domain-specific entities effectively.
Furthermore, the integration of BiLSTM layers into BERT-based architectures offers only marginal improvements on DNRTI, as shown in
Figure 6, suggesting that sequential modeling provides limited additional value for this dataset’s compact and domain-specific entity structures.
5. Discussion
The experimental results across the three cybersecurity datasets reveal distinct performance patterns shaped by the interplay between dataset characteristics and model architecture. XLNet-CRF consistently achieves strong overall results on the CTI-Reports and MalwareTextDB datasets, where its permutation-based pretraining proves effective in modeling long-range dependencies and capturing complex entity structures. However, its performance declines on the DNRTI dataset, underscoring that model effectiveness is not universal and depends heavily on factors such as the pretraining objective, tokenization behavior, and entity type characteristics.
On CTI-Reports, XLNet-CRF attains the most balanced performance, surpassing BERT-based baselines in both recall and F1-score. This advantage stems from three core strengths: (1) its permutation-based training objective facilitates rich contextual representation, (2) its relative positional encoding supports flexible cross-sentence modeling, and (3) its robustness in reconstructing subword-tokenized entities enhances span-level coherence. For example, XLNet-CRF successfully identifies multi-token entities like "Operation Manul", which BERT-CRF fails to capture, owing to XLNet's capacity to model noncontiguous dependencies and integrate dispersed semantic cues. Nevertheless, a closer inspection reveals variability across entity categories. Lexically consistent types (e.g., hash) are recognized accurately, while rare or morphologically complex entities (e.g., IPv4 addresses, url.normal) are more error-prone. These discrepancies are likely influenced by class imbalance, subtoken fragmentation, and the limited contextual cues associated with structurally sparse entities.
In MalwareTextDB, which contains only three entity types but features dense, nested, and overlapping annotations, XLNet-CRF again delivers superior results. Its bidirectional context modeling proves particularly effective for extracting structurally interwoven patterns, such as modifier–action–entity triplets. BERT-based models, while exhibiting high precision, suffer from low recall, reflecting a tendency to miss multi-token spans under sparse supervision. Notably, incorporating BiLSTM layers into BERT architectures results in unstable performance—potentially due to overfitting on limited samples or interference between sequential and contextual encoders—suggesting that added complexity does not necessarily translate into better generalization, especially when entity boundaries are fluid or hierarchical.
Conversely, this advantage does not carry over to the DNRTI dataset. Here, XLNet-CRF underperforms relative to both general and domain-adapted BERT variants. The dataset’s compact and lexically specific entity types—such as email-like identifiers (“admin@338”) and short multi-word phrases (“media companies”)—are better accommodated by BERT’s masked language modeling and WordPiece tokenizer. Domain-adapted variants like secBERT-CRF benefit further from exposure to security-related terminology. XLNet-CRF, by contrast, struggles with fragmented subtoken representations and places less emphasis on contiguous spans, resulting in diminished span-level consistency. This highlights a potential limitation of permutation-based modeling in domains characterized by short, lexically rigid, and structurally compact entities.
Collectively, these findings suggest that no single architecture universally outperforms others across all cybersecurity NER tasks. XLNet-CRF is better suited to scenarios involving long, nested, or semantically diffuse spans, while BERT-based models—especially those augmented with domain adaptation—excel in recognizing short, dense, and lexically distinctive entities. Future research may benefit from hybrid approaches that integrate the strengths of both paradigms, such as combining permutation-based attention with domain-specific tokenization or leveraging adapter modules for adaptive fine-tuning in varied CTI environments.
6. Conclusions and Future Work
This study presents an NER-based framework for CTI analysis, aimed at improving the efficiency of entity extraction from unstructured threat data. By incorporating a two-stream self-attention mechanism, the proposed XLNet-CRF model effectively captures long-range dependencies and contextual semantics. The integration of a CRF layer further enhances label consistency, enabling accurate recognition of key security entities such as vulnerabilities, attack techniques, and affected assets. Experimental results on benchmark datasets, including CTI-Reports and MalwareTextDB, demonstrate the model's robustness in balancing contextual understanding with structured entity recognition.
Nonetheless, performance degradation is observed on the DNRTI dataset, which is characterized by domain-specific terminology and compact entity spans. This limitation highlights the challenge of adapting pretrained models to specialized cybersecurity domains. While XLNet’s permutation-based pretraining captures bidirectional dependencies effectively, it may require further adaptation to accommodate domain-specific linguistic structures and short-text contexts.
In future work, we will first focus on enhancing the domain adaptability of XLNet through targeted fine-tuning on cybersecurity-specific corpora. This will help mitigate the performance limitations observed on short or terminology-heavy CTI texts and further improve entity recognition accuracy in low-resource subcategories.
Building upon improved NER outputs, we then plan to integrate the extracted entities into a structured cyber threat knowledge graph to support downstream reasoning and analytical tasks. By linking entities such as malware names, IP addresses, and vulnerabilities into a graph-based representation, we aim to explicitly model relationships among threat actors, tactics, and indicators. This integration will support cross-document entity linking, enrich sparse threat contexts, and enable advanced CTI applications such as threat correlation, actor profiling, and predictive defense strategy development.