1. Introduction
In recent years, the cyber-threat landscape has grown markedly more complex; data breaches, ransomware incursions, and worm outbreaks have increased in both frequency and severity, imperiling the information security of organizations and society at large. In response to these evolving threats, security professionals have developed the discipline of cyber threat intelligence (CTI), which entails the systematic collection and analysis of voluminous data drawn from public advisories, internal system logs, and community-sourced reports, with the ultimate aim of mitigating associated risks [1,2]. CTI serves as a proactive defense mechanism; by extracting and correlating insights about adversary tactics, techniques, and procedures, organizations can anticipate risks more effectively and deploy appropriate protective measures [3]. Through the systematic collection and processing of unstructured and heterogeneous threat-related information, CTI enables security teams to reveal emerging attack patterns, prioritize vulnerabilities, and enhance overall resilience [4].
Despite these advantages, manual analysis methods are increasingly unable to handle the volume and diversity of modern CTI data. Such approaches typically depend on extensive expert intervention, thereby limiting scalability and impeding timely response. Consequently, there is a growing imperative for automated, high-precision techniques that expedite and enhance CTI-driven decision-making—most notably through the efficient extraction of structured information from textual sources. Within this context, Named Entity Recognition (NER) has emerged as a foundational technique, enabling the identification and categorization of key entities such as malware family names, software vulnerabilities, threat actor designations, and attack vectors.
NER systems have already demonstrated outstanding performance in diverse domains, including legal document analysis [5,6], social media monitoring [7], chemical literature parsing [8], and biographical text processing [9].
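To make the task concrete, the short sketch below shows the kind of BIO-tagged output a CTI-oriented NER system is expected to produce; the sentence, tokenization, and label inventory are purely illustrative and are not drawn from the datasets used later in this paper.

```python
# Illustrative NER input/output for a CTI sentence (BIO scheme).
# The sentence, tokenization, and label names are hypothetical examples.
tokens = ["APT28", "exploited", "CVE-2017-0144", "to", "deploy",
          "the", "X-Agent", "backdoor", "."]
labels = ["B-ThreatActor", "O", "B-Vulnerability", "O", "O",
          "O", "B-Malware", "O", "O"]

for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```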
In the cybersecurity context, recent developments in NER trace an evolution from classic sequence models to modern pretrained architectures. Early approaches typically employed long short-term memory (LSTM) networks combined with conditional random fields (CRFs) for sequence labeling, capturing local context and ensuring label consistency; subsequently, bidirectional LSTM (BiLSTM) models became popular by incorporating both forward and backward context, which improved entity recognition accuracy in CTI reports [10]. With the rise of Transformer-based models, pretrained language models such as BERT have achieved strong results across many NLP tasks [11,12], and BERT-CRF, which integrates contextual embeddings with structured prediction, has set new benchmarks in CTI entity extraction [13]. In addition to deep learning techniques, recent work has explored ontology-driven approaches for information extraction. For instance, Zaman et al. [14] proposed an ontological framework combining fuzzy rule-based systems and word sense disambiguation to extract structured metadata from scientific publications.
Despite these advances, several common challenges remain in CTI-NER. First, RNN-based models can struggle with long documents or cross-sentence dependencies, whereas CTI reports often contain lengthy descriptions and complex relations among entities. Second, BERT's masked language modeling pretraining can disrupt natural token dependencies and thus limit its ability to model global context across sentences [15]. For instance, in the DNRTI dataset, the sentence "CrowdStrike attributes the PUTTER PANDA group to PLA Unit 61486 within Shanghai, China" contains a nested organizational structure and distant dependencies between related tokens, both of which RNN- and BERT-based models often fail to capture coherently. Similarly, in the CTI-Reports dataset, the sentence "We decided to have an even closer look at Agent.BTZ and ComRAT and therefore analyzed the evolution of this RAT, covering seven years of development" presents multiple backdoor malware entities (e.g., Agent.BTZ, ComRAT, RAT) that are semantically related but appear across different positions. Such non-contiguous and semantically entangled mentions challenge models that rely heavily on local context or sequential processing. Third, the CTI domain features dense domain-specific terminology, fuzzy entity boundaries, and strong distance and directional dependencies, yet general pretrained models may not adapt fully to these characteristics. These limitations indicate a clear research gap: the need for a robust, context-aware NER architecture capable of modeling global semantics and maintaining label coherence in the presence of complex entity structures.
To address these limitations, we propose an XLNet-CRF framework specifically optimized for CTI-NER. Unlike masked-language models, XLNet uses permutation-based autoregressive pretraining that preserves natural token dependency patterns and avoids explicit masking. Its two-stream self-attention mechanism further enhances the modeling of long-range and cross-sentence context, which is crucial for capturing complex entity relations in CTI texts [15]. By combining XLNet's contextual representations with a CRF layer for structured label prediction, XLNet-CRF maintains sequence consistency while leveraging stronger global dependency modeling. In this work, we focus on effective fine-tuning and adaptation strategies in the CTI domain so that XLNet-CRF outperforms BERT-CRF and similar baselines in terms of entity extraction accuracy and robustness, while also addressing efficiency considerations for near-real-time CTI analysis. To facilitate transparency and reproducibility, we release our source code and experimental configurations at:
https://github.com/wth89/Efficient-Named-Entity-Recognition-for-Cyber-Threat-Intelli-gence-with-Permutation-Language-Modeling (accessed on 29 May 2025).
The main contributions of this paper are summarized as follows:
We propose XLNet-CRF, which integrates XLNet’s permutation-based language modeling with a CRF layer for structured prediction. This design effectively models contextual dependencies across sequences and enhances the accuracy and consistency of named entity labeling in cybersecurity texts.
We address the challenges of long-range dependencies and complex entity relationships in CTI reports by leveraging XLNet’s autoregressive pretraining, which outperforms traditional masked language models in cybersecurity-related NLP tasks.
Comprehensive evaluations on three benchmark cybersecurity corpora characterize the strengths and limitations of our XLNet-CRF framework; on CTI-Reports, it achieves an F1-score of 97.43% with a precision of 97.41%, and on MalwareTextDB, it attains an F1-score of 88.65% with a precision of 85.33%, significantly outperforming BERT-based baselines on both.
The remainder of this paper is organized as follows.
Section 2 reviews related work on CTI-focused NER.
Section 3 describes our proposed XLNet-CRF methodology.
Section 4 presents experimental results, evaluating our approach against existing baselines.
Section 5 provides a discussion of key findings and limitations. Finally,
Section 6 concludes the study and outlines directions for future work.
2. Related Work
2.1. Traditional Machine Learning Approaches
Mulwad et al. [16] employed a support vector machine (SVM) within a supervised learning framework to perform sequence labeling for entity extraction, optimizing hyperplane parameters and kernel functions directly on annotated training data. Concurrently, Chen et al. [17] explored the application of Maximum Entropy (ME) models, while Lafferty et al. [18] introduced CRFs as a probabilistic framework for sequence modeling tasks. These traditional approaches rely heavily on manual feature engineering and typically achieve strong performance only in domains with a limited set of entity categories and relatively stable terminology. In contrast, CTI corpora are characterized by continually evolving vocabulary, complex cross-sentence dependencies, and long-range contextual relations; under these conditions, classical models often exhibit degraded generalization, slow adaptation to emerging terms, and reduced accuracy for rare or novel entities.
2.2. RNN-Based Deep Learning Approaches
Recent developments in deep learning have driven the adoption of recurrent neural network (RNN) architectures for entity extraction tasks. Ma et al. [19] employed a bidirectional long short-term memory network with a conditional random field (BiLSTM-CRF) decoder to identify key entities in CTI reports. Building on this, Srivastava et al. [20] replaced the LSTM units with bidirectional gated recurrent units (BiGRUs) and integrated convolutional neural networks (CNNs) for local feature learning, coupling these representations with a CRF layer for structured label inference. Xiang et al. [21] further enhanced the BiGRU-CRF framework by incorporating self-attention mechanisms to better capture long-range dependencies. Although these models significantly outperform traditional methods by learning contextual embeddings in moderate-length texts, their inherently sequential processing impedes the modeling of global context in lengthy CTI documents and incurs substantial computational overhead. To overcome these limitations, attention-based architectures such as the Transformer have been proposed, enabling parallelized modeling of global dependencies and improved scalability for long and complex texts.
2.3. Transformer-Based Pretrained Models
Transformer-based pretrained language models have substantially advanced NER, including applications within the CTI domain. In particular, BERT-CRF and its variants have become strong baselines for CTI entity extraction tasks. Xie et al. [22] adopted a BERT-BiLSTM-CRF architecture for Chinese entity recognition, while Sanh et al. [23] proposed DistilBERT to reduce parameters and speed up inference. More recently, Wang et al. [24] introduced a hybrid BERT-IDCNN-BiGRU-CRF model that fuses contextual embeddings from BERT with multi-scale feature representations learned via an Iterated Dilated CNN (IDCNN) and BiGRU, thereby enhancing adaptability to diverse entity granularities.
Despite their empirical success, BERT-based models exhibit notable limitations in CTI settings. BERT's pretraining is based on a masked language modeling (MLM) objective, which introduces artificial [MASK] tokens that do not appear in downstream tasks. This results in a pretrain-finetune discrepancy, potentially harming generalization [15]. More fundamentally, BERT assumes independence among the masked tokens during reconstruction, preventing it from modeling interdependencies among predicted entities, which is particularly problematic in CTI documents that often contain nested or interrelated entities across sentence boundaries. In contrast, XLNet employs a generalized autoregressive pretraining strategy based on permutation language modeling, which enables it to capture bidirectional contextual dependencies without relying on corrupted input. By modeling the expected likelihood over all possible factorization orders of a sequence, XLNet preserves natural token co-occurrence and effectively learns long-range, high-order dependencies, thus avoiding BERT's independence assumption. Furthermore, XLNet incorporates the segment recurrence mechanism and relative positional encoding from Transformer-XL, rendering it particularly effective for modeling long-form CTI documents with complex cross-sentence dependencies.
3. Methodology
3.1. CRF
NER is commonly formulated as a sequence labeling task and typically adopts the BIO tagging scheme. Although XLNet is capable of generating powerful contextualized embeddings through permutation-based language modeling, it does not inherently support structured prediction. A simple linear classification layer stacked on top of XLNet can only make token-level decisions independently, failing to capture inter-label dependencies. To address this limitation, we integrate a CRF layer on top of XLNet. The CRF layer replaces the linear classifier and explicitly models the dependencies between adjacent output labels, enabling globally optimal label sequence inference. Formally, given an input token sequence $x = (x_1, x_2, \ldots, x_n)$ and its corresponding label sequence $y = (y_1, y_2, \ldots, y_n)$, the CRF defines the conditional probability as Equation (1):

$$P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{i=1}^{n} \sum_{k} \lambda_k f_k(y_{i-1}, y_i, x, i) \right) \quad (1)$$

Here, $Z(x)$ is a normalization term summing over all candidate label sequences, $\lambda_k$ is a learned weight, and $f_k(y_{i-1}, y_i, x, i)$ represents a feature function that captures certain properties of the input and output sequences at position $i$. For example, in the CTI sentence "The dll that connect to 'marina-info.net' may be the last stage-malware…", the tokens "dll" and "stage-malware" are both labeled as B-malware.backdoor, while "marina-info.net" is labeled as B-url.unknown. These entity types are often correlated and embedded in irregular sentence patterns. In such cases, a linear classifier would treat each token independently, potentially missing the structural relationships between adjacent labels. By contrast, a CRF layer explicitly models the dependencies between entity labels, making it a more suitable choice for structured prediction in CTI texts. For instance, by enforcing valid BIO transitions, the CRF layer helps maintain consistent labeling even when semantically related entities are scattered across complex sentence structures.
3.2. XLNet Model
3.2.1. Permutation Language Modeling
CTI documents often exhibit long-range dependencies and nested or non-contiguous entity mentions. For instance, the token “dll” labeled as malware.backdoor may occur several words before a URL tagged url.unknown, yet both are semantically linked. Conventional left-to-right language models can only condition on preceding context, and masked-language models introduce [MASK] tokens—thereby disrupting surface semantics and necessitating additional adaptation for downstream tasks.
XLNet addresses these issues through its Permutation Language Modeling (PLM) approach. During pretraining, it samples random permutations of the factorization order, realized through attention masks rather than by physically reordering the input, and maximizes the likelihood of each token conditioned only on the tokens that precede it in the sampled permutation. This method allows XLNet to create genuinely bidirectional representations without the need for inserting artificial tokens. Consequently, XLNet can capture both left and right context.
Figure 1 illustrates this process on a CTI sentence describing a malware–URL interaction. Given the input “The dll that connects to ‘marina-info.net’ may be the last stage-malware triggered under specific conditions”, PLM might sample the permutation “connect may dll ‘marina-info.net’ under stage-malware last the that specific conditions be triggered to”. It then masks out the final two tokens (“triggered” and “to”) and performs autoregressive predictions, first reconstructing “triggered” by attending only to its preceding tokens in the permuted order and then predicting “to” by conditioning on all earlier tokens plus the newly generated “triggered”. Repeating this over many permutations forces the encoder to integrate dispersed contextual cues—such as the relationship between “dll” and “stage-malware” or between “‘marina-info.net’” and its surrounding context—without ever corrupting the input with mask tokens. This bidirectional, permutation-based training thus equips XLNet to learn robust representations of non-contiguous and overlapping threat indicators in CTI text.
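The following toy sketch, which is illustrative rather than the actual XLNet pretraining code, shows how a sampled factorization order induces the attention constraint described above: each position may only attend to positions that occur no later than itself in the permuted order, and each token is predicted from its permutation-order predecessors.

```python
import torch

tokens = ["The", "dll", "that", "connects", "to", "marina-info.net"]
n = len(tokens)

perm = torch.randperm(n)                 # a sampled factorization order z
step = torch.empty(n, dtype=torch.long)
step[perm] = torch.arange(n)             # step[i]: when token i is predicted

# Content-stream-style mask: position i may attend to position j
# iff j occurs no later than i in the permuted order.
can_attend = step.unsqueeze(1) >= step.unsqueeze(0)   # (n, n) boolean mask

# Autoregressive factorization over the permuted order: the token at
# step t is predicted from the tokens at steps 0..t-1 only.
for t in range(n):
    context = [tokens[j] for j in perm[:t].tolist()]
    target = tokens[perm[t].item()]
    print(f"predict {target!r} from {context}")
```

Averaging this objective over many sampled permutations is what lets every token eventually condition on both its left and right neighbors without any [MASK] corruption.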
3.2.2. Two-Stream Self-Attention
Despite the ability of Permutation Language Modeling (PLM) to learn fully bidirectional contextual representations, allowing a token to attend to its own embedding during prediction can lead to information leakage. To mitigate this issue, XLNet-CRF incorporates a two-stream self-attention mechanism that explicitly decouples context representation from target token prediction. This is achieved through the content stream and query stream, each serving distinct roles in the modeling process.
In the content stream, the hidden state
at position
in layer
is computed using standard self-attention [
25], where the token’s own embedding serves as the query, and the contextual embeddings of all tokens preceding or at position
serve as keys and values. The detailed formulation is presented in Equation (2). As shown in
Figure 2a, this design enables the content stream to capture both the token’s intrinsic semantic representation and its surrounding contextual dependencies without contaminating the prediction target.
In contrast, the query stream computes the hidden state $g_{z_t}^{(m)}$ at position $z_t$ using positional information about $z_t$, but not its content, as the query, while treating the content stream's contextual representations of strictly preceding tokens as both keys and values. This formulation, presented in Equation (3), explicitly excludes the token's own content from the attention mechanism, thereby ensuring that prediction is based solely on surrounding context rather than the target token's embedding:

$$g_{z_t}^{(m)} = \mathrm{Attention}\big(Q = g_{z_t}^{(m-1)},\; KV = h_{z_{<t}}^{(m-1)};\; \theta\big) \quad (3)$$

As shown in Figure 2b, this two-stream design enforces a clear separation between context encoding and prediction, effectively mitigating the risk of information leakage.
In summary, the content stream integrates both contextual dependencies and the intrinsic content of the current position, while the query stream encodes positional and contextual information, excluding the content of the target token itself. During inference, predictions are generated from the query stream, enabling the model to leverage contextual cues for accurate sequence labeling while preserving the integrity of the predictive process. This mechanism ensures that the prediction is not influenced by the token’s own representation, thus avoiding information leakage and improving sequence labeling reliability. The overall XLNet-CRF architecture is depicted in
Figure 3.
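To tie the pieces together, the sketch below outlines one plausible implementation of the XLNet-CRF tagger, assuming the Hugging Face transformers implementation of XLNet and the pytorch-crf package; the checkpoint name, dropout rate, and layer names are illustrative rather than the paper's exact configuration.

```python
import torch.nn as nn
from transformers import XLNetModel   # Hugging Face implementation (assumed)
from torchcrf import CRF              # pip install pytorch-crf (assumed)

class XLNetCRF(nn.Module):
    """Sketch of the tagger: XLNet encoder -> linear emissions -> CRF decoder."""

    def __init__(self, num_tags, checkpoint="xlnet-base-cased", dropout=0.1):
        super().__init__()
        self.encoder = XLNetModel.from_pretrained(checkpoint)
        self.dropout = nn.Dropout(dropout)
        self.emit = nn.Linear(self.encoder.config.d_model, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        scores = self.emit(self.dropout(hidden))
        mask = attention_mask.bool()   # assumes right-padded batches
        if labels is not None:
            # Training: negative log-likelihood of the gold label sequence
            return -self.crf(scores, labels, mask=mask, reduction="mean")
        # Inference: globally optimal (Viterbi) label paths
        return self.crf.decode(scores, mask=mask)
```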
4. Experiments and Results
4.1. Dataset Description
To evaluate the performance of the XLNet-CRF model for named entity recognition in cyber threat intelligence, this paper utilizes three publicly available datasets for experimentation. These datasets encompass a diverse array of threat intelligence content, including various types of named entities such as domain names, malware names, and attack techniques, thereby providing a multi-dimensional reference for model evaluation. To ensure the scientific validity and robustness of the experimental results, each dataset is partitioned into training (70%), validation (10%), and testing (20%) sets. Furthermore, all data adhere to a unified BIO annotation format, which facilitates the model's accurate identification of entity boundaries and categories during both training and evaluation, thereby enhancing the reliability and comparability of the results. A sketch of this preprocessing appears below; the subsequent sections introduce the three datasets.
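The following sketch shows one way to read a CoNLL-style BIO file and apply the 70/10/20 partition; the file name and shuffling seed are hypothetical, and the exact preprocessing used in our experiments is available in the released repository.

```python
import random

def read_bio(path):
    """Parse a CoNLL-style BIO file: one 'token tag' pair per line,
    with blank lines separating sentences."""
    sentences, toks, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if toks:
                    sentences.append((toks, tags))
                    toks, tags = [], []
                continue
            parts = line.split()
            toks.append(parts[0])
            tags.append(parts[-1])
    if toks:
        sentences.append((toks, tags))
    return sentences

sentences = read_bio("dnrti.bio")        # hypothetical file name
random.Random(42).shuffle(sentences)     # illustrative seed
n = len(sentences)
train = sentences[: int(0.7 * n)]              # 70%
valid = sentences[int(0.7 * n): int(0.8 * n)]  # 10%
test  = sentences[int(0.8 * n):]               # 20%
```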
4.1.1. DNRTI Dataset
The DNRTI dataset [26] encompasses 13 distinct categories of information annotations, including hacker organization, attack, sample file, security team, tool, time, purpose, area, industry, organization, way, loophole, and features. The corresponding labels of these 13 categories in the dataset are HackOrg, OffAct, SamFile, SecTeam, Tool, Time, Purp, Area, Idus, Org, Way, Exp, and Features, respectively. The detailed distribution of the labels is shown in
Table 1.
4.1.2. CTI-Reports Dataset
The CTI-Reports dataset [27], released by nlpai-lab on GitHub, consists of 310,406 records of threat intelligence reports. These records cover four main types of indicators: malware, IP addresses, URLs, and hash values, annotated with ten label types in total: five malware subcategories (malware.backdoor, malware.infosteal, malware.ransom, malware.unknown, and malware.drop), three URL subcategories (url.normal, url.unknown, and url.cncsvr), ip.unknown, and hash. The quantities and proportions of each label are illustrated in
Table 2.
4.1.3. MalwareTextDB Dataset
MalwareTextDB [28] is the inaugural dataset specifically designed for the text annotation of malicious software reports. Its primary purpose is to provide a structured and fully annotated corpus in the field of cybersecurity, facilitating the automatic extraction of behaviors, attributes, and related information of malicious software through natural language processing technology. This dataset encompasses three types of entities: Action, Modifier, and Entity. The Entity label annotates noun phrases related to malware actions, including the initiator (Subject) or recipient (Object) of an action, as well as phrases providing additional context. The Action label refers to an event, such as "registers", "provides", and "is written". The Modifier label refers to tokens that link to other phrases elaborating on the Action, such as "as" and "to". The quantities and proportions of each label are shown in
Table 3.
4.2. Experimental Setup and Evaluation Metrics
In this study, we utilized an NVIDIA RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA) with 24 GB of memory, in conjunction with a 14-vCPU Intel Xeon Platinum 8362 processor (Intel Corporation, Santa Clara, CA, USA) operating at 2.80 GHz. The system was configured with Ubuntu 22.04.3, and the primary development framework consisted of Python 3.8 and PyTorch 2.0.0, while CUDA 12.1 was employed to fully leverage GPU acceleration. The training configurations for XLNet-CRF across different datasets are summarized in
Table 4. Additionally, the proposed XLNet-CRF model is evaluated using widely adopted metrics in the field of NER: precision, recall, and F1-score.
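Entity-level precision, recall, and F1-score can be computed with the seqeval library, a common choice for BIO-tagged NER evaluation (the paper does not name its scorer); the tag sequences below are illustrative.

```python
from seqeval.metrics import precision_score, recall_score, f1_score

# Entity-level evaluation over BIO sequences: a prediction counts as correct
# only if both the span boundaries and the entity type match exactly.
y_true = [["B-HackOrg", "I-HackOrg", "O", "B-Area", "I-Area"]]
y_pred = [["B-HackOrg", "I-HackOrg", "O", "B-Area", "O"]]

print(precision_score(y_true, y_pred))  # 0.5: one of two predicted spans is correct
print(recall_score(y_true, y_pred))     # 0.5: one of two gold spans is recovered
print(f1_score(y_true, y_pred))         # 0.5
```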
4.3. Experimental Results and Analysis
Table 5 summarizes the evaluation results of five models: BERT-CRF, secBERT-CRF, BERT-BiLSTM-CRF, secBERT-BiLSTM-CRF, and the proposed XLNet-CRF.
4.3.1. Result on CTI-Reports
As shown in
Table 5, the CTI-Reports dataset reveals notable variations in model performance across precision, recall, and F1-score. The BERT-CRF model achieves an F1-score of 77.29%, with exceptionally high precision (98.37%) but relatively low recall (74.10%). This suggests that while BERT-CRF is effective in identifying named entities with high confidence, it fails to detect a substantial portion of relevant instances, thereby limiting its recall.
The secBERT-CRF model, a security-domain-enhanced variant, performs somewhat worse overall, yielding an F1-score of 72.52%, primarily because its recall falls to 66.69% despite comparable precision (97.42%). The incorporation of a BiLSTM layer leads to mixed results; the BERT-BiLSTM-CRF model achieves an F1-score of 74.39%, with a recall of 66.27%, indicating marginal improvement in sequence modeling but limited gains in recall. Notably, the secBERT-BiLSTM-CRF model records a recall of 80.31%, the highest among the baselines, but its overall F1-score drops to 68.05% (precision 97.31%). This pattern implies that while BiLSTM enhances the model's ability to capture sequential dependencies, the gains in recall do not translate into consistent overall performance. As shown in
Figure 4, this effect can be observed more intuitively through the comparative analysis of prediction outcomes.
In contrast, the proposed XLNet-CRF model consistently outperforms these baselines by leveraging superior contextual understanding, positional flexibility, and robust subtoken integration. For instance, in the sentence “Some victims of Operation Manul have expressed a desire to preserve their anonymity, which we respect”, BERT-CRF fails to identify “Operation Manul” as a malware.ransom entity, while XLNet-CRF correctly labels the two-token span. This can be attributed to XLNet’s permutation-based pretraining, which enables each token to attend to all surrounding positions, thereby capturing cohesive representations of multi-word expressions. Moreover, its use of relative positional encoding allows for dynamic modeling of inter-token relationships, even when tokens are non-contiguous—something absolute positional encodings often fail to achieve.
Although XLNet-CRF demonstrates robust aggregate performance, its effectiveness may be uneven across entity types due to several underlying factors.
First, the CTI-Reports dataset is heavily skewed toward dominant categories such as hash and malware.backdoor, while minority classes like malware.ransom and url.normal are sparsely represented. This imbalance causes the model to learn more stable decision boundaries for frequent classes, while underperforming on rare categories.
Second, entity types differ in structural and contextual cues. Structured identifiers such as IPv4 addresses (e.g., 192.168.100.45) consist of digits and punctuation with minimal semantic context, making them difficult to identify. In contrast, malware entities often appear alongside technical terms (e.g., “C2 node”, “proxy infrastructure”), which offer stronger contextual signals.
Third, the tokenization process may fragment certain structured entities into multiple subtokens. For example, a tokenizer might split 192.168.100.45 into tokens such as “192”, “.”, “168”, etc., dispersing its semantic content and potentially hindering coherent representation and recognition.
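This fragmentation effect can be checked directly with the XLNet tokenizer from Hugging Face transformers; the exact subword pieces depend on the SentencePiece vocabulary, so the output shown in the comment is indicative only.

```python
from transformers import XLNetTokenizer

tok = XLNetTokenizer.from_pretrained("xlnet-base-cased")

# A structured indicator is shattered into many subtokens, dispersing the
# signal that the tagger must reassemble into a single entity span.
print(tok.tokenize("192.168.100.45"))
# indicative output: ['▁19', '2', '.', '168', '.', '100', '.', '45']
```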
These factors may contribute to observed performance differences across entity types, even when aggregate metrics remain high. In particular, the model may perform more consistently on frequent and semantically rich entities, while exhibiting limitations when dealing with infrequent, structurally complex, or token-fragmented instances.
4.3.2. Result on MalwareTextDB
As shown in
Table 5, model performance on the MalwareTextDB dataset varies significantly, with XLNet-CRF achieving the most balanced results—precision of 85.33%, recall of 92.24%, and an F1-score of 88.65%—demonstrating its effectiveness in capturing long-range and bidirectional context. In contrast, BERT-CRF, despite attaining high precision (87.76%), exhibits extremely low recall (47.39%), resulting in a substantially reduced F1-score (58.57%). This disparity highlights the limitations of masked-language modeling in scenarios where complex entity spans must be captured under limited annotation coverage.
A distinctive feature of MalwareTextDB is its small label set—comprising only three entity types (Entity, Action, and Modifier)—paired with dense and overlapping annotations. For example, in the annotated span “... they are (B-Action) infected (I-Action) with (B-Modifier) a rare APT Trojan (B/I/I/I-Entity) posing (B-Action) as (B-Modifier) any one of several major software releases (B/I/I/I/I/I-Entity)”, the model must accurately capture long, nested, and semantically entangled spans. XLNet-CRF’s permutation-based pretraining enables it to attend to bidirectional context and encode multi-token dependencies, thereby supporting coherent span boundary detection for complex constructions such as modifier–action–entity triplets. In contrast, BERT-based models—particularly those without domain-specific adaptation—tend to produce fragmented or incomplete predictions due to their limited global contextual reasoning.
Integrating BiLSTM into BERT (i.e., BERT-BiLSTM-CRF) maintains comparable precision (85.59%) but further reduces recall to 38.92%, as shown in
Figure 5. This decline may result from overfitting on limited training samples or interference between the sequential encoder (BiLSTM) and contextual embeddings. The domain-adapted secBERT-CRF slightly improves recall (57.16%) and F1-score (62.53%) over BERT-CRF, reflecting the advantage of exposure to security-specific terminology. However, secBERT-BiLSTM-CRF exhibits inconsistent performance; while its recall peaks at 69.72%, the F1-score drops to 47.07%, suggesting that the combination of domain adaptation and architectural complexity may hinder overall model stability.
4.3.3. Result on DNRTI
As shown in
Table 5, XLNet-CRF underperforms on the DNRTI dataset (precision 82.86%, recall 81.92%, F1 82.39%) relative to its BERT-based counterparts. In particular, BERT-CRF achieves the highest F1-score (90.02%), and secBERT-CRF strikes a strong balance between precision (96.00%) and recall (88.80%), reflecting the benefits of domain-specific pretraining.
A closer examination suggests that XLNet-CRF's permutation-based objective, although effective for capturing long-range dependencies, is less suited to DNRTI's concise, high-density entity types, such as email-style identifiers and short multi-word expressions. For instance, in the sentence "The admin@338 started targeting Hong Kong media companies, probably in response to political and economic challenges in Hong Kong and China", BERT-CRF correctly labels "admin@338" (HackOrg), "Hong Kong" (Area), and "media companies" (Org), whereas XLNet-CRF tends to fragment or misclassify these spans. This discrepancy can be attributed to several factors: BERT's WordPiece tokenizer, especially when domain-adapted, handles email-like tokens more effectively, and its masked language modeling objective promotes contiguous span prediction. In contrast, XLNet's permutation-based modeling may assign less weight to short, information-dense sequences, and its general-purpose SentencePiece tokenizer can split coherent tokens such as "admin@338" into several subword pieces, diluting semantic cues and hindering accurate CRF decoding. These findings suggest that further adaptation may be necessary for XLNet-CRF to handle short, domain-specific entities effectively.
Furthermore, the integration of BiLSTM layers into BERT-based architectures offers only marginal improvements on DNRTI, as shown in
Figure 6, suggesting that sequential modeling provides limited additional value for this dataset’s compact and domain-specific entity structures.
5. Discussion
The experimental results across the three cybersecurity datasets reveal distinct performance patterns shaped by the interplay between dataset characteristics and model architecture. XLNet-CRF consistently achieves strong overall results on the CTI-Reports and MalwareTextDB datasets, where its permutation-based pretraining proves effective in modeling long-range dependencies and capturing complex entity structures. However, its performance declines on the DNRTI dataset, underscoring that model effectiveness is not universal and depends heavily on factors such as the pretraining objective, tokenization behavior, and entity type characteristics.
On CTI-Reports, XLNet-CRF attains the most balanced performance, surpassing BERT-based baselines in both recall and F1-score. This advantage stems from three core strengths: (1) its permutation-based training objective facilitates rich contextual representation, (2) its relative positional encoding supports flexible cross-sentence modeling, and (3) its robustness in reconstructing subword-tokenized entities enhances span-level coherence. For example, XLNet-CRF successfully identifies multi-token entities like "Operation Manul", which BERT-CRF fails to capture, owing to XLNet's capacity to model noncontiguous dependencies and integrate dispersed semantic cues. Nevertheless, a closer inspection reveals variability across entity categories. Lexically consistent types (e.g., hash) are recognized accurately, while rare or morphologically complex entities (e.g., IPv4 addresses, url.normal) are more error-prone. These discrepancies are likely influenced by class imbalance, subtoken fragmentation, and the limited contextual cues associated with structurally sparse entities.
In MalwareTextDB, which contains only three entity types but features dense, nested, and overlapping annotations, XLNet-CRF again delivers superior results. Its bidirectional context modeling proves particularly effective for extracting structurally interwoven patterns, such as modifier–action–entity triplets. BERT-based models, while exhibiting high precision, suffer from low recall, reflecting a tendency to miss multi-token spans under sparse supervision. Notably, incorporating BiLSTM layers into BERT architectures results in unstable performance—potentially due to overfitting on limited samples or interference between sequential and contextual encoders—suggesting that added complexity does not necessarily translate into better generalization, especially when entity boundaries are fluid or hierarchical.
Conversely, this advantage does not carry over to the DNRTI dataset. Here, XLNet-CRF underperforms relative to both general and domain-adapted BERT variants. The dataset’s compact and lexically specific entity types—such as email-like identifiers (“admin@338”) and short multi-word phrases (“media companies”)—are better accommodated by BERT’s masked language modeling and WordPiece tokenizer. Domain-adapted variants like secBERT-CRF benefit further from exposure to security-related terminology. XLNet-CRF, by contrast, struggles with fragmented subtoken representations and places less emphasis on contiguous spans, resulting in diminished span-level consistency. This highlights a potential limitation of permutation-based modeling in domains characterized by short, lexically rigid, and structurally compact entities.
Collectively, these findings suggest that no single architecture universally outperforms others across all cybersecurity NER tasks. XLNet-CRF is better suited to scenarios involving long, nested, or semantically diffuse spans, while BERT-based models—especially those augmented with domain adaptation—excel in recognizing short, dense, and lexically distinctive entities. Future research may benefit from hybrid approaches that integrate the strengths of both paradigms, such as combining permutation-based attention with domain-specific tokenization or leveraging adapter modules for adaptive fine-tuning in varied CTI environments.
6. Conclusions and Future Work
This study presents an NER-based framework for CTI analysis, aimed at improving the efficiency of entity extraction from unstructured threat data. By incorporating a two-stream self-attention mechanism, the proposed XLNet-CRF model effectively captures long-range dependencies and contextual semantics. The integration of a CRF layer further enhances label consistency, enabling accurate recognition of key security entities such as vulnerabilities, attack techniques, and affected assets. Experimental results on benchmark datasets, including CTI-Reports and MalwareTextDB, demonstrate the model's robustness in balancing contextual understanding with structured entity recognition.
Nonetheless, performance degradation is observed on the DNRTI dataset, which is characterized by domain-specific terminology and compact entity spans. This limitation highlights the challenge of adapting pretrained models to specialized cybersecurity domains. While XLNet’s permutation-based pretraining captures bidirectional dependencies effectively, it may require further adaptation to accommodate domain-specific linguistic structures and short-text contexts.
In future work, we will first focus on enhancing the domain adaptability of XLNet through targeted fine-tuning on cybersecurity-specific corpora. This will help mitigate the performance limitations observed on short or terminology-heavy CTI texts and further improve entity recognition accuracy in low-resource subcategories.
Building upon improved NER outputs, we then plan to integrate the extracted entities into a structured cyber threat knowledge graph to support downstream reasoning and analytical tasks. By linking entities such as malware names, IP addresses, and vulnerabilities into a graph-based representation, we aim to explicitly model relationships among threat actors, tactics, and indicators. This integration will support cross-document entity linking, enrich sparse threat contexts, and enable advanced CTI applications such as threat correlation, actor profiling, and predictive defense strategy development.