4.1. Dataset
Our proposed method, along with several competitive baselines, is evaluated using the CTISum dataset [9], a benchmark specifically constructed to support the research, development, and evaluation of CTI summarization techniques. Covering a broad spectrum of cyber threats, attack behaviors, and security incidents from 2016 to 2023 across various regions and threat categories, CTISum serves as a representative and high-quality testbed for the cybersecurity domain.
The dataset aggregates intelligence from diverse sources, including open-source repositories (e.g., APTnotes), threat encyclopedias (e.g., TrendMicro), and vendor-specific threat reports (e.g., Symantec). This multi-source integration yields a semantically rich and contextually diverse corpus that captures the evolving landscape of cyber threats, offering valuable training and evaluation resources for CTI summarization models.
CTISum contains 1345 documents with an average input length of 2865 characters and an average summary length of 200 characters, yielding a compression ratio of approximately 14.32, which underscores the difficulty of generating concise and high-quality summaries from lengthy CTI texts. The dataset supports two summarization tasks: cyber threat intelligence summarization (CTIS) and attack process summarization (APS), with APS defined as a subset consisting of 1014 annotated samples.
The two tasks differ in summarization focus and complexity. The CTIS task focuses on summarizing high-level threat intelligence, such as actor names, attack types, targeted sectors, and overall threat context. This task emphasizes coverage and readability of general threat narratives. In contrast, the APS task requires the extraction of fine-grained procedural information, including tactics, techniques, and indicators of compromise (IoCs). It demands greater factual precision and domain-specific understanding. As a result, APS poses greater challenges in both informativeness and faithfulness.
4.4. Implementation Details
We implement the proposed knowledge-guided CTI summarization framework by combining standard NLP tools and pretrained language models.
In the unsupervised term extraction phase, we adopt KeyBERT with the all-MiniLM-L6-v2 sentence embedding model. Candidate terms are ranked based on their cosine similarity to the overall document embedding, which is computed by averaging sentence embeddings of the full text. The top terms are selected as the initial term set.
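A minimal sketch of this extraction step is shown below, assuming the standard KeyBERT API; the n-gram range and top_n value are illustrative rather than the exact settings used in our experiments, and KeyBERT embeds the full document directly rather than averaging sentence embeddings.

```python
# Unsupervised term extraction with KeyBERT (illustrative sketch).
from keybert import KeyBERT

# all-MiniLM-L6-v2 is loaded internally as the sentence-embedding backbone.
kw_model = KeyBERT(model="all-MiniLM-L6-v2")

def extract_initial_terms(cti_document: str, top_n: int = 20):
    """Rank candidate terms by cosine similarity to the document embedding."""
    keywords = kw_model.extract_keywords(
        cti_document,
        keyphrase_ngram_range=(1, 3),  # allow multi-word terms such as "credential dumping"
        stop_words="english",
        top_n=top_n,
    )
    # KeyBERT returns (term, similarity) pairs sorted by descending similarity.
    return [term for term, _ in keywords]
```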
We apply a multi-step term refinement process to improve the quality and semantic coherence of the extracted terms. Specifically, we remove function words (e.g., pronouns, conjunctions, interrogatives), normalize characters (e.g., stripping punctuation and non-standard symbols), eliminate semantically redundant terms based on Sentence-BERT embeddings (cosine similarity threshold = 0.85), and filter out low-information or overly generic terms (e.g., “document”, “system”, “file”) using a curated blacklist.
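The refinement steps can be sketched as follows; the function-word list and blacklist shown here are illustrative placeholders for the curated resources described above.

```python
import re
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("all-MiniLM-L6-v2")

# Illustrative (non-exhaustive) word lists; the actual curated resources are larger.
FUNCTION_WORDS = {"it", "they", "and", "or", "but", "what", "which", "who"}
BLACKLIST = {"document", "system", "file"}

def refine_terms(terms, sim_threshold: float = 0.85):
    # 1) Normalize characters and drop function words / blacklisted generic terms.
    cleaned = []
    for t in terms:
        t = re.sub(r"[^\w\s\-]", "", t).strip().lower()  # strip punctuation and non-standard symbols
        if t and t not in FUNCTION_WORDS and t not in BLACKLIST:
            cleaned.append(t)

    # 2) Remove semantically redundant terms via Sentence-BERT cosine similarity.
    kept, kept_embs = [], []
    for t in cleaned:
        emb = sbert.encode(t, convert_to_tensor=True)
        if all(util.cos_sim(emb, e).item() < sim_threshold for e in kept_embs):
            kept.append(t)
            kept_embs.append(emb)
    return kept

print(refine_terms(["Phishing!", "phishing attack", "document", "C2 server"]))
```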
In the supervised term generation stage, we treat the refined term set as pseudo labels and fine-tune a T5-small sequence-to-sequence model. The model is trained with a maximum input length of 512 tokens and an output length of 64 tokens. It learns to generate task-relevant domain terminology directly from raw CTI texts.
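A minimal fine-tuning sketch using the Hugging Face transformers library is given below; the dataset column names (cti_text, refined_terms) and the training hyperparameters are assumptions for illustration, not the exact configuration used.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Assumed training data layout: raw CTI texts paired with refined term sets (pseudo labels).
train_dataset = Dataset.from_dict({
    "cti_text": ["<CTI report text>"],
    "refined_terms": [["APT29", "phishing", "credential dumping"]],
})

def preprocess(example):
    # Input: raw CTI text (up to 512 tokens); target: terms joined into one sequence (up to 64 tokens).
    model_inputs = tokenizer(example["cti_text"], max_length=512, truncation=True)
    labels = tokenizer(text_target="; ".join(example["refined_terms"]), max_length=64, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = train_dataset.map(preprocess, remove_columns=train_dataset.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="t5-small-term-generator",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=3,
                                  learning_rate=3e-4),  # hyperparameters are illustrative
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```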
The final term set used for summarization is generated by the fine-tuned T5-small model and serves as the input prefix in the knowledge-injected summarization paradigm. This structured terminology acts as lightweight guidance to help the model attend to critical threat semantics, thereby enhancing both factual consistency and domain relevance in the generated summaries.
During summarization, we follow a unified input construction paradigm in which the term set is prepended to the original CTI document. Terms are separated by semicolons and terminated with a period (e.g., “malware; phishing; C2 server.” followed by the CTI document) to explicitly direct the model’s attention to key concepts.
Example: Given a refined term set such as:
[APT29, phishing, credential dumping, C2 server, PowerShell, lateral movement],
the knowledge-injected input is formatted as:
APT29; phishing; credential dumping; C2 server; PowerShell; lateral movement. <CTI document text>
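A trivial helper reproducing this input format might look as follows (the function name is illustrative):

```python
def build_knowledge_injected_input(terms, cti_document: str) -> str:
    """Prepend the term set (semicolon-separated, period-terminated) to the CTI document."""
    return "; ".join(terms) + ". " + cti_document

# Example from the text:
terms = ["APT29", "phishing", "credential dumping", "C2 server", "PowerShell", "lateral movement"]
print(build_knowledge_injected_input(terms, "<CTI document text>"))
# -> "APT29; phishing; credential dumping; C2 server; PowerShell; lateral movement. <CTI document text>"
```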
This format guides the model to attend to salient threat entities and tactics during encoding. We conduct experiments under both zero-shot and supervised settings. In the zero-shot setup, we directly prompt instruction-tuned LLMs such as GPT-3.5, Vicuna-13B, and LLaMA3-8B. In the supervised setup, we fine-tune encoder–decoder architectures including T5-base, LED-base, and BART-base on the CTISum benchmark, with a maximum input length of 1024 tokens and the standard sequence-level cross-entropy loss. All models retain the terminology-injected input structure during training and inference, allowing us to systematically evaluate the generalization of the proposed strategy across paradigms and architectures.
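For the zero-shot setting, a minimal prompting sketch with the OpenAI client is shown below; the system prompt wording is an assumption for illustration, while the terminology-injected input mirrors the format described above.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def zero_shot_summary(terms, cti_document: str) -> str:
    # Terminology-injected input, identical in structure to the supervised setting.
    injected_input = "; ".join(terms) + ". " + cti_document
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You are a cyber threat intelligence analyst. "
                        "Summarize the report, paying attention to the listed key terms."},
            {"role": "user", "content": injected_input},
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content
```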
4.6. Experimental Results
This experiment evaluates the performance of different models and input strategies on the two CTI summarization tasks, CTIS and APS. The results, shown in Table 1 and Table 2, indicate the following.
(1) Effectiveness of Terminology: Across all models and tasks, the input strategies DirectLM, AutoTerm+LM, and GenTerm+LM demonstrate a consistent upward trend in both BERTScore and ROUGE metrics. This confirms that incorporating and optimizing domain-specific terminology significantly improves summary quality, validating the effectiveness of the proposed hybrid domain term construction pipeline.
(2) Model Performance Comparison: Under the zero-shot setting, GPT-3.5 achieves the best performance on both the CTIS and APS tasks, outperforming Vicuna-13B and LLaMA3-8B in both BERTScore and ROUGE metrics. Under the supervised setting, BART-base performs best among all models, especially when combined with the GenTerm+LM-SFT strategy, achieving ROUGE-1/2/L scores of 57.37/24.20/36.06 on CTIS and 57.87/24.07/39.42 on APS, the highest among all settings. Overall, supervised fine-tuned models significantly outperform zero-shot models, indicating the advantage of task-specific training. It is also worth noting that GPT-3.5 consistently achieves the highest BERTScore, reflecting its strong semantic understanding.
(3) Impact of Input Strategy: For all models, transitioning from DirectLM to GenTerm+LM leads to steady improvements in BERTScore and all ROUGE metrics. This demonstrates that optimized terminology not only improves lexical matching but also enhances semantic coherence. Notably, BART-base combined with GenTerm+LM-SFT yields the best overall performance, highlighting its strong generalization and robustness in real-world applications.
(4) Task Comparison: The CTIS task consistently yields higher scores than the APS task, suggesting that CTIS summaries are relatively easier to generate with high quality; under the same model and strategy, CTIS outperforms APS in both ROUGE and BERTScore. Nevertheless, both tasks exhibit consistent performance gains from the enhanced input strategies, indicating the generalizability and effectiveness of the proposed terminology-oriented approach across different summarization scenarios.
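For reference, the ROUGE and BERTScore values discussed above can be computed with the rouge-score and bert-score packages; the snippet below is a minimal sketch rather than the exact evaluation configuration used in our experiments.

```python
from rouge_score import rouge_scorer
from bert_score import score as bert_score

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def evaluate(prediction: str, reference: str):
    # rouge_scorer returns a dict of Score tuples (precision, recall, fmeasure) per metric.
    rouge = scorer.score(reference, prediction)
    # bert_score returns precision, recall, and F1 tensors for the batch of candidates.
    P, R, F1 = bert_score([prediction], [reference], lang="en")
    return {
        "rouge1": rouge["rouge1"].fmeasure,
        "rouge2": rouge["rouge2"].fmeasure,
        "rougeL": rouge["rougeL"].fmeasure,
        "bertscore_f1": F1.mean().item(),
    }
```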
To further validate the reliability of our findings, we report the mean and standard deviation of evaluation scores over three independent runs for the BART-base model under all major supervised input strategies (see Table 3 and Table 4). The low standard deviations indicate stable performance across runs. In addition, we conduct paired t-tests between the baseline DirectLM and our GenTerm+LM strategy. The improvements in both ROUGE and BERTScore metrics are statistically significant, with the p-values for BERTScore, ROUGE-1, ROUGE-2, and ROUGE-L falling below the significance threshold on both the CTIS and APS tasks, reinforcing the robustness and consistency of the observed performance gains. These results confirm that the proposed term-oriented input strategy consistently and significantly outperforms the baseline approach.
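The paired t-tests can be reproduced with scipy; in the sketch below, the per-document score arrays for the two strategies are assumed inputs.

```python
from scipy.stats import ttest_rel

# Assumed inputs: per-document metric values (e.g., ROUGE-1 F1) over the same test set
# under the GenTerm+LM and DirectLM input strategies.
genterm_scores = [0.58, 0.61, 0.55, 0.60]
directlm_scores = [0.46, 0.49, 0.44, 0.47]

t_stat, p_value = ttest_rel(genterm_scores, directlm_scores)
print(f"paired t-test: t = {t_stat:.3f}, p = {p_value:.4g}")
```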
4.7. Ablation Study
Compare with term-only input to assess standalone informativeness. To further examine the standalone contribution of domain-specific terms, we introduce an additional baseline, TermOnly, in which the summarization model receives only the generated domain terms as input, without the original CTI document. This setting isolates the informativeness of the selected terminology and evaluates its ability to support summary generation in the absence of context. As shown in Table 5, TermOnly-SFT yields significantly lower performance than the full-text input strategies. On the CTIS task, BART-base with TermOnly-SFT achieves ROUGE-1/2/L scores of 27.03/10.12/15.65, while GenTerm+LM-SFT yields 57.37/24.20/36.06. On the APS task, TermOnly-SFT obtains 29.56/11.45/17.24, compared to 57.87/24.07/39.42 for GenTerm+LM-SFT. Despite the gap, the summaries produced under the TermOnly setting still reflect key concepts and threat-related entities, suggesting that the generated terms retain substantial semantic information.
These results support the conclusion that domain terms—when carefully selected and refined—are semantically informative and can partially guide summarization even without full document input. They further emphasize the central role of terminology in enhancing both lexical and semantic relevance.
Remove term input to verify the contribution of domain terminology. To assess the contribution of domain-specific terminology to summarization performance, we remove the input strategies enhanced through our hybrid domain term construction pipeline and compare the results of DirectLM (without terminology) against AutoTerm+LM and GenTerm+LM (with terminology). This comparison is conducted under both zero-shot and supervised settings across the CTIS and APS tasks. As shown in Table 1 and Table 2, removing domain terminology (i.e., using DirectLM as the input strategy) leads to a noticeable drop in all evaluation metrics (BERTScore, ROUGE-1/2/L) for every model. In the CTIS task, for instance, BART-base with DirectLM-SFT achieves ROUGE-1/2/L scores of 45.76/16.88/29.03, while the same model with GenTerm+LM-SFT reaches 57.37/24.20/36.06, an absolute improvement of 11.61 (ROUGE-1), 7.32 (ROUGE-2), and 7.03 (ROUGE-L). Similarly, on the APS task, BART-base with GenTerm+LM-SFT outperforms DirectLM-SFT by 17.62 (ROUGE-1), 12.42 (ROUGE-2), and 14.36 (ROUGE-L), demonstrating the critical role of terminology in more complex summarization settings.
Consistent improvements are also observed in zero-shot models. For example, GPT-3.5 improves from ROUGE-1/2/L of 42.38/12.57/22.23 (DirectLM-ZS) to 43.56/13.64/23.20 (GenTerm+LM-ZS) on CTIS, indicating that even LLMs benefit from enriched terminology inputs. These findings clearly validate that domain terminology significantly contributes to enhancing both semantic relevance and lexical overlap in the generated summaries. Moreover, the effect is particularly pronounced under supervised settings, where the model can better exploit structured terminological inputs during training.
Compare unrefined and refined/generated term sets to assess term quality impact. To further evaluate the influence of domain terminology quality on summarization performance, we compare AutoTerm (automatically extracted but unrefined term sets) and GenTerm (term sets refined or generated via our hybrid construction pipeline). As shown in Table 1 and Table 2, GenTerm consistently outperforms AutoTerm across all models and evaluation metrics. In the CTIS task, BART-base improves from 51.80/20.54/32.52 (AutoTerm) to 57.37/24.20/36.06 (GenTerm), with absolute gains of +5.57 (ROUGE-1), +3.66 (ROUGE-2), and +3.54 (ROUGE-L). On APS, the improvement is more pronounced, rising from 49.06/17.86/32.24 to 57.87/24.07/39.42, with gains of +8.81, +6.21, and +7.18, respectively.
Zero-shot models also benefit from term refinement. For instance, GPT-3.5 in the CTIS task improves from ROUGE-1/2/L scores of 42.95/13.05/22.78 (AutoTerm) to 43.56/13.64/23.20 (GenTerm), indicating the positive impact of higher-quality terms even without fine-tuning. These results demonstrate that not only the presence of domain terms, but also their quality—in terms of relevance, coverage, and expression—plays a vital role in enhancing summarization performance. The refined GenTerm sets, which incorporate semantic filtering and contextual generation, help guide the model toward more accurate and coherent summaries, especially under complex summarization tasks like APS.
Vary term set size to analyze sensitivity to the number of injected terms. To investigate how the number of injected domain-specific terms affects summarization performance, we vary the size of the term set used in the GenTerm+LM-SFT strategy. Specifically, we evaluate models using the top-k terms for increasing values of k (from k = 5 to k = 20) and report the results on both the CTIS and APS tasks. As shown in Figure 2, increasing the number of injected terms consistently improves summarization performance across all ROUGE metrics. In the CTIS task, ROUGE-1 improves from 50.10 (k = 5) to 57.37 (k = 20), an absolute gain of +7.27; ROUGE-2 increases from 18.10 to 24.20 (+6.10), and ROUGE-L from 30.20 to 36.06 (+5.86). On the APS task, ROUGE-1 improves from 51.40 (k = 5) to 57.87 (k = 20), a gain of +6.47; ROUGE-2 rises from 18.90 to 24.07 (+5.17), and ROUGE-L from 33.20 to 39.42 (+6.22). These results indicate that injecting more high-quality domain terms provides richer semantic guidance and improves both lexical overlap and content relevance. However, the performance gains gradually diminish as k increases, suggesting that while more terms provide more information, the marginal benefit decreases, likely due to redundancy or noise in the lower-ranked terms. This observation highlights the importance of balancing quantity and quality in terminology injection: selecting an appropriate number of salient terms is crucial to achieving optimal summarization performance.
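The sensitivity analysis amounts to truncating the generated term set before injection, as sketched below; the term list and document are placeholders, and only the endpoint k values reported above are listed explicitly.

```python
# Assumed inputs for illustration (ranked terms, highest-scoring first).
generated_terms = ["APT29", "phishing", "credential dumping", "C2 server",
                   "PowerShell", "lateral movement"]
cti_document = "<CTI document text>"

K_VALUES = [5, 20]  # endpoints reported in the text; the full grid follows Figure 2

for k in K_VALUES:
    top_k_terms = generated_terms[:k]  # keep the k highest-ranked terms
    injected_input = "; ".join(top_k_terms) + ". " + cti_document
    print(f"k={k}: {injected_input[:80]}...")
    # ... summarize `injected_input` with the GenTerm+LM-SFT model and score the output ...
```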
Explore the impact of different term insertion positions within the input. To analyze the influence of term placement on summarization performance, we compare three insertion strategies: placing the domain terms at the beginning (Prefix: Term+CTI), at the end (Postfix: CTI+Term), and inline before their natural occurrences (Inline-Before) within the CTI text. We evaluate all strategies under the GenTerm+LM-SFT setting using BART-base on the CTIS and APS tasks. As shown in Table 6, inserting terms at the beginning of the input consistently yields the best performance across all evaluation metrics for both tasks. On the CTIS task, the prefix strategy achieves a ROUGE-1 of 57.37, ROUGE-2 of 24.20, and ROUGE-L of 36.06, outperforming the postfix and inline strategies. Inline-Before achieves intermediate performance, with a ROUGE-1 of 56.80, ROUGE-2 of 23.65, and ROUGE-L of 35.45, demonstrating that it is a viable alternative.
In the APS task, prefix input achieves ROUGE-1 of 57.87, ROUGE-2 of 24.07, and ROUGE-L of 39.42, while Inline-Before yields 56.35, 23.25, and 37.80, respectively—both outperforming the postfix strategy. BERTScore shows a consistent trend across both tasks. These results suggest that prefix placement guides the model to attend to key concepts early in the decoding process, whereas Inline-Before, though slightly less effective, provides localized cues that may enhance interpretability and alignment.
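A sketch of the three insertion strategies is given below; the exact inline marking used in Inline-Before is an assumption, shown here as inserting a bracketed cue immediately before the term's first occurrence.

```python
def build_input(terms, cti_document: str, strategy: str = "prefix") -> str:
    """Build the model input under the three insertion strategies compared in Table 6."""
    term_block = "; ".join(terms) + "."
    if strategy == "prefix":         # Prefix: Term + CTI
        return term_block + " " + cti_document
    if strategy == "postfix":        # Postfix: CTI + Term
        return cti_document + " " + term_block
    if strategy == "inline_before":  # cue inserted right before each term's first natural occurrence
        text = cti_document
        for term in terms:
            idx = text.lower().find(term.lower())
            if idx != -1:
                text = text[:idx] + "[" + term + "] " + text[idx:]
        return text
    raise ValueError(f"unknown strategy: {strategy}")

# Example usage with an illustrative term set:
terms = ["APT29", "phishing", "C2 server"]
print(build_input(terms, "The APT29 group used a C2 server for phishing.", "inline_before"))
```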
4.9. Few-Shot Analysis
To assess the effectiveness of the proposed summarization approach under low-resource scenarios, we conduct a detailed few-shot analysis by varying the amount of training data from 1% to 100%. Specifically, we evaluate the performance of BART-base with the GenTerm+LM-SFT strategy across seven data proportions: 1%, 5%, 10%, 30%, 50%, 70%, and 100%. The results are presented in Figure 4, which reports ROUGE-1, ROUGE-2, and ROUGE-L scores for both the CTIS and APS tasks.
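The few-shot splits can be constructed by subsampling fixed fractions of the training set, as sketched below; the placeholder data, random seed, and sampling scheme are illustrative assumptions.

```python
import random

def sample_fraction(train_examples, fraction: float, seed: int = 42):
    """Randomly sample a fixed fraction of the training set for a few-shot run."""
    rng = random.Random(seed)
    n = max(1, int(len(train_examples) * fraction))
    return rng.sample(train_examples, n)

# Placeholder standing in for the real CTISum training instances.
train_examples = list(range(1000))

for fraction in (0.01, 0.05, 0.10, 0.30, 0.50, 0.70, 1.00):
    subset = train_examples if fraction == 1.00 else sample_fraction(train_examples, fraction)
    print(f"{fraction:.0%}: {len(subset)} examples")
    # ... fine-tune BART-base with GenTerm+LM-SFT on `subset` and evaluate on the test split ...
```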
In the CTIS task, we observe consistent and substantial performance gains as the training data proportion increases: ROUGE-1 improves from 36.78 (1%) to 57.37 (100%), ROUGE-2 from 10.92 to 24.20, and ROUGE-L from 22.40 to 36.06. Notably, even at the 1% level, the model achieves reasonable scores, demonstrating its ability to generalize under extremely limited supervision. Significant improvements are observed with just 5% of the training data, and performance continues to improve steadily thereafter.
In the APS task, a similar trend is observed: ROUGE-1 rises from 37.22 (1%) to 57.87 (100%), ROUGE-2 from 11.04 to 24.07, and ROUGE-L from 22.18 to 39.42. The performance gap between 1% and 10% settings is particularly large, suggesting that even a small increase in data leads to notable gains in complex tasks like APS. The model reaches competitive performance with 30–50% of the data and saturates around 70%, confirming its efficiency and scalability.
These results validate the low-resource applicability of our term-guided summarization approach. The model exhibits strong learning capability from very few examples and achieves near-optimal performance with only a fraction of the training data. This makes the proposed framework particularly suitable for domains where annotated CTI summaries are scarce or expensive to obtain.