1. Introduction
Amidst accelerating global digitalization, cybersecurity threats have escalated in both sophistication and frequency, while the proliferation of unstructured data (e.g., log files and network traffic records) presents significant analytical challenges [1]. Recent years have witnessed a surge in cybersecurity incidents, with associated damages growing increasingly severe. The 2017 Equifax breach exemplifies this trend: attackers exploited vulnerabilities in Apache Struts to exfiltrate sensitive data (including names, addresses, and social security numbers) from over 147 million users, resulting in unprecedented economic damages and systemic societal repercussions [2]. This incident not only underscores the growing complexity of cyberattack methodologies but also reveals the fundamental limitations of conventional security approaches when processing unstructured data. Malicious actors increasingly obfuscate their attack vectors through advanced techniques, such as polymorphic malware and AI-driven phishing campaigns [3], thereby exacerbating defense complexities. Consequently, developing robust methods for extracting actionable intelligence from massive unstructured datasets constitutes a critical research imperative for advancing cybersecurity analytics capabilities.
In response to these challenges, cybersecurity researchers and practitioners are actively pursuing advanced technical solutions. The precise identification and analysis of attack patterns, threat intelligence, and vulnerability information constitutes the cornerstone for enabling timely incident detection and response [4,5]. Recent national-level initiatives are establishing standardized threat intelligence frameworks through the synergistic integration of expert knowledge and machine learning techniques to enhance behavioral analysis and response efficacy. Within this paradigm, knowledge graph technology has emerged as a transformative solution for cybersecurity through structured data organization and semantic analysis. Initially developed to enhance search engine efficiency via structured entity–relationship representations, knowledge graphs now enable the systematic storage and semantic reasoning of cybersecurity entities, including threat indicators (e.g., malicious IP addresses and malware families [6]), attack vectors (e.g., DDoS (Distributed Denial of Service) attacks [7] and SQL injection [8]), and vulnerability data (e.g., CVE (Common Vulnerabilities and Exposures) entries). This approach not only facilitates automated threat interrogation and decision support but also empowers security analysts to efficiently identify critical attack chain components and construct comprehensive attack provenance graphs, thereby substantially strengthening cyber defense mechanisms [9].
Over 70 nations worldwide have implemented nationally coordinated cybersecurity strategies and legal frameworks to combat evolving cyber threats [10]. Nevertheless, the effective integration of knowledge graph technology with unstructured security data remains a critical research frontier in cybersecurity. This challenge manifests in three core dimensions: (1) the efficient extraction of critical entities from unstructured data streams, (2) the maintenance of contextual coherence during segmentation processes, and (3) the robust handling of multi-layered semantic relationships. Conventional document segmentation approaches frequently induce semantic fragmentation, significantly compromising the structural integrity and ontological validity of knowledge graphs [11]. The synergistic combination of knowledge graphs and unstructured security data enables semantic reasoning and contextual association mining across massive datasets, thereby delivering enhanced decision support for cyber defense systems. Advancements in this domain promise to advance cyber defense maturity levels, strengthen security postures for governmental, corporate, and individual stakeholders, and mitigate systemic risk exposure [12,13].
Therefore, this paper proposes SC-LKM, a cybersecurity knowledge graph construction method that activates the document lake through hierarchical semantic chunking, so that data segmentation stays consistent with the inherent document semantics and information fragmentation is alleviated while the structural integrity of the knowledge graph is maintained. By integrating a large language model for high-level language understanding, SC-LKM jointly optimizes knowledge extraction accuracy and multi-hop reasoning ability, thereby improving the accuracy and operational utility of the resulting knowledge graphs. Extensive benchmark experiments show that SC-LKM achieves statistically significant advantages in constructing cybersecurity knowledge graphs.
In summary, this work makes three primary contributions:
- (1) Proposing a novel integration of the GraphRAG framework with adaptive semantic chunking mechanisms;
- (2) Developing a hierarchical semantic chunking strategy that dynamically adapts to document structure and semantic coherence patterns;
- (3) Establishing an LLM-enhanced knowledge extraction pipeline with multi-hop reasoning capabilities for cybersecurity threat intelligence.
2. Related Work
Cybersecurity knowledge graph technology has evolved from rule-based systems to intelligent learning paradigms. Early rule-driven approaches exhibited inherent limitations in processing multi-source heterogeneous intelligence and adapting to dynamic threat landscapes. Contemporary methodologies leverage deep learning and generative models to enhance dynamic threat modeling capabilities. Zhao et al. [14] provide a comprehensive overview of this progression, highlighting the shift toward neural architectures, pre-trained language models, and hybrid KG-LLM frameworks in cyber threat intelligence (CTI) systems.
Jia et al. [15] developed a cybersecurity knowledge base using a five-element model, implementing domain-specific entity recognition through Stanford NER augmented with gazetteer features for threat entity extraction and ontology construction. Han and Wang [16] extended multi-entity relation extraction in long texts via entity masking with BERT embeddings. Mouiche and Saad [17] proposed the TiKG framework, integrating SecureBERT domain-adaptive pre-training with attention-enhanced Bidirectional LSTM (BiLSTM) architectures. This approach enables accurate APT (Advanced Persistent Threat) entity identification and low-error propagation of threat relationships through domain-specific ontological constraints, demonstrating enhanced construction accuracy across four benchmark datasets including DNRTI. Du et al. [18] implemented vulnerability knowledge fusion through BLSTM-CRF-based attribute extraction and hierarchical rule-based relationship triggering. Li et al. [19] introduced a BiLSTM-GNN hybrid architecture that captures contextual sequence patterns and performs graph-based relation extraction. Sangher et al. [20] further explored deep sequential models for intent identification on social platforms and darknet forums, showing that LSTM + BERT pipelines can surface implicit exploitation cues valuable for CTI enrichment.
Hu et al. [21] devised the LLM-TIKG framework, employing GPT-powered few-shot learning to generate labeled data for 7B Llama2 fine-tuning, achieving the end-to-end extraction of threat intelligence entity-TTP-relation triples. Zhang et al. [22] developed the AttackKG+ framework, utilizing LLM collaboration through rewriter–parser–identifier–digester modules to construct multidimensional threat maps with MITRE TTP integration. Huang and Xiao [23] proposed the CKG method, processing multi-source CTI articles via segmented multi-LLM agent architecture and dual-memory mechanisms to establish traceable entity–relationship graphs. Paul et al. [24] extended this direction by introducing proactive LLM-based threat reasoning that infers attacker goals and behavior chains, while Wu et al. [25] integrated knowledge-graph constraints into LLM outputs to improve the credibility assessment of extracted intelligence.
Despite these advances, existing methods predominantly overlook the disruptive effects of semantic segmentation on knowledge integrity, leading to information fragmentation and global reasoning failures. This work therefore proposes a segmentation-aware extraction framework to ensure semantic continuity and holistic reasoning across long CTI documents.
3. Materials and Methods
3.1. Dataset
Our threat intelligence corpus comprises three primary sources: (1) open-source intelligence (OSINT) reports, (2) commercial security vendor disclosures, and (3) dark web underground forums. Unlike general-domain corpora such as news articles or encyclopedic content, CTI texts exhibit a range of domain-specific characteristics. These include highly specialized entities (e.g., CVE IDs, malware hashes, and MITRE TTPs), diverse structural styles due to multi-source origins, and the frequent use of implicit or obfuscated threat semantics. The corpus also tends to have low annotation density and suffers from severe class imbalance, with actionable threat information sparsely distributed among mostly descriptive content.
These challenges make CTI corpora significantly more complex to process than conventional datasets, particularly in tasks involving knowledge extraction and semantic segmentation. The raw data is processed through MinerU, an open-source Python library, following a standardized conversion pipeline from PDF to Markdown to plaintext. During PDF-to-text conversion, MinerU automatically strips PDF metadata (headers/footers) and non-textual elements while preserving semantic integrity. Following format conversion, the plaintext files undergo rigorous cleansing to eliminate irrelevant metadata, non-linguistic symbols, and encoding artifacts. This dual-stage preprocessing ensures textual consistency and prepares the corpus for subsequent knowledge extraction tasks.
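The cleansing stage is described only at a high level, so the following is a minimal sketch of what such a pass could look like, not the authors' actual procedure. The function name `clean_text`, the choice of NFKC normalization, and the page-number pattern are all assumptions made for illustration.

```python
import re
import unicodedata

# Assumed debris classes: ASCII control characters, zero-width space, BOM.
CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f\u200b\ufeff]")
# Assumed page-furniture pattern: lines that are only "12" or "Page 12".
PAGE_RE = re.compile(r"^\s*(page\s+)?\d+\s*$", re.IGNORECASE)


def clean_text(raw: str) -> str:
    """Normalize encoding artifacts and drop empty or page-number lines."""
    text = unicodedata.normalize("NFKC", raw)   # unify compatibility forms
    text = CONTROL_RE.sub("", text)             # remove invisible debris
    kept = []
    for line in text.splitlines():
        line = re.sub(r"[ \t]+", " ", line).strip()  # collapse whitespace
        if line and not PAGE_RE.match(line):
            kept.append(line)
    return "\n".join(kept)


sample = "Page 3\nAPT29\u200b  exploited\tT1190.\x00\n\n12\n"
cleaned = clean_text(sample)
```

A real corpus would likely need source-specific rules (for example, forum signatures or vendor disclaimers), which this sketch does not attempt.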
3.2. Semantic Chunking-Enhanced Knowledge Graph Construction
3.2.1. Semantic Chunking Methodology
Traditional text-segmentation approaches primarily employ fixed-length windowing or punctuation-based heuristics. Although computationally efficient, these methods compromise semantic coherence and can induce information fragmentation that degrades downstream task performance. In critical applications such as information retrieval and natural-language processing (NLP) [26], the quality of text chunks largely determines system efficacy.
Building on an initial coarse split that respects the chapter–section–paragraph hierarchy, our method further refines boundaries through paragraph-level semantic analysis. Semantic chunking operates through structural analysis of paragraph-level semantics, combining topic-distribution similarity and named-entity tracking to identify coherent boundaries. As shown in Figure 1, our method detects natural discourse boundaries (e.g., section transitions and entity clusters) and aggregates logically related content into coherent semantic units. This process enhances both local semantic integrity and global contextual awareness, which are essential for accurate information retrieval.
In contrast to rule-based segmentation, our semantic chunking methodology maintains three-dimensional context preservation:
Vertical Context: Retains hierarchical document structure (chapter→section→paragraph).
Horizontal Context: Preserves entity–relationship continuity across adjacent chunks.
Temporal Context: Maintains the original event sequence within chunk boundaries, ensuring logical flow is preserved in narrative-based documents.
To improve reproducibility, Algorithm 1 formally details the hierarchical semantic-chunking procedure introduced in the abstract. After an initial structural split that respects chapter–section–paragraph boundaries, the algorithm iteratively refines those segments by paragraph-level analysis, detecting chunk breaks with topic similarity and named-entity overlap to preserve discourse coherence and entity continuity.
Algorithm 1 Semantic chunking algorithm.
/* Stage 1 (structure) is applied beforehand. This pseudocode implements Stage 2: semantic refinement. */
Require: Document $D$ with paragraphs $p_1, \dots, p_n$
Ensure: Semantic chunks $C$
1: Initialize empty chunk list $C$
2: Initialize current chunk $c \leftarrow \emptyset$; current token length $\ell \leftarrow 0$
3: for $i = 1$ to $n$ do
4:     Extract paragraph $p_i$
5:     Compute topic similarity $s_i$
6:     Extract named entities $E_i$
7:     Compute entity overlap $o_i$
8:     Get length $\ell_i$
9:     if $s_i < \tau_s$ or $o_i < \tau_e$ or $\ell + \ell_i > L_{\max}$ then
10:         Append $c$ to $C$
11:         Reset $c \leftarrow \emptyset$; $\ell \leftarrow 0$
12:     end if
13:     Append $p_i$ to $c$; $\ell \leftarrow \ell + \ell_i$
14: end for
15: if $c \neq \emptyset$ then
16:     Append $c$ to $C$
17: end if
18: return $C$
Algorithm 1 therefore realizes the two-level segmentation strategy: document hierarchy first, semantic refinement second. By combining topic similarity, entity continuity, and length control, it produces coherent chunks that avoid both abrupt cuts and overlong segments, which is crucial for reliable knowledge-graph construction and downstream threat-intelligence analysis. Through discourse-level continuity preservation, our approach establishes reliable feature representations for knowledge-graph construction and threat-intelligence extraction [27,28].
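The refinement stage can be illustrated with a small, self-contained sketch. It is one possible reading of Algorithm 1, not the authors' implementation: topic similarity is approximated by bag-of-words cosine similarity, named-entity extraction by a regular expression over a few CTI identifier formats, and the three thresholds (`tau_sim`, `tau_ent`, `max_tokens`) are hypothetical values chosen only for the example.

```python
import math
import re
from collections import Counter

# Stand-in for a real NER model: CVE IDs, APT group names, ATT&CK technique IDs.
ENTITY_RE = re.compile(r"CVE-\d{4}-\d{4,}|APT\d+|T\d{4}(?:\.\d{3})?")


def bag_of_words(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def semantic_chunks(paragraphs, tau_sim=0.2, tau_ent=0.2, max_tokens=200):
    """Group consecutive paragraphs; break on low similarity, low entity
    overlap, or when the chunk would exceed the length budget."""
    chunks, current, cur_len = [], [], 0
    for p in paragraphs:
        p_len = len(p.split())  # whitespace tokens as a crude length measure
        if current:  # never emit an empty chunk for the first paragraph
            joined = " ".join(current)
            sim = cosine(bag_of_words(joined), bag_of_words(p))
            overlap = jaccard(set(ENTITY_RE.findall(joined)),
                              set(ENTITY_RE.findall(p)))
            if sim < tau_sim or overlap < tau_ent or cur_len + p_len > max_tokens:
                chunks.append(current)
                current, cur_len = [], 0
        current.append(p)
        cur_len += p_len
    if current:
        chunks.append(current)
    return chunks


paragraphs = [
    "APT29 exploited CVE-2021-44228 to gain initial access to the mail server.",
    "After exploiting CVE-2021-44228, APT29 deployed a loader on the mail server.",
    "Quarterly budget planning for the finance team begins next month.",
]
chunks = semantic_chunks(paragraphs)
```

In this toy input the first two paragraphs share vocabulary and entities and are kept together, while the unrelated third paragraph starts a new chunk. With a nonzero `tau_ent`, paragraphs containing no recognized entities always trigger a break, so a production system would need a more forgiving overlap rule.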
3.2.2. Knowledge Graph Construction
Retrieval-Augmented Generation (RAG) systems exhibit critical limitations when processing multi-source cybersecurity documents, particularly in preserving contextual relationships across segmented texts and maintaining the temporal coherence of threat intelligence.
The GraphRAG framework [29] introduces a domain-specific knowledge graph construction methodology as visualized in Figure 2. The initial document aggregation encompasses three intelligence streams: public security reports from entities such as CISA (Cybersecurity and Infrastructure Security Agency) and MITRE ATT&CK (Adversarial Tactics, Techniques, and Common Knowledge), technical advisories from cybersecurity vendors, and unstructured data from dark web forums monitoring advanced persistent threat (APT) group activities. During the data preparation phase, adaptive text extraction preserves the original document structures while semantic chunking decomposes content into coherent text blocks with metadata annotations.
Subsequent knowledge extraction deploys large language models (LLMs) equipped with cybersecurity-specific named entity recognition (NER) to identify core elements including threat actors (e.g., APT29), attack techniques (T1190 exploitation), and associated tools (CobaltStrike). Relationship classifiers concurrently detect tactical patterns such as “Threat Actor executes Attack Technique” and “Incident is initiated by Threat Actor”. The constructed graph objects formalize these discoveries through typed nodes (Threat Actor, Tool, and TTPs (Tactics, Techniques, and Procedures)) and directional edges encoding both technical relationships (“uses”, “develops”) and tactical sequences (“precedes”, “escalates”).
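To make the notion of typed nodes and directional edges concrete, here is a minimal in-memory sketch. The class names (`Node`, `Edge`, `ThreatGraph`) and the specific links between APT29, T1190, and CobaltStrike are illustrative assumptions; the paper names these entities as examples but does not state the data structures used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    kind: str  # e.g. "Threat Actor", "Tool", "TTP"


@dataclass(frozen=True)
class Edge:
    source: Node
    relation: str  # e.g. "uses", "executes", "precedes"
    target: Node


class ThreatGraph:
    """Tiny directed, typed graph; a stand-in for a graph database."""

    def __init__(self):
        self.nodes = set()
        self.edges = set()

    def add_edge(self, source, relation, target):
        self.nodes.update((source, target))
        self.edges.add(Edge(source, relation, target))

    def targets(self, source_name, relation):
        """Names of nodes reached from source_name via the given relation."""
        return sorted(e.target.name for e in self.edges
                      if e.source.name == source_name and e.relation == relation)


apt29 = Node("APT29", "Threat Actor")
t1190 = Node("T1190", "TTP")
cobalt = Node("CobaltStrike", "Tool")

graph = ThreatGraph()
graph.add_edge(apt29, "executes", t1190)  # illustrative link only
graph.add_edge(apt29, "uses", cobalt)     # illustrative link only
```

Because edges are directional, `graph.targets("CobaltStrike", "uses")` returns an empty list even though the reverse direction holds.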
This structured approach overcomes traditional limitations through two key innovations: context-aware text segmentation that retains document-specific features (headers, code blocks, and IOC (Indicator of Compromise) tables), and temporal graph embeddings that encode APT campaign lifecycle patterns. The resultant knowledge graph enables the effective fusion of heterogeneous threat intelligence while preserving the original contextual relationships.
To support evolving intelligence environments, the system leverages the inherent incremental update capability of GraphRAG. Newly collected documents are processed through the same semantic chunking and extraction pipeline. The extracted entities and relations are aligned with the existing knowledge graph without reprocessing the full corpus, allowing efficient updates in streaming or batch ingestion scenarios.
GraphRAG performs entity and relation consolidation through embedding-based soft alignment. Specifically, newly extracted nodes and edges are embedded into the same vector space as the existing graph, and cosine similarity is computed against existing nodes. If the similarity exceeds a predefined threshold and entity types match, the new element is merged with the existing one; otherwise, it is added as a distinct node. This approach reduces redundancy and maintains graph coherence while adapting to dynamic threat intelligence inputs.
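The merge rule in the preceding paragraph can be sketched as follows. This is a simplified illustration under stated assumptions, not GraphRAG's internal code: the function `merge_entity`, the dictionary layout, the toy three-dimensional vectors, and the 0.9 threshold are all hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_entity(graph, name, kind, vec, threshold=0.9):
    """Merge into the most similar same-type node above the threshold,
    otherwise add a new node. Returns the key of the resulting node."""
    best_key, best_sim = None, threshold
    for key, node in graph.items():
        if node["kind"] != kind:      # entity types must match
            continue
        sim = cosine(node["vec"], vec)
        if sim > best_sim:            # similarity must exceed the threshold
            best_key, best_sim = key, sim
    if best_key is not None:
        graph[best_key]["aliases"].add(name)
        return best_key
    graph[name] = {"kind": kind, "vec": vec, "aliases": {name}}
    return name


# Toy embeddings; real vectors would come from the embedding model.
graph = {"Cobalt Strike": {"kind": "Tool", "vec": [1.0, 0.0, 0.2],
                           "aliases": {"Cobalt Strike"}}}
merged = merge_entity(graph, "CobaltStrike", "Tool", [0.98, 0.05, 0.21])
new_tool = merge_entity(graph, "Mimikatz", "Tool", [0.0, 1.0, 0.0])
new_actor = merge_entity(graph, "APT29", "Threat Actor", [1.0, 0.0, 0.2])
```

The last call shows the type check at work: the vector is identical to an existing node's, but the differing type keeps the two entities separate.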
3.2.3. Generative and Embedding Model Synergy
Generative and embedding models serve distinct yet complementary functions in natural language processing ecosystems. Generative models synthesize contextually coherent textual outputs through autoregressive prediction mechanisms, excelling in open-ended tasks like contextual question answering, abstractive summarization, and dialogue generation. These models dynamically create novel content by extrapolating patterns from training data rather than relying on static retrieval. Conversely, embedding models encode textual inputs into dense vector representations through deep metric learning, projecting semantic relationships into a latent space where geometric distances correspond to conceptual similarities. This vector-space embedding enables efficient semantic search, clustering, and anomaly detection via similarity metrics like cosine distance.
Within the GraphRAG architecture, these model classes interact through a tightly coupled feedback loop. The embedding model constructs vector representations of threat intelligence documents, establishing semantic neighborhoods for related security entities such as malware variants sharing code signatures. During retrieval phases, these embeddings guide the generative model to relevant knowledge subgraphs through attention-based similarity matching, creating dynamic pathways between conceptual clusters. The generative model subsequently synthesizes contextualized responses by traversing connected entities in the retrieved subgraph, ensuring output consistency with both local evidence and global knowledge topology.
This symbiotic relationship achieves critical enhancements through three interdependent mechanisms. Semantic alignment emerges as embedding vectors ground generative outputs in domain-specific knowledge structures, while contextual grounding ensures generated text maintains consistency with retrieved evidence subgraphs. Notably, computational efficiency gains materialize through vector-space indexing: with an approximate nearest-neighbor index, search cost can drop from linear to roughly logarithmic in corpus size, a vital feature when processing terabyte-scale threat intelligence corpora.
As illustrated in Figure 3, the framework’s dual-model architecture enables simultaneous precision in semantic matching (via embeddings) and contextual adaptability (via generation), particularly crucial when processing polymorphic threat indicators like evolving APT tactics. The continuous interaction between vector-space reasoning and symbolic generation forms the cornerstone of our approach to dynamic knowledge graph construction.
In the left panel, the embedding model encodes threat intelligence documents into vector representations, constructs a semantic vector space, and facilitates knowledge retrieval through similarity matching. The generative model (right) synthesizes responses by traversing the graph structure, leveraging both the retrieved knowledge subgraph and contextual information. These components establish a tightly coupled feedback loop via attention mechanisms, where semantic vectors provide generation guidance, while generated outputs maintain consistency with both local evidence and global knowledge topology, ultimately ensuring optimal balance between semantic matching precision and contextual adaptability.
3.3. Knowledge Graph Quality Evaluation
To comprehensively assess the constructed cybersecurity knowledge graph, we establish three principal criteria (structural integrity [30], information relevance [31], and semantic consistency [32]) through multidimensional quantitative analysis. Each subsequent subsection details the metric formulation and discusses its practical significance in cybersecurity applications.
3.3.1. Cluster Quality Assessment
Cluster analysis serves as a fundamental methodology for evaluating entity classification rationality in cybersecurity knowledge graphs, particularly for attack pattern categorization, vulnerability correlation analysis, and APT group profiling. Our evaluation framework incorporates two complementary metrics:
The Silhouette Coefficient quantifies intra-cluster cohesion versus inter-cluster separation through geometric analysis in the embedding space. For an entity $i$, its coefficient $s(i)$ is defined as

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where $a(i)$ denotes the mean intra-cluster distance, and $b(i)$ represents the smallest mean inter-cluster distance. The overall silhouette score $S$ is computed as the average over all entities:

$$S = \frac{1}{n} \sum_{i=1}^{n} s(i)$$

Here, $n$ is the total number of entities in the dataset, and a higher value of $S$ indicates more compact and well-separated clusters. Complementing this, the Davies-Bouldin Index (DBI) evaluates cluster compactness-to-separation ratios:

$$\mathrm{DBI} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{\lVert \mu_i - \mu_j \rVert_2}$$

Here, $k$ is the number of clusters, $\sigma_i$ and $\sigma_j$ represent the mean intra-cluster distances for clusters $C_i$ and $C_j$, while $\mu_i$ and $\mu_j$ denote their centroid vectors in the embedding space. The L2-norm $\lVert \mu_i - \mu_j \rVert_2$ quantifies inter-cluster separation.
In operational cybersecurity analytics, a high silhouette score implies that threat entities—such as malware strains or vulnerability clusters—are well-defined and distinct in the embedding space, facilitating precise threat attribution. Conversely, lower DBI values signify compact and well-separated clusters, which support the robust classification of APT groups and reduce ambiguity in campaign linkage. These metrics directly affect the interpretability and utility of knowledge graphs for decision support and automated threat reasoning.
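Both cluster metrics are easy to compute directly, which helps when sanity-checking reported values. The sketch below follows the standard textbook definitions using Euclidean distance in pure Python; the function names and the four toy points are assumptions for illustration, and the intra-cluster spread in the DBI is taken as the mean distance to the centroid.

```python
import math


def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def group(points, labels):
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    return clusters


def silhouette_score(points, labels):
    """Mean of (b - a) / max(a, b); requires at least two clusters."""
    clusters = group(points, labels)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != label)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def davies_bouldin(points, labels):
    """Mean over clusters of the worst (spread_i + spread_j) / centroid gap."""
    clusters = group(points, labels)
    centroids = {k: tuple(sum(c) / len(pts) for c in zip(*pts))
                 for k, pts in clusters.items()}
    spread = {k: sum(dist(p, centroids[k]) for p in pts) / len(pts)
              for k, pts in clusters.items()}
    keys = list(clusters)
    total = sum(max((spread[i] + spread[j]) / dist(centroids[i], centroids[j])
                    for j in keys if j != i) for i in keys)
    return total / len(keys)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
sil = silhouette_score(points, labels)
dbi = davies_bouldin(points, labels)
```

For these two tight, well-separated clusters the silhouette score is close to 1 and the DBI is close to 0, matching the interpretation given above.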
3.3.2. Correlation Analysis
Understanding the entity relationships within cybersecurity knowledge graphs requires the rigorous analysis of attack method–vulnerability–APT group correlations. Our framework employs Pointwise Mutual Information (PMI) to quantify co-occurrence patterns in threat intelligence events:

$$\mathrm{PMI}(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)}$$

Here, $\mathrm{PMI}(x, y)$ measures the statistical dependence between entities $x$ and $y$, where elevated values indicate strong operational relationships (e.g., specific malware variants exploiting particular CVEs). The joint probability $P(x, y)$ reflects their co-occurrence frequency in security incidents, while $P(x)$ and $P(y)$ represent marginal occurrence probabilities. This metric effectively identifies latent threat actor infrastructure sharing patterns and attack tool reuse behaviors [33]. Recent advances in knowledge representation learning further validate the efficacy of PMI for cybersecurity relationship modeling [34].
High PMI values in threat intelligence data reveal statistically significant co-occurrence patterns between tactics, vulnerabilities, and malware tools. This informs threat propagation modeling and attack path inference, enabling the prioritization of high-risk vulnerabilities and enhancing proactive defense planning.
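As a worked illustration of PMI, the sketch below estimates probabilities from co-occurrence counts over a handful of events. The events are invented for the example and carry no empirical meaning; the base-2 logarithm and the function name `pmi_table` are likewise assumptions, since the text does not specify them.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(events):
    """PMI for every entity pair that co-occurs in at least one event."""
    n = len(events)
    single, pair = Counter(), Counter()
    for event in events:
        entities = sorted(set(event))
        single.update(entities)
        pair.update(combinations(entities, 2))  # keys are sorted tuples
    return {
        (x, y): math.log2((count / n) / ((single[x] / n) * (single[y] / n)))
        for (x, y), count in pair.items()
    }


# Hypothetical incident records, each a set of co-mentioned entities.
events = [
    {"Emotet", "CVE-2017-11882"},
    {"Emotet", "CVE-2017-11882"},
    {"Emotet", "phishing"},
    {"TrickBot", "phishing"},
]
pmi = pmi_table(events)
```

Here `("TrickBot", "phishing")` receives a positive score because TrickBot never appears without phishing, while `("Emotet", "phishing")` is negative because the two co-occur less often than their individual frequencies would suggest. With so few events the estimates are unstable, which is why PMI is usually paired with a minimum-count filter in practice.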
3.3.3. Semantic Consistency Evaluation
Entity-type distribution consistency across knowledge subgraphs serves as a critical quality indicator. We adopt Shannon entropy to quantify semantic coherence:

$$H = -\sum_{i} p_i \log p_i$$

Here, $H$ measures the uniformity of entity-type distributions within text units, with $p_i$ denoting the relative frequency of type $i$. Lower entropy values suggest concentrated entity distributions characteristic of well-structured threat narratives, while higher values indicate fragmented semantic patterns. Originally developed for medical informatics [35], this entropy-based approach effectively identifies ontological inconsistencies in cybersecurity knowledge graphs, guiding entity relationship optimization and taxonomy refinement.
Entropy quantifies the distributional focus of entity types within a knowledge graph. Lower entropy values indicate semantically coherent knowledge segments, which are critical for constructing domain-aligned threat ontologies and reducing analyst burden. Elevated entropy may suggest noise or topic drift, warranting further entity filtering or refinement.
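The entropy of an entity-type distribution can be computed in a few lines. The sketch below assumes a base-2 logarithm (the text does not state the base) and uses made-up type labels; `type_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter


def type_entropy(entity_types):
    """Shannon entropy (in bits) of the entity-type distribution of one unit."""
    counts = Counter(entity_types)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


focused = ["Malware", "Malware", "Malware", "Malware"]
scattered = ["Malware", "Tool", "Threat Actor", "CVE"]

h_focused = type_entropy(focused)
h_scattered = type_entropy(scattered)
```

A text unit mentioning only one entity type has zero entropy, while four equally frequent types give the maximum of two bits for that unit, which is the contrast between concentrated and fragmented distributions described above.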
4. Experimental Results
4.1. Experimental Setup
The experimental platform comprised an Intel Xeon Gold 6126 CPU with 32 physical cores and dual NVIDIA Tesla V100 GPUs (16GB VRAM each), hosted on an Ubuntu 18.04.6 LTS operating system. The software stack utilized Python 3.10 with CUDA (Compute Unified Device Architecture) 11.8 acceleration, implemented through PyCharm 2023.2 IDE. Graph data persisted in the Neo4j 5.11 graph database, while knowledge graph visualizations were rendered using Gephi 0.10.1 with Fruchterman–Reingold force-directed layout algorithms.
4.2. Model Selection
4.2.1. Generative Model Benchmarking
We selected Qwen2.5-14b-instruct as the core generative model for our GraphRAG framework after extensive benchmarking across five critical dimensions:
Information Extraction: Precision in parsing structured data from unstructured cyber threat intelligence (CTI).
Machine Reading Comprehension (MRC): Contextual understanding of technical reports.
BIG-Bench Hard (BBH): Multi-step logical reasoning capability.
Commonsense Reasoning: Domain-specific cybersecurity intuition.
Deductive Reasoning: Inferring attack patterns from partial indicators.
Table 1 presents a detailed comparison against Glm-4-9b-chat, DeepSeek-R1-Distill-Qwen-14B, Llama-3.1-70B-Instruct, and Internlm2_5-20b-chat. The balanced performance across reasoning and extraction tasks makes Qwen2.5-14b-instruct particularly suitable for constructing high-fidelity cybersecurity knowledge graphs in multilingual contexts.
Qwen2.5-14b-instruct demonstrates consistent superiority across key reasoning and extraction benchmarks. More importantly, it is pre-trained and instruction aligned on large-scale bilingual corpora, making it particularly effective for Chinese–English mixed inputs. In cybersecurity threat intelligence, documents frequently contain domain-specific Chinese terminology, transliterated attack names, and cross-language references. Qwen2.5 maintains contextual coherence and semantic precision under these conditions, outperforming other models in multilingual comprehension and structured knowledge extraction.
Compared with Llama-3.1-70B-Instruct and InternLM2_5-20b-chat, which are mainly optimized for English, Qwen2.5 shows stronger adaptability to Chinese-language inputs without requiring language-specific prompting. DeepSeek-R1-Distill-Qwen-14B, though based on the same architecture and parameter scale, underperforms, which we attribute to its more limited instruction tuning. Additionally, Qwen2.5 offers a favorable balance between performance and computational efficiency. Its 14B parameter size ensures robust capability in multi-step reasoning without incurring the resource cost of larger models. These properties make Qwen2.5 especially suitable for real-world cybersecurity knowledge graph construction in cross-lingual environments.
4.2.2. Impact of Embedding Models
The BGE-M3 embedding model [36] is selected for the GraphRAG framework based on three critical capabilities: multilingual processing efficiency, long-context retention, and hybrid retrieval support. Built upon the XLM-RoBERTa architecture, BGE-M3 integrates three complementary retrieval modes:
Dense Retrieval: Semantic similarity modeling via high-dimensional embeddings.
Sparse Retrieval: Lexical matching with learned term weights.
Multi-Vector Retrieval: Token-level late-interaction matching over contextualized token embeddings.
This architecture enables the effective processing of security documents up to 8192 tokens while maintaining contextual integrity—a crucial requirement for cyber threat intelligence analysis. The model’s native support for 100+ languages enhances cross-lingual threat indicator alignment, particularly beneficial for global threat intelligence integration.
In GraphRAG implementations, the hybrid retrieval capabilities of BGE-M3 facilitate precise semantic chunking and entity relationship discovery. Its balanced performance in recall and precision across heterogeneous security data streams demonstrates strong suitability for constructing domain-specific knowledge graphs with complex Tactics, Techniques, and Procedures (TTP) linkages.
4.3. Ablation Experiments
To systematically evaluate the SC-LKM framework’s core innovations, we conduct controlled experiments comparing hierarchical semantic chunking against conventional fixed-length segmentation. The ablation study maintains identical experimental conditions—including the Qwen2.5-14B-Instruct generator and entity extraction rules—while varying only the text segmentation strategy.
Analysis of multi-paragraph penetration test reports reveals critical limitations in fixed-length chunking approaches. Traditional 1200-character segmentation mechanically truncates attack vector descriptions, inducing semantic discontinuities that disrupt threat pattern recognition. In contrast, SC-LKM’s dynamic chunk granularity adaptation preserves cross-paragraph contextual relationships while achieving equivalent text coverage.
The experimental results in Table 2 demonstrate three principal findings. First, semantic chunking contributes over 95% of entity discovery capability compared to fixed-length approaches. Second, the Qwen2.5 model exhibits superior entity recognition performance, outperforming alternative models by an order of magnitude. Finally, the integrated framework achieves 28× improvement in relationship extraction accuracy relative to baseline methods, confirming the necessity of coordinated segmentation and generation components.
To complement the foregoing quantity-oriented analysis, we manually annotate a benchmark of 50 representative reports and evaluate precision, recall, and F1 for both entity and relation extraction. Each configuration is executed five times with independent random seeds to estimate performance variance. We report mean ± standard deviation and use paired two-tailed
t-tests (
) to assess the statistical significance of F1 improvements yielded by semantic chunking over fixed-length segmentation (i.e., uniform 500-token windows with 100-token overlap); significant differences are marked in
Table 3 with * (
).
For fair comparison, we report results only for configurations that satisfy a minimum entity–relation generation threshold
(fixed refers to a naïve sliding window with 500-token span and 100-token overlap).
Table 4 presents the mean ± standard deviation of four structural metrics over five independent runs. Statistical significance is assessed via two-tailed paired t-tests against the Qwen2.5-14B + semantic chunking baseline.
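For readers who want to reproduce the significance check, a paired t-test over five runs reduces to a short calculation. The sketch below is an illustration only: the F1 values are placeholders and not results from this study, and the 0.05 significance level is an assumption because the threshold is not preserved in the text. The critical value 2.776 is the standard two-tailed value for 4 degrees of freedom at that level.

```python
import math
from statistics import mean, stdev

# Two-tailed critical t value for alpha = 0.05 (assumed) and df = 5 - 1 = 4.
T_CRIT_DF4 = 2.776


def paired_t_statistic(a, b):
    """t statistic of the per-run differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Placeholder F1 scores for five seeds; NOT taken from the paper's tables.
semantic_f1 = [0.81, 0.83, 0.80, 0.82, 0.84]
fixed_f1 = [0.70, 0.73, 0.69, 0.72, 0.71]

t_stat = paired_t_statistic(semantic_f1, fixed_f1)
significant = abs(t_stat) > T_CRIT_DF4
```

The pairing matters: each seed's two configurations are compared against each other, so run-to-run noise shared by both cancels out. If all differences were identical, the sample standard deviation would be zero and the statistic undefined, so that case would need separate handling.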
The ablation study reveals two fundamental insights regarding semantic chunking mechanisms. First, their removal severely degrades entity recognition integrity and relational coherence, particularly in long-text scenarios where information fragmentation errors increase by 63% compared to baseline. Second, when implemented with the Qwen2.5-14B-Instruct model, semantic chunking enhances complex structure parsing efficiency through three mechanisms: context-aware boundary detection, dynamic chunk granularity adjustment, and cross-paragraph dependency preservation.
From the evaluation metrics perspective, the integrated framework demonstrates superior performance across all quality dimensions. The proposed method achieves 2.93× higher Silhouette Coefficient and 76% lower DBI compared to the baseline approaches, indicating tighter cluster structures and better inter-class discrimination. The PMI improvement of 87.5% reflects stronger semantic correlations between the threat entities, while entropy is reduced by approximately 40%, confirming the effective mitigation of semantic fragmentation.
These results confirm that semantic chunking, through structured context modeling, provides a robust foundation for long-text processing in cybersecurity applications.
4.4. Evaluation of Long-Term Stability Across Document Variants
To assess the long-term deployment robustness of SC-LKM, we simulate realistic formatting drifts commonly observed in evolving threat-report templates, such as dense punctuation, inconsistent layout, and paragraph realignment.
Based on the original cleaned penetration test corpus, we generate three semantically equivalent variants: Variant A (dense punctuation), Variant B (loose formatting), and Variant C (paragraph shift). All versions are processed using the same semantic chunking and knowledge extraction procedure without any retraining or configuration changes. The corresponding extraction results, shown in Table 5, indicate that SC-LKM consistently preserves both entity–relation coverage and graph structural quality across diverse formatting conditions.
Across all variants, entity and relation counts vary by less than 6.5%, and all quality metrics deviate by no more than 7%. These stable results indicate that SC-LKM maintains consistent extraction performance and graph structural integrity under heterogeneous formatting conditions, validating its robustness and adaptability for long-term use in real-world cybersecurity deployments.
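A stability check of this kind reduces to the largest relative deviation of each variant from the original corpus. The sketch below assumes deviation is measured relative to the original-corpus value; the counts are hypothetical and are not the values in Table 5.

```python
def max_relative_deviation(baseline, variants):
    """Largest |variant - baseline| / |baseline| over all variants."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return max(abs(v - baseline) / abs(baseline) for v in variants)


# Hypothetical entity counts: original corpus, then Variants A, B, and C.
deviation = max_relative_deviation(1000, [980, 1040, 1060])
within_tolerance = deviation < 0.065
```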
4.5. Analysis of Results
The constructed cybersecurity knowledge graph, comprising 1553 nodes and 1739 relationships, is visualized using the Neo4j graph database and Gephi visualization toolkit. As shown in Figure 4, the graph architecture integrates three node types: entity nodes representing security concepts, chunk nodes preserving local context units, and document nodes linking to original threat intelligence sources. Relationship types encompass both intra-entity connections and cross-node semantic linkages.
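The three node types can be pictured as a small schema. The following sketch is illustrative only: the field names and the edge type names `PART_OF` and `MENTIONED_IN` are hypothetical and are not taken from the SC-LKM implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentNode:
    doc_id: str
    source: str  # pointer to the original threat intelligence source


@dataclass(frozen=True)
class ChunkNode:
    chunk_id: str
    doc_id: str  # document the chunk was cut from
    text: str


@dataclass(frozen=True)
class EntityNode:
    name: str
    entity_type: str


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    edge_type: str


def link_chunk(chunk, entities):
    """Cross-node edges: chunk -> document, and each entity -> the chunk mentioning it."""
    edges = [Edge(chunk.chunk_id, chunk.doc_id, "PART_OF")]
    edges += [Edge(e.name, chunk.chunk_id, "MENTIONED_IN") for e in entities]
    return edges


doc = DocumentNode("d1", "report-001")
chunk = ChunkNode("c1", doc.doc_id, "Attackers exploited Apache Struts.")
edges = link_chunk(chunk, [EntityNode("Apache Struts", "Software")])
```

Keeping chunk and document nodes in the graph lets every extracted entity be traced back to the passage and report that support it.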
Gephi’s force-directed layout algorithm reveals distinct topological characteristics: document nodes exhibit radial dispersion patterns around core entity clusters, while relationship edges form dense interconnections between attack technique nodes. This visualization effectively captures hierarchical threat knowledge structures, where parent–child relationships between attack patterns manifest as multi-layer concentric distributions.
Neo4j provides native support for efficient graph storage and complex query operations. Through its Cypher query language, analysts can perform multi-hop relationship tracing to reconstruct attack chains and identify latent threat patterns. The visual interface enables the intuitive exploration of entity relationships, significantly enhancing threat intelligence analysis workflows. Partial visualization results are presented in Figure 5.
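As an illustration of multi-hop tracing, the sketch below builds a Cypher query that uses a variable-length path pattern. The node label `Entity`, the property `name`, and the helper `multi_hop_query` are assumptions about the graph schema, not details given in this paper. The hop bound is validated and written into the query text because Cypher does not accept a parameter for the length of a variable-length pattern.

```python
def multi_hop_query(max_hops=3, limit=25):
    """Build a Cypher query returning paths of 1..max_hops hops from a named entity."""
    if not isinstance(max_hops, int) or not 1 <= max_hops <= 10:
        raise ValueError("max_hops must be an integer between 1 and 10")
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    return (
        "MATCH path = (start:Entity {name: $name})"
        f"-[*1..{max_hops}]-(other:Entity) "
        f"RETURN path LIMIT {limit}"
    )


query = multi_hop_query(max_hops=3)
```

With the official Neo4j Python driver, such a query could be executed as `session.run(query, name="Apache Struts")`, where the entity name is supplied as a query parameter rather than concatenated into the string.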
5. Discussion
The SC-LKM framework addresses two fundamental challenges in cybersecurity knowledge graph construction through its dual optimization design. The hierarchical semantic chunking mechanism bridges the gap between unstructured text processing and structured knowledge representation, outperforming traditional fixed-length segmentation methods by preserving cross-paragraph contextual relationships. This proves particularly crucial when analyzing advanced persistent threats that exhibit long-term operational patterns across multiple intelligence sources.
Central to the framework’s success is the synergistic integration of Qwen2.5-14B-Instruct’s deep reasoning capabilities with graph-based knowledge representation. Through cross-source entity disambiguation in multilingual intelligence feeds and temporal correlation analysis of attack campaigns, the system demonstrates enhanced capacity for latent threat pattern discovery. Compared with conventional RAG implementations, our approach significantly reduces semantic fragmentation while maintaining computational efficiency, particularly when processing Chinese-language threat intelligence where contextual nuances substantially impact relationship interpretation.
The current limitations stem primarily from dependency on pre-defined threat taxonomies for chunk boundary detection and latency constraints in real-time streaming scenarios. Future improvements could incorporate adaptive learning mechanisms to dynamically adjust chunking granularity based on threat severity indicators, potentially enhancing real-time processing capabilities without compromising detection accuracy.
While SC-LKM demonstrates the effective integration of semantic chunking and knowledge graph construction, several challenges remain in practical deployment scenarios. Maintaining consistent entity resolution across heterogeneous intelligence sources is non-trivial, particularly when handling aliasing, overlapping semantic scopes, or evolving terminologies. Conflicting or outdated information further complicates real-time updates, necessitating robust mechanisms for knowledge validation and version control. Additionally, concept drift in threat actor behaviors and terminology evolution may reduce the long-term relevance of previously extracted knowledge, affecting graph consistency and reasoning accuracy.
Future research should explore dynamic knowledge graph management strategies that support continual learning, schema evolution, and temporal reasoning. Possible directions include integrating time-decay functions for outdated edges, designing version-aware node merging policies, and implementing lifelong learning frameworks that adapt to emerging threat patterns without catastrophic forgetting. These enhancements would improve the system’s adaptability and reliability in long-term cybersecurity applications.
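One possible form of a time-decay function for outdated edges is exponential decay with a half-life. The sketch below is only an illustration of that direction: the half-life of 180 days, the pruning threshold, and the edge representation are arbitrary assumptions.

```python
def decayed_weight(base_weight, age_days, half_life_days=180.0):
    """Exponentially decay an edge weight: it halves every `half_life_days`."""
    if half_life_days <= 0:
        raise ValueError("half_life_days must be positive")
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    return base_weight * 0.5 ** (age_days / half_life_days)


def prune_edges(edges, today, min_weight=0.1, half_life_days=180.0):
    """Keep (head, tail, weight, observed_day) edges whose decayed weight is still useful."""
    kept = []
    for head, tail, weight, observed_day in edges:
        current = decayed_weight(weight, today - observed_day, half_life_days)
        if current >= min_weight:
            kept.append((head, tail, current))
    return kept


# Hypothetical edges: (head, tail, initial weight, day the relation was last observed).
edges = [("actor-x", "malware-y", 1.0, 0), ("actor-x", "domain-z", 1.0, 700)]
fresh = prune_edges(edges, today=720)
```

In this example the relation observed 720 days ago decays to 0.0625 and is pruned, while the one observed 20 days ago is kept with most of its weight.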
6. Conclusions
This study presents SC-LKM, a semantic-enhanced framework that advances cybersecurity knowledge graph construction through three principal innovations. The dynamic hierarchical chunking mechanism preserves document-level semantic coherence through context-aware boundary detection and adaptive granularity adjustment. Deep integration of large language model reasoning with graph-based knowledge representation enables the precise extraction of complex entity relationships from unstructured threat intelligence. Automated context propagation across multi-source data streams further ensures topological consistency in the constructed knowledge graphs.
Experimental results validate the framework’s superiority in entity relationship extraction and threat pattern recognition compared to conventional methods. These improvements are reflected in enhanced extraction accuracy, more coherent semantic chunking, and better preservation of topological structure across diverse unstructured inputs. By fundamentally addressing information fragmentation in long-text intelligence processing, SC-LKM establishes new technical foundations for intelligent security analytics. The system’s effectiveness in processing Chinese-language threat data opens new possibilities for regional cybersecurity infrastructure development. These findings collectively support the reliability and practicality of the proposed framework.
Future research directions should focus on multi-modal intelligence integration, particularly incorporating network traffic visualizations and malware behavioral profiles to enrich threat context understanding. Cross-lingual knowledge alignment mechanisms and real-time adaptive processing for streaming intelligence analysis represent additional critical frontiers for advancing automated threat detection systems.
Author Contributions
Conceptualization, P.W. and Y.Z.; Methodology, Z.Z.; Software, P.W. and Y.W.; Validation, P.W. and Z.Z.; Formal analysis, P.W. and Y.W.; Investigation, Z.Z.; Resources, Y.Z.; Data curation, Z.Z.; Writing—original draft preparation, P.W.; Writing—review and editing, Y.Z. and Y.W.; Visualization, Z.Z.; Supervision, Y.Z.; Project administration, Y.Z.; Funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the National Natural Science Foundation of China (grant number 62176023) and supported by the Beijing Natural Science Foundation (grant number L233008).
Data Availability Statement
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
CSC | Cybersecurity |
KG | Knowledge Graph |
ML | Machine Learning |
AI | Artificial Intelligence |
CVE | Common Vulnerabilities and Exposures |
APT | Advanced Persistent Threats |
LLM | Large Language Model |
RAG | Retrieval-Augmented Generation |
References
- Bansal, B.; Jenipher, V.N.; Jain, R.; Dilip, R.; Kumbhkar, M.; Pramanik, S.; Roy, S.; Gupta, A. Big data architecture for network security. Cyber Secur. Netw. Secur. 2022, 233–267.
- Novak, A.N.; Vilceanu, M.O. “The internet is not pleased”: Twitter and the 2017 Equifax data breach. Commun. Rev. 2019, 22, 196–221.
- Chanti, S.; Chithralekha, T. A literature review on classification of phishing attacks. Int. J. Adv. Technol. Eng. Explor. 2022, 9, 446–476.
- Sun, N.; Ding, M.; Jiang, J.; Xu, W.; Mo, X.; Tai, Y.; Zhang, J. Cyber threat intelligence mining for proactive cybersecurity defense: A survey and new perspectives. IEEE Commun. Surv. Tutor. 2023, 25, 1748–1774.
- Schlette, D.; Caselli, M.; Pernul, G. A comparative study on cyber threat intelligence: The security incident response perspective. IEEE Commun. Surv. Tutor. 2021, 23, 2525–2556.
- Piplai, A.; Mittal, S.; Joshi, A.; Finin, T.; Holt, J.; Zak, R. Creating cybersecurity knowledge graphs from malware after action reports. IEEE Access 2020, 8, 211691–211703.
- Liu, K.; Wang, F.; Ding, Z.; Liang, S.; Yu, Z.; Zhou, Y. Recent progress of using knowledge graph for cybersecurity. Electronics 2022, 11, 2287.
- Ismail, M.; Alrabaee, S.; Choo, K.K.R.; Ali, L.; Harous, S. A comprehensive evaluation of machine learning algorithms for web application attack detection with knowledge graph integration. Mob. Netw. Appl. 2024, 29, 1008–1037.
- Zhang, K.; Liu, J. Review on the application of knowledge graph in cyber security assessment. IOP Conf. Ser. Mater. Sci. Eng. 2020, 768, 052103.
- Pipyros, K.; Thraskias, C.; Mitrou, L.; Gritzalis, D.; Apostolopoulos, T. A new strategy for improving cyber-attacks evaluation in the context of Tallinn Manual. Comput. Secur. 2018, 74, 371–383.
- Avdeeva, Z.; Gavrilov, M.; Lemtyuzhnikova, D.; Sharafiev, A. Methods for solving the problem of topic segmentation of texts based on knowledge graphs. J. Comput. Syst. Sci. Int. 2024, 63, 642–662.
- Liu, K.; Wang, F.; Ding, Z.; Liang, S.; Yu, Z.; Zhou, Y. A review of knowledge graph application scenarios in cyber security. arXiv 2022, arXiv:2204.04769.
- Zhao, Q.; Liu, J.; Sullivan, N.; Chang, K.; Spina, J.; Blasch, E.; Chen, G. Anomaly detection of unstructured big data via semantic analysis and dynamic knowledge graph construction. In Proceedings of the Signal Processing, Sensor/Information Fusion, and Target Recognition XXX; SPIE: Bellingham, WA, USA, 2021; Volume 11756, pp. 126–142.
- Zhao, X.; Jiang, R.; Han, Y.; Li, A.; Peng, Z. A survey on cybersecurity knowledge graph construction. Comput. Secur. 2024, 136, 103524.
- Jia, Y.; Qi, Y.; Shang, H.; Jiang, R.; Li, A. A practical approach to constructing a knowledge graph for cybersecurity. Engineering 2018, 4, 53–60.
- Han, X.; Wang, L. A novel document-level relation extraction method based on BERT and entity information. IEEE Access 2020, 8, 96912–96919.
- Mouiche, I.; Saad, S. Entity and relation extractions for threat intelligence knowledge graphs. Comput. Secur. 2025, 148, 104120.
- Du, L.; Xu, C. Knowledge graph construction research from multi-source vulnerability intelligence. In Cyber Security. CNCERT 2022; Springer Nature: Singapore, 2022; pp. 177–184.
- Li, Z.; Cheng, J.; Yin, Q.; Xia, A.; Yan, L.; Li, S. Knowledge Graph Construction of Network Security Domain Based on Bi-LSTM-GNN. In Proceedings of the 2024 2nd International Conference on Signal Processing and Intelligent Computing (SPIC), Guangzhou, China, 20–22 September 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 220–225.
- Sangher, K.S.; Singh, A.; Pandey, H.M. LSTM and BERT based transformers models for cyber threat intelligence for intent identification of social media platforms exploitation from darknet forums. Int. J. Inf. Technol. 2024, 16, 5277–5292.
- Hu, Y.; Zou, F.; Han, J.; Sun, X.; Wang, Y. LLM-TIKG: Threat intelligence knowledge graph construction utilizing large language model. Comput. Secur. 2024, 145, 103999.
- Zhang, Y.; Du, T.; Ma, Y.; Wang, X.; Xie, Y.; Yang, G.; Lu, Y.; Chang, E.C. AttacKG+: Boosting attack knowledge graph construction with large language models. arXiv 2024, arXiv:2405.04753.
- Huang, L.; Xiao, X. CTIKG: LLM-Powered Knowledge Graph Construction from Cyber Threat Intelligence. In Proceedings of the First Conference on Language Modeling, Philadelphia, PA, USA, 7–9 October 2024.
- Paul, S.; Alemi, F.; Macwan, R. LLM-Assisted Proactive Threat Intelligence for Automated Reasoning. arXiv 2025, arXiv:2504.00428.
- Wu, Z.; Tang, F.; Zhao, M.; Li, Y. KGV: Integrating large language models with knowledge graphs for cyber threat intelligence credibility assessment. arXiv 2024, arXiv:2408.08088.
- Brants, T. Natural Language Processing in Information Retrieval. Clinician 2003, 111, 1–13.
- Malik, V.; Sanjay, R.; Guha, S.K.; Hazarika, A.; Nigam, S.; Bhattacharya, A.; Modi, A. Semantic segmentation of legal documents via rhetorical roles. arXiv 2021, arXiv:2112.01836.
- Garcia-Garcia, A.; Orts-Escolano, S.; Oprea, S.; Villena-Martinez, V.; Garcia-Rodriguez, J. A review on deep learning techniques applied to semantic segmentation. arXiv 2017, arXiv:1704.06857.
- Edge, D.; Trinh, H.; Cheng, N.; Bradley, J.; Chao, A.; Mody, A.; Truitt, S.; Metropolitansky, D.; Ness, R.O.; Larson, J. From local to global: A Graph RAG approach to query-focused summarization. arXiv 2024, arXiv:2404.16130.
- Mishra, R.K.; Raj, H.; Urolagin, S.; Jothi, J.A.A.; Nawaz, N. Cluster-based knowledge graph and entity-relation representation on tourism economical sentiments. Appl. Sci. 2022, 12, 8105.
- Tang, J.; Liu, Y.; Lin, K.Y.; Li, L. Process bottlenecks identification and its root cause analysis using fusion-based clustering and knowledge graph. Adv. Eng. Inform. 2023, 55, 101862.
- Zhang, Y.; Cheung, Y.M. Graph-based dissimilarity measurement for cluster analysis of any-type-attributed data. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 6530–6544.
- Zhu, J.Z.; Jia, Y.T.; Xu, J.; Qiao, J.Z.; Cheng, X.Q. Modeling the correlations of relations for knowledge graph embedding. J. Comput. Sci. Technol. 2018, 33, 323–334.
- Sabet, M.; Pajoohan, M.; Moosavi, M.R. Representation learning of knowledge graphs with correlation-based methods. Inf. Sci. 2023, 641, 119043.
- Hempelmann, C.F.; Sakoglu, U.; Gurupur, V.P.; Jampana, S. An entropy-based evaluation method for knowledge bases of medical information systems. Expert Syst. Appl. 2016, 46, 262–273.
- Chen, J.; Xiao, S.; Zhang, P.; Luo, K.; Lian, D.; Liu, Z. BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv 2024, arXiv:2402.03216.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).