Article

KD-SecBERT: A Knowledge-Distilled Bidirectional Encoder Optimized for Open-Source Software Supply Chain Security in Smart Grid Applications

1  Guangxi Power Grid Company, Nanning 530299, China
2  Electric Power Research Institute, China Southern Power Grid, Guangzhou 510700, China
3  Guangdong Provincial Key Laboratory of Power System Network Security, Guangzhou 510623, China
4  School of Computer Science, Northwestern Polytechnical University, Xi’an 710072, China
5  ANWHA (Shanghai) Automation Engineering Co., Ltd., Shanghai 201104, China
*  Author to whom correspondence should be addressed.
Electronics 2026, 15(2), 345; https://doi.org/10.3390/electronics15020345
Submission received: 28 November 2025 / Revised: 5 January 2026 / Accepted: 8 January 2026 / Published: 13 January 2026

Abstract

With the acceleration of digital transformation, open-source software has become a fundamental component of modern smart grids and other critical infrastructures. However, the complex dependency structures of open-source ecosystems and the continuous emergence of vulnerabilities pose substantial challenges to software supply chain security. In power information networks and cyber–physical control systems, vulnerabilities in open-source components integrated into Supervisory Control and Data Acquisition (SCADA), Energy Management System (EMS), and Distribution Management System (DMS) platforms and distributed energy controllers may propagate along the supply chain, threatening system security and operational stability. In such application scenarios, large language models (LLMs) often suffer from limited semantic accuracy when handling domain-specific security terminology, as well as deployment inefficiencies that hinder their practical adoption in critical infrastructure environments. To address these issues, this paper proposes KD-SecBERT, a domain-specific semantic bidirectional encoder optimized through multi-level knowledge distillation for open-source software supply chain security in smart grid applications. The proposed framework constructs a hierarchical multi-teacher ensemble that integrates general language understanding, cybersecurity-domain knowledge, and code semantic analysis, together with a lightweight student architecture based on depthwise separable convolutions and multi-head self-attention. In addition, a dynamic, multi-dimensional distillation strategy is introduced to jointly perform layer-wise representation alignment, ensemble knowledge fusion, and task-oriented optimization under a progressive curriculum learning scheme. 
Extensive experiments conducted on a multi-source dataset comprising National Vulnerability Database (NVD) and Common Vulnerabilities and Exposures (CVE) entries, security-related GitHub code, and Open Web Application Security Project (OWASP) test cases show that KD-SecBERT achieves an accuracy of 91.3%, a recall of 90.6%, and an F1-score of 89.2% on vulnerability classification tasks, indicating strong robustness in recognizing both common and low-frequency security semantics. These results demonstrate that KD-SecBERT provides an effective and practical solution for semantic analysis and software supply chain risk assessment in smart grids and other critical-infrastructure environments.

1. Introduction

With the continuous acceleration of digital transformation, open-source software supply chain security has become a key component of cybersecurity frameworks in highly sensitive industries such as the electric power sector and finance [1,2,3]. The 2023 State of Open Source Security report, jointly released by Snyk and the Linux Foundation, shows that 93% of enterprise codebases contain open-source components with known vulnerabilities [4]. Events such as Log4Shell have revealed serious deficiencies of traditional detection approaches at the semantic association level: existing methods struggle to effectively parse the complex relationships among vulnerability descriptions, patch code, and dependency changes, leading to an average detection delay of up to 72 h for high-risk vulnerabilities [5]. For critical infrastructure operators, such delays imply prolonged exposure windows during which attackers can exploit unpatched vulnerabilities.
In modern smart grids and microgrids, open-source components are extensively integrated into Supervisory Control and Data Acquisition (SCADA)/Energy Management System (EMS)/Distribution Management System (DMS) platforms, substation gateways, distributed generation controllers, and energy storage management systems. Open-source software is also widely used in the middleware, databases, and web front ends that support power system control centers. Vulnerabilities in these components may propagate along the software supply chain, for example, from upstream libraries to vendor products and then to field devices, threatening the secure and stable operation of power information networks and cyber–physical control systems [1,2,6]. Therefore, efficient and accurate semantic understanding of vulnerability information, dependency metadata, and patch code is of great importance for risk assessment in software supply chains for power systems.
Beyond software-centric security analysis, a line of research has investigated the resilience of cyber–physical power and energy systems from a control-theoretic perspective. Event-triggered model predictive control enhanced by learning-based triggering policies has been proposed to improve control efficiency and robustness under uncertainty in cyber–physical systems [7]. To address adversarial threats, resilient self-triggered model predictive control schemes have been developed for discrete-time nonlinear systems under false data injection attacks, aiming to preserve system stability and operational safety [8]. In addition, secure and resilient control strategies against replay and deception attacks have been extensively studied to enhance the robustness of cyber–physical systems [9]. These control-oriented approaches focus on system-level resilience, providing insights that complement software supply chain semantic security analysis.
With the development of Transformer architectures [10], large language models (LLMs) have achieved breakthrough results in many domains due to their strong semantic understanding and knowledge transfer capabilities, exemplified by the Generative Pre-trained Transformer (GPT) series [11]. However, applying the powerful capabilities of LLMs directly to open-source software supply chain security remains a challenging open problem. In particular, current solutions based on pretrained LLMs face the following difficulties:
  • Domain adaptation limitations: General-purpose models have limited ability to capture Common Vulnerabilities and Exposures (CVE) vulnerability characteristics (e.g., Common Weakness Enumeration (CWE)-787 terminology) and security-specific semantic patterns. In the National Vulnerability Database (NVD), 78.4% of low-frequency professional terms (occurrence probability < 10⁻⁵) suffer from semantic drift, which leads to misinterpretation of vulnerability descriptions, inaccurate mapping between CVEs and affected components, and reduced reliability of risk scoring in security-critical contexts.
  • Deployment efficiency bottlenecks: Large models such as BERT-base [12] and GPT-style decoders introduce significant latency and resource consumption in Continuous Integration/Continuous Deployment (CI/CD) pipelines and real-time scanning systems. In representative settings, single-scan delays can reach 1.2 s per sample on commodity hardware, which cannot meet the requirements of near-real-time vulnerability detection in smart grid operational environments where detection tasks must coexist with other time-sensitive control and monitoring processes.
To overcome these challenges, we propose a domain-specific semantic bidirectional encoder model optimized via multi-level knowledge distillation. By constructing a multi-teacher knowledge fusion framework that integrates general language understanding, security-domain pretraining, and code semantic analysis, and by designing a lightweight backbone architecture combining depthwise separable convolutions with multi-head attention, we develop KD-SecBERT, a specialized encoder tailored to open-source software supply chain security. A dynamic multi-dimensional knowledge distillation strategy is further introduced to realize efficient transfer of domain knowledge under a progressive curriculum learning scheme. The motivation for improving inference efficiency in this work is not driven by the absolute frequency of vulnerability disclosures, but by the need to scale semantic analysis across large and heterogeneous software dependency graphs in enterprise and critical infrastructure environments. KD-SecBERT therefore serves as a semantic support component within software supply chain risk assessment workflows, rather than a real-time detection system.
Compared with existing methods, our method achieves state-of-the-art performance on the CVE vulnerability classification task, reaching an F1-score of 89.2% with consistently high accuracy and recall. These results suggest that KD-SecBERT can provide practical semantic support for software supply chain risk assessment under the deployment constraints of power-system cybersecurity applications. In particular, the proposed model is suitable for integration into security operation centers, continuous integration systems, and dedicated supply chain risk monitoring platforms in smart grid environments. The main contributions of this work are summarized as follows:
1.
We propose a multi-level knowledge distillation framework for open-source software supply chain security, which fuses general language understanding, cybersecurity-domain knowledge and code semantic analysis into a unified representation, enabling robust modeling of vulnerability-related semantics that simultaneously cover natural language, program code and dependency metadata.
2.
We design a lightweight semantic bidirectional encoder backbone combining depthwise separable convolution and multi-head self-attention, significantly reducing model parameters and inference latency while maintaining strong representation capability. The resulting KD-SecBERT model is well-suited for deployment under the compute and latency constraints characteristic of smart grid environments.
3.
We develop a dynamic, multi-dimensional distillation strategy that incorporates layer-wise alignment, multi-teacher ensemble, task-oriented distillation, and curriculum learning to enhance domain knowledge transfer from teacher to student. This strategy improves training stability and allows the student model to progressively acquire domain expertise.
4.
We conduct extensive experiments on a multi-source dataset constructed from NVD/CVE entries, GitHub (Available online: https://github.com/ (accessed on 20 December 2025)) security-related code, and Open Web Application Security Project (OWASP) test cases, demonstrating that KD-SecBERT achieves superior performance, with strong practical value for smart grid software supply chain security and other critical infrastructure scenarios.
The remainder of this paper is organized as follows. Section 2 reviews related work on open-source software supply chain security, large language models, and knowledge distillation. Section 3 presents the proposed multi-level knowledge distillation framework. Section 4 details the specialized semantic bidirectional encoder optimization method. Section 5 reports experimental settings, comparative results, and ablation studies. Finally, Section 6 concludes the paper and outlines future research directions.

2. Related Work

2.1. Open-Source Software Supply Chain Security

Research on open-source software supply chain security has evolved along multiple dimensions in recent years, including component discovery, vulnerability management, integrity assurance, and governance. Commercial tools such as Synopsys Black Duck [13] apply Software Bill of Materials (SBOM) techniques to perform component fingerprinting and vulnerability detection. These tools typically rely on signature-based matching and vulnerability databases such as NVD, and they have been widely adopted in enterprises to support license compliance and risk assessment. However, they still suffer from relatively high false-positive rates (around 15%) and update delays, which hinder timely risk mitigation in rapidly evolving codebases.
In parallel, open frameworks and standards have been proposed for end-to-end supply chain assurance. Google’s SLSA (Supply-chain Levels for Software Artifacts) framework [14] constructs a comprehensive integrity assurance system for the software supply chain, specifying requirements on build provenance, artifact signing, and policy enforcement. SLSA can improve tampering detection rates to 99.5% in controlled environments, but its complex deployment requirements and integration costs limit adoption in many industrial contexts, including electric power enterprises where legacy systems and proprietary workflows are prevalent.
For critical information infrastructure, Li et al. [3] analyze software supply chain risks and corresponding mitigation strategies, emphasizing the importance of systematic risk identification, vulnerability management and security governance. Their work highlights typical attack paths such as compromised upstream libraries, malicious updates, and poisoned build environments. From the perspective of power information networks, Yang et al. [1] investigate the evolution of security risks under cyber attacks using wargaming techniques, and Lin et al. [2] summarize the full-process impact of cyber attacks on typical power system scenarios. These studies underscore that fine-grained and timely analysis of vulnerability information is a prerequisite for effective risk control in smart grids.
SBOMs and dependency graphs provide explicit and reliable representations of software component relationships. However, they do not address semantic ambiguity in vulnerability descriptions, inconsistent naming conventions across vendors, or indirect textual references to affected components that frequently appear in advisories and patch narratives. KD-SecBERT is therefore designed to complement SBOM-based approaches by enabling automated semantic interpretation and contextual reasoning over heterogeneous security-related textual artifacts, rather than enumerating explicit dependency links.
Overall, while existing approaches can effectively support component identification and integrity assurance, they are insufficient for capturing the semantic relationships among vulnerability descriptions, patch code, and dependency changes that frequently arise in real-world supply chain security scenarios. This limitation motivates the adoption of semantic models and large language models to enable deeper understanding and reasoning over heterogeneous security artifacts.

2.2. Large Language Models

In recent years, large language models (LLMs) have undergone rapid development. Devlin et al. [12] proposed BERT, a deep bidirectional Transformer encoder that achieved 7.5–15.2% performance gains on the GLUE benchmark and established the masked language modeling pretraining paradigm. BERT demonstrated that large-scale self-supervised pretraining can provide powerful representations for a wide range of downstream tasks via fine-tuning. Subsequently, model scaling became a major trend. Brown et al. [11] introduced GPT-3 with 175 billion parameters, validating empirical scaling laws and demonstrating impressive zero-shot and few-shot generalization capabilities. However, GPT-3 and similar models exhibit large performance variability on specialized tasks; for example, their F1-scores on certain code-understanding tasks can be significantly lower than on pure natural-language benchmarks.
To bridge the gap between general LLMs and cybersecurity applications, Aghaei et al. [15] proposed SecureBERT, a domain-specific language model pretrained on cybersecurity corpora. After fine-tuning on CVE datasets, SecureBERT achieves an F1-score of 83.5% on vulnerability classification, 19.2 percentage points higher than general-purpose BERT, highlighting the importance of domain-specific pretraining. For code-related scenarios, Feng et al. [16] proposed CodeBERT, a pretrained model for programming and natural languages. By jointly modeling source code and natural language, CodeBERT achieves a Mean Reciprocal Rank (MRR) of up to 0.782 on code search tasks, and it has become a widely used backbone for code intelligence.
Despite these advances, existing models still face the following limitations in open-source software supply chain security. First, there is a lack of a comprehensive multimodal knowledge fusion framework that jointly models natural language descriptions, program code, and dependency metadata. Second, semantic fidelity for domain-specific terminology, such as rare vulnerability types, configuration flags, and versioning conventions, remains limited, leading to degraded performance when models are deployed on real-world vulnerability feeds. In addition, Zhang et al. [17] survey the applications, current status, and trends of ChatGPT in cybersecurity, pointing out promising opportunities for combining LLMs with threat intelligence, vulnerability analysis, and automated defense. However, they also stress that model compression and domain adaptation are crucial for practical deployment in resource-constrained and latency-sensitive environments, such as those found in power system operations [18].

2.3. Knowledge Distillation

Knowledge distillation has been widely studied as an effective approach for compressing large models into lightweight student models while retaining most of the teacher’s performance. Hinton et al. [19] first demonstrated that using softened teacher predictions (soft labels) as training targets can significantly improve student performance compared with training on hard labels alone. Building on this, Li et al. [20] introduced a curriculum temperature scheduling method that dynamically adjusts the distillation temperature during training to enhance stability and overall performance. Their results indicate that carefully designed distillation schedules can improve convergence and robustness.
In the field of code intelligence, Shi et al. [21] explore compressing pretrained models of code into models as small as 3 MB, using a combination of knowledge distillation and architecture optimization to maintain competitive performance. These works indicate that knowledge distillation can alleviate domain adaptation and deployment efficiency issues to some extent by transferring knowledge from large teacher models to specialized lightweight student models. However, most existing knowledge distillation approaches for language and code models focus on single-teacher settings or limited task supervision. For open-source software supply chain security, there is a need for multi-level knowledge distillation frameworks that integrate general language understanding, domain-specific security knowledge, and code semantics, while supporting multiple downstream tasks such as vulnerability detection and risk assessment.
The method proposed in this paper is inspired by the above research, but differs in three aspects. First, we explicitly combine multiple teacher models (general semantic, cybersecurity-domain, code semantic, and dependency-analysis experts) into a unified distillation framework. Second, we design a lightweight student architecture that is tailored to the latency and resource constraints of smart grid environments. Third, we incorporate curriculum learning and multi-task objectives into the distillation process to ensure that the resulting model is both compact and highly specialized for open-source software supply chain security.

3. Multi-Level Knowledge Distillation Framework

To address the challenges of applying large models to open-source software supply chain security, we propose a multi-level knowledge distillation framework and construct an efficient and accurate domain-specific semantic bidirectional encoder model, KD-SecBERT. The framework leverages (i) collaborative knowledge transfer from multiple teacher models, (ii) a finely optimized multi-dimensional distillation strategy, and (iii) a lightweight student architecture design to serve the semantic analysis needs of software supply chain security scenarios [22].
As shown in Figure 1, the framework consists of four key components: (1) a general semantic teacher capturing broad linguistic patterns and syntactic structures; (2) a cybersecurity-domain teacher encoding terminology, event structures and risk semantics from security texts; (3) a code semantic expert teacher providing structural and functional information about source code and patches; and (4) a dependency-analysis teacher modeling software component relationships and version compatibility. The outputs and intermediate representations of these teachers serve as rich supervision signals for the lightweight student encoder.
The framework is built on the following core principles:
1.
Multi-source knowledge fusion: Through hierarchical knowledge distillation, general language understanding is combined with domain-specific security knowledge to form a semantic representation system that balances generalization and specialization. For example, knowledge from a general language model such as BERT [12] is fused with that of a security-domain model such as SecureBERT [15] and a code model such as CodeBERT [16] to produce more accurate vulnerability representations.
2.
Lightweight and efficient design: A compact bidirectional encoder serves as the base architecture. Structural optimization and distilled refinement enable high-performance inference under limited computational resources. In particular, a hybrid architecture incorporating depthwise separable convolutions and multi-head self-attention reduces parameters and floating-point operations while preserving expressive power.
3.
Adaptive domain transfer: A flexible model transfer mechanism supports efficient adaptation from existing pretrained models to new domain applications, lowering the technical barrier for domain extension. A progressive curriculum learning strategy is employed, where the model first learns basic semantic representations from general corpora and then gradually acquires more complex security-domain knowledge.
4.
Multi-task collaborative learning: During knowledge distillation, multi-task learning objectives are introduced so that the model simultaneously adapts to multiple downstream tasks, enhancing both generality and task-specific discrimination. For instance, vulnerability detection and risk-level estimation tasks are jointly optimized to form a multi-objective learning problem, encouraging the shared encoder to capture features relevant to multiple aspects of software supply chain risk assessment.
By jointly enforcing these principles, the proposed framework produces a student model that not only compresses the size and latency of the teacher ensemble but also preserves and even enhances task performance in open-source software supply chain security.

4. Specialized Semantic Bidirectional Encoder Optimization Method

As illustrated in Figure 2, the proposed optimization method for the domain-specific semantic bidirectional encoder based on multi-level knowledge distillation consists of three main components: (1) a multi-teacher knowledge fusion system encompassing general language understanding, security-domain pretraining and code semantic analysis; (2) a lightweight backbone combining depthwise separable convolutions with multi-head attention as the student encoder; and (3) a dynamic, multi-dimensional distillation strategy for domain knowledge transfer.

4.1. Multi-Level Teacher Models

4.1.1. General Semantic Teacher

We adopt BERT-base [12], pretrained on large-scale general corpora, as the source of general language understanding. Through distillation from the general semantic teacher, the student model learns capabilities in long-text semantic modeling, syntactic structure representation, and generic concept association, which provide a solid foundation for downstream security tasks. In practice, we use both the final hidden states and selected intermediate-layer outputs of BERT-base as distillation targets.

4.1.2. Security-Domain Pretrained Teacher

A security-domain model, such as SecureBERT [15], or a model fine-tuned on security corpora from a general initialization, is used as the security-domain teacher. This teacher focuses on representing security texts such as vulnerability descriptions, security bulletins, and threat intelligence reports, capturing characteristic patterns in terminology, event structure, and risk semantics. For example, it can differentiate context-specific meanings of terms such as “remote code execution” or “privilege escalation”, which are critical for correct risk assessment.

4.1.3. Code Semantic Expert Teacher

We employ a pretrained model specialized in code representation learning, such as CodeBERT [16], as the code semantic teacher. This model provides an understanding of program structure, API call patterns, and code functionality. Information from abstract syntax trees (ASTs), data flow, and control flow is implicitly encoded in its representations, thereby enhancing the semantic modeling of vulnerability-related code snippets and patches. When distilling from this teacher, we feed code fragments and patches extracted from fix commits to guide the student encoder to capture semantic information that correlates with the presence or absence of vulnerabilities.

4.1.4. Dependency Analysis Expert Teacher

We further consider a teacher model specialized in modeling software component dependency relationships and version compatibility. This expert captures the semantics of package management metadata, such as package.json, pom.xml and requirements.txt, and provides structured representations of dependency graphs, which are crucial for supply chain risk analysis. Its outputs encode patterns such as vulnerable version ranges, transitive dependencies, and conflicting constraints, complementing the information extracted from natural language and code.

4.1.5. Multi-Teacher Ensemble Distillation

To fuse knowledge from multiple expert teachers, we design a teacher ensemble mechanism. Let z_i be the output logits from the i-th teacher model and α_i be its dynamic weight. The fused teacher output is computed as

z_ens = ∑_i α_i z_i,  with ∑_i α_i = 1.

The student is then trained to match z_ens using a temperature-scaled KL divergence, which provides richer and more robust supervisory signals, alleviating the bias of any single teacher model. In practice, the weights α_i are stage-wise adaptive rather than fixed constants: at each curriculum stage, we compute α_i from each teacher’s effectiveness on a held-out validation set (e.g., F1-score) and normalize them to satisfy ∑_i α_i = 1. This yields a stable yet adaptive fusion without introducing per-sample weighting overhead. Intuitively, teachers specializing in security semantics are emphasized for narrative-heavy CVE descriptions, while code-oriented teachers receive higher weights for commit-related inputs.
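As a minimal illustration of the fusion and distillation objective described above, the following NumPy sketch fuses per-teacher logits with validation-derived weights and computes the temperature-scaled KL loss. This is not the authors' implementation; the function names and the temperature value are hypothetical.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_teacher_logits(teacher_logits, val_f1):
    """Fuse per-teacher logits z_i with weights alpha_i derived from each
    teacher's held-out F1-score, normalized so that sum_i alpha_i = 1."""
    alpha = np.asarray(val_f1, dtype=float)
    alpha = alpha / alpha.sum()
    z_ens = sum(a * np.asarray(z, dtype=float)
                for a, z in zip(alpha, teacher_logits))
    return z_ens, alpha

def kd_kl_loss(student_logits, z_ens, T=2.0):
    """Temperature-scaled KL(teacher || student), averaged over the batch.
    The conventional T^2 factor rescales gradient magnitudes."""
    p_t = softmax(z_ens, T)
    log_p_s = np.log(softmax(student_logits, T))
    return float((p_t * (np.log(p_t) - log_p_s)).sum(axis=-1).mean() * T**2)
```

For example, with two teachers whose validation F1-scores are 0.85 and 0.80, the first teacher's logits receive a proportionally larger share of the fused target, and the loss is zero when the student exactly reproduces the ensemble.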

4.2. Lightweight Student Model Architecture

As illustrated in Figure 3, the lightweight student model is composed of a bidirectional encoder backbone and multi-granularity feature-extraction modules. Architecturally, it achieves a deep integration of “lightweight and efficient design” with “domain knowledge enhancement”.

4.2.1. Multi-Granularity Feature Extraction Layer

We design a multi-granularity feature extraction layer to capture security semantics at different granularities, ranging from subword/word units to local phrases and the global sequence context. This is motivated by the fact that vulnerability descriptions and code-related texts exhibit high variability in length, structure, and semantic density, and often contain heterogeneous elements such as identifiers, version patterns, and abbreviated security terms. Residual connections are introduced to enhance information flow and gradient propagation, and a gated fusion mechanism is adopted to integrate complementary signals from different granularities. Finally, an adaptive pooling operation is applied to ensure representation consistency for inputs of varying lengths.
Architecture details. Let the contextual token embeddings be denoted as H = {h_1, …, h_L} ∈ ℝ^{L×d}, where L is the input length and d is the hidden dimension. Multi-granularity representations are constructed via different aggregation strategies over H: (i) the subword/word-level representation directly uses H to preserve fine-grained lexical and contextual semantics; (ii) the phrase-level representation aggregates consecutive token spans using sliding-window mean pooling to capture short-range compositional patterns; (iii) the sequence-level representation summarizes the entire sequence through global pooling to provide high-level semantics. All granularity-specific representations are projected into a shared latent space and fused through a gated fusion mechanism:

F = ∑_g σ(W_g H_g) ⊙ H_g,

where H_g denotes the representation at granularity g, W_g is a learnable projection matrix, σ(·) produces gate values, and ⊙ denotes element-wise multiplication. Residual connections are applied to preserve the original contextual information and stabilize optimization. Finally, an adaptive pooling layer is applied to F to obtain a fixed-length representation, ensuring consistent input dimensions for downstream classification regardless of input length variations. This design enables KD-SecBERT to jointly model fine-grained token semantics and higher-level compositional structures in security-related texts.
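The gated fusion over the three granularity views can be sketched as follows. This is an illustrative NumPy reconstruction under stated assumptions (a window of 3 for phrase pooling, mean pooling for the sequence view, and randomly initialized gate projections); the actual hyperparameters are not specified in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 16, 8                                    # sequence length, hidden dim

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sliding_mean(H, window=3):
    """Phrase-level view: mean-pool each token's local window."""
    out = np.empty_like(H)
    for t in range(H.shape[0]):
        lo = max(0, t - window // 2)
        hi = min(H.shape[0], t + window // 2 + 1)
        out[t] = H[lo:hi].mean(axis=0)
    return out

def gated_fusion(H, W):
    """F = sum_g sigma(W_g H_g) ⊙ H_g over token-, phrase-, and
    sequence-level views of the contextual embeddings H (L x d)."""
    views = {
        "token": H,
        "phrase": sliding_mean(H),
        "sequence": np.repeat(H.mean(axis=0, keepdims=True), H.shape[0], axis=0),
    }
    F = np.zeros_like(H)
    for g, H_g in views.items():
        gate = sigmoid(H_g @ W[g])              # element-wise gate values
        F += gate * H_g                         # gated contribution of view g
    return F

W = {g: rng.normal(scale=0.1, size=(d, d)) for g in ("token", "phrase", "sequence")}
H = rng.normal(size=(L, d))
F = gated_fusion(H, W)                          # fused (L x d) features
```

An adaptive pooling step (e.g., mean over the L axis) would then reduce F to a fixed-length vector regardless of input length.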

4.2.2. Lightweight Bidirectional Encoder Backbone

The backbone adopts a hybrid architecture of depthwise separable convolutions and multi-head self-attention. Depthwise separable convolutions [23] significantly reduce parameter count and computation by factorizing standard convolution into depthwise and pointwise components, while multi-head attention maintains strong long-range dependency modeling capabilities. This design allows the student model to capture both local patterns (e.g., token-level collocations in vulnerability descriptions) and global dependencies (e.g., relationships between different parts of a complex CVE record). The architecture is configured as follows:
  • Number of layers: 4–6 Transformer-style layers, compared to 12 layers in the original BERT, reducing depth and computation.
  • Hidden dimension: 384–512 hidden units, 30–50% lower than the 768 dimension used in BERT, further shrinking model size.
  • Number of attention heads: 6–8 attention heads are used to balance representation capacity and computational complexity.
Although a lightweight backbone may be less expressive for extremely long sequences, the typical input lengths in our dataset are well covered by the configured maximum sequence length, making the adopted design a suitable capacity-efficiency trade-off for our task. On top of the encoder, a simple classification head consisting of a pooling layer and a feed-forward layer is used for vulnerability classification and related downstream tasks.
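The parameter savings from the depthwise separable factorization can be verified with a quick calculation. The sketch below counts weights for a standard 1-D convolution versus its depthwise-plus-pointwise decomposition at the backbone's hidden width; the specific width and kernel size are illustrative choices within the ranges given above.

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard 1-D convolution (bias omitted)."""
    return c_in * c_out * k

def ds_conv_params(c_in, c_out, k):
    """Depthwise separable convolution: a depthwise pass (c_in * k)
    followed by a 1x1 pointwise pass (c_in * c_out)."""
    return c_in * k + c_in * c_out

# Example at hidden width d = 384 with kernel size 3
std = conv_params(384, 384, 3)       # 442,368 weights
sep = ds_conv_params(384, 384, 3)    # 148,608 weights
ratio = sep / std                    # roughly one third of the standard cost
```

For kernel size k and equal channel counts, the ratio approaches 1/k + 1/c_out, so the savings grow with both the kernel size and the channel width.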

4.3. Multi-Dimensional Knowledge Distillation

The multi-dimensional knowledge distillation module also integrates structured domain knowledge, such as vulnerability taxonomies and software component relation graphs, into the representation process. Knowledge graph embedding techniques map entities and relations from professional knowledge bases into the model parameter space, providing auxiliary semantic signals for vulnerability understanding and supply chain risk modeling. During training, these embeddings are combined with textual representations via gating or attention mechanisms, allowing KD-SecBERT to leverage both unstructured and structured information.

4.3.1. Layer-Wise Alignment Distillation

We establish a layer-wise correspondence between teacher and student models and perform distillation on intermediate feature representations. An attention-based layer-mapping strategy allows a single student layer to align with one or more teacher layers. The layer-wise Kullback–Leibler (KL) divergence loss is defined as
$$\mathcal{L}_{\mathrm{distill}} = \sum_{l} \mathrm{KL}\!\left( f_T^{(l)} \,\Big\|\, f_S^{(\phi(l))} \right),$$
where $f_T^{(l)}$ and $f_S^{(\phi(l))}$ denote the $l$-th layer representation of the teacher and the mapped layer representation of the student, respectively, and $\phi(\cdot)$ is the layer mapping function. This alignment encourages the student to mimic not only the final outputs but also the intermediate reasoning patterns of the teachers.
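A sketch of this layer-wise loss in PyTorch is shown below. Treating each hidden vector as a distribution via a feature-wise softmax is a common convention but an assumption here, as the paper does not specify how the representations are normalized before the KL term.

```python
import torch
import torch.nn.functional as F

def layerwise_kl(teacher_states, student_states, mapping):
    """Layer-wise distillation loss: sum over mapped layers l of
    KL(f_T^(l) || f_S^(phi(l))). `mapping` encodes phi as
    {teacher_layer: student_layer}. Hidden vectors are turned into
    distributions with a feature-wise softmax (an assumption)."""
    loss = torch.zeros(())
    for l, phi_l in mapping.items():
        t_log = F.log_softmax(teacher_states[l], dim=-1)
        s_log = F.log_softmax(student_states[phi_l], dim=-1)
        # KL(teacher || student): the teacher distribution is the target
        loss = loss + F.kl_div(s_log, t_log, log_target=True,
                               reduction="batchmean")
    return loss

# Example: 12 teacher layers aligned onto 4 student layers
teacher = [torch.randn(2, 8, 384) for _ in range(12)]
student = [torch.randn(2, 8, 384) for _ in range(4)]
phi = {2: 0, 5: 1, 8: 2, 11: 3}
loss = layerwise_kl(teacher, student, phi)
```

The uniform mapping `phi` above (every third teacher layer) is one simple instance of the attention-based mapping strategy described in the text.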

4.3.2. Task-Oriented Distillation

In addition to mimicking teacher outputs, we incorporate downstream task objectives (e.g., vulnerability detection and risk assessment) into the distillation process, forming a multi-objective optimization problem. The overall loss function is defined as
$$\mathcal{L} = \lambda_d \mathcal{L}_{\mathrm{distill}} + \lambda_t \mathcal{L}_{\mathrm{task}} + \lambda_r \mathcal{L}_{\mathrm{reg}},$$
where $\mathcal{L}_{\mathrm{distill}}$ is the distillation loss (including layer-wise and ensemble components), $\mathcal{L}_{\mathrm{task}}$ is the supervised task loss, and $\mathcal{L}_{\mathrm{reg}}$ is a regularization term (e.g., weight decay). The coefficients $\lambda_d$, $\lambda_t$, and $\lambda_r$ control the trade-offs among distillation strength, task specialization, and model regularization. This formulation allows the student model to balance fidelity to the teachers with direct optimization for the target tasks.
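The combined objective is straightforward to express in code; the default coefficient values below are illustrative assumptions, not values reported by the paper.

```python
def total_loss(l_distill: float, l_task: float, l_reg: float,
               lam_d: float = 0.5, lam_t: float = 0.4,
               lam_r: float = 0.1) -> float:
    """Weighted multi-objective loss
    L = lambda_d * L_distill + lambda_t * L_task + lambda_r * L_reg.
    The default coefficients are illustrative only."""
    return lam_d * l_distill + lam_t * l_task + lam_r * l_reg
```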

4.3.3. Progressive Curriculum Learning

We adopt a curriculum learning strategy in the distillation process. The training is divided into three stages:
1. General semantic distillation: The student first learns basic language representations from large-scale general corpora (e.g., Wikipedia and news texts), aligning with the general semantic teacher. At this stage, the focus is on reproducing generic syntactic and semantic patterns.
2. Security knowledge transfer: The student then focuses on security-domain corpora such as the NVD vulnerability database, security advisories, and threat intelligence reports, distilling knowledge from the security and code semantic teachers. This stage enables the model to internalize domain-specific terminology, event structures, and code patterns associated with vulnerabilities.
3. Multi-task optimization: Finally, the student jointly optimizes vulnerability detection and risk assessment tasks on domain-specific downstream datasets, refining its representations under task-oriented supervision. The curriculum is gradually shifted towards task losses by increasing $\lambda_t$ and decreasing $\lambda_d$.
This progressive curriculum helps the student avoid suboptimal local minima in the early stages and accelerates convergence, as confirmed by our experiments. It also provides a practical recipe for adapting existing pretrained models to new security domains with limited labeled data.
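One possible stage-wise coefficient schedule for this curriculum can be sketched as follows. The concrete values are illustrative assumptions; the text specifies only the direction of the shift (increasing $\lambda_t$, decreasing $\lambda_d$) across the three stages.

```python
def stage_coefficients(stage: str) -> tuple[float, float]:
    """Return (lambda_d, lambda_t) for a curriculum stage. Values are
    illustrative: emphasis shifts from distillation towards task losses."""
    schedule = {
        "general_semantic":  (0.9, 0.1),  # stage 1: align with general teacher
        "security_transfer": (0.6, 0.4),  # stage 2: security/code teachers
        "multi_task":        (0.2, 0.8),  # stage 3: task-oriented fine-tuning
    }
    return schedule[stage]
```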
In this work, the term “dynamic” refers to the curriculum-driven adjustment of distillation emphasis across training stages, rather than to per-sample or per-iteration parameter updates. Specifically, different distillation objectives are emphasized at different stages of training as the student progressively acquires general semantic knowledge, domain-specific security semantics, and task-oriented discrimination capability. For training stability and reproducibility, the layer mapping function $\phi(\cdot)$ and the loss coefficients $(\lambda_d, \lambda_t)$ are kept fixed within each training stage and change only at stage boundaries.

5. Experiments

5.1. Experimental Setup

Figure 4 conceptually illustrates the detection process used in our experiments. In the testing phase, the trained KD-SecBERT model is applied to samples from a target repository. For each sample, the model performs inference and outputs a vulnerability detection result (e.g., vulnerability category or risk level). The detection pipeline can be integrated into CI/CD workflows or operated as a standalone scanning service.

5.1.1. Dataset Description

We systematically collected multi-source data related to open-source software supply chain security, yielding the following datasets:
  • Public vulnerability databases: NVD and CVE entries from 2020 to 2023, totaling approximately 50,000 records, including vulnerability identifiers, textual descriptions, CVSS scores and affected components. These entries serve as the primary source of natural-language security information.
  • Code sources: Security-related code snippets, patches and fix commits from GitHub open-source repositories, totaling about 12,000 samples, including code patches linked to specific CVEs. These samples provide the code-level context needed to learn patterns associated with vulnerable and patched code.
  • OWASP test set: 1200 attack samples representative of common web and application vulnerabilities, which are used to evaluate the robustness of the model on standard vulnerability types and to test its generalization beyond the training distribution.
All textual data are collected from publicly available sources, including NVD/CVE records, security advisories, and open-source repositories. Prior to training, duplicate entries and incomplete records are removed. Vulnerability descriptions and related texts are normalized by lowercasing and removing special characters, and tokenization is performed using the same WordPiece tokenizer as the teacher models to ensure vocabulary consistency. For code-related artifacts, comments and non-semantic tokens are filtered, while preserving function names, identifiers, and control-flow–relevant tokens.
The combined dataset covers both natural-language descriptions and code-level artifacts, reflecting realistic conditions encountered in software supply chain security analysis for critical infrastructures such as smart grids. In our experiments, the data are randomly split into training, validation, and test sets with a ratio of 8:1:1, ensuring that CVE identifiers in the test set are not seen during training.
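The text normalization step described above can be sketched as follows. The exact character whitelist is an assumption, chosen so that identifiers such as CVE numbers and version strings survive; WordPiece tokenization with the teacher vocabulary would follow this step.

```python
import re

def normalize_description(text: str) -> str:
    """Lowercase, strip special characters, and collapse whitespace,
    mirroring the preprocessing described for vulnerability texts.
    The retained character set is an illustrative assumption."""
    text = text.lower()
    # Keep letters, digits, whitespace, and '.', '-', '_', '/' so that
    # CVE ids, version strings, and paths remain intact.
    text = re.sub(r"[^a-z0-9\s\.\-_/]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize_description("Buffer Overflow in libFoo <= 1.2 (CVE-2023-1234)!"))
# → buffer overflow in libfoo 1.2 cve-2023-1234
```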

5.1.2. Experimental Environment

The experimental environment configuration is summarized in Table 1. The core computing resources include eight NVIDIA A100 (80 GB) GPUs, which are used to accelerate both teacher and student model training.
For fair comparison, all models are fine-tuned using the same training/validation splits and similar optimization settings. We adopt the AdamW optimizer with an initial learning rate of $2 \times 10^{-5}$ for BERT-based models and slightly higher learning rates for lighter models. Early stopping is applied based on the validation F1-score to prevent overfitting.
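The early-stopping criterion can be sketched as a small helper; the patience value is an illustrative assumption, as the paper does not report it.

```python
class EarlyStoppingF1:
    """Early stopping on validation F1-score, as applied to all
    fine-tuned models. The patience value is illustrative."""

    def __init__(self, patience: int = 3):
        self.patience = patience
        self.best = -1.0
        self.bad_epochs = 0

    def step(self, val_f1: float) -> bool:
        """Record one epoch's validation F1; return True when training
        should stop (no improvement for `patience` epochs)."""
        if val_f1 > self.best:
            self.best, self.bad_epochs = val_f1, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```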

5.2. Evaluation Metrics

We evaluate model performance using the following metrics:
  • Accuracy: The proportion of correctly classified samples among all samples, reflecting the overall classification correctness.
  • Recall: The proportion of true positive samples that are correctly identified, reflecting the model’s ability to capture vulnerability instances and avoid missed detections.
  • F1-score: The harmonic mean of precision and recall, reflecting the balanced performance of the model in recognizing positive and negative samples. This is the primary metric for vulnerability classification quality.
  • Number of parameters (M): The total number of trainable parameters in the model, measured in millions (M), which provides an approximate indicator of model size and memory footprint.
  • Inference latency (ms): The end-to-end time required for processing a single sample from input to output, measured in milliseconds. During testing, a fixed hardware environment and batch size of 1 are used to compute the average inference time per sample, which reflects the practicability of deployment in latency-sensitive environments.
In addition to these global metrics, we also conduct qualitative analysis of representative true positives and false positives/negatives to understand how the model behaves on different vulnerability types and how well it handles rare domain terminology.
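The single-sample latency measurement can be sketched as follows. The warm-up and run counts are our assumptions, and on GPU an explicit synchronization (e.g., `torch.cuda.synchronize()`) would be needed around the timers.

```python
import time

def mean_latency_ms(infer, sample, n_warmup=10, n_runs=100):
    """Average end-to-end latency per sample in milliseconds at batch
    size 1, matching the latency metric above. Warm-up iterations are
    excluded so caches, JIT, and allocators do not skew the average."""
    for _ in range(n_warmup):
        infer(sample)
    start = time.perf_counter()
    for _ in range(n_runs):
        infer(sample)
    return (time.perf_counter() - start) / n_runs * 1000.0
```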

5.3. Performance Comparison

To verify the effectiveness of the proposed KD-SecBERT method for open-source software vulnerability detection, we compare it with several existing models on the constructed multi-source dataset. The results are presented in Table 2.
From Table 2, we observe that the proposed KD-SecBERT achieves the best overall performance across multiple evaluation metrics. In addition to obtaining the highest F1-score of 89.2%, KD-SecBERT also achieves an accuracy of 91.6%, a precision of 89.5%, and a recall of 88.9%, indicating a well-balanced detection capability for software vulnerability classification. Meanwhile, the model reduces the number of parameters to 19.3 M and the inference latency to only 47 ms.
Compared with BERT-base, KD-SecBERT improves the F1-score by 4.5 percentage points, reduces model parameters by approximately 82.6%, and accelerates inference speed by approximately 6.8×. Compared with other strong baselines, including CodeBERT and SecureBERT, KD-SecBERT consistently achieves higher accuracy and recall while maintaining significantly lower computational overhead. These results demonstrate that the proposed multi-level knowledge distillation framework effectively transfers domain knowledge into a lightweight student model, enabling both high detection accuracy and deployment efficiency. Such characteristics are particularly important for real-world software supply chain security scenarios, including power system environments, where computational resources at substations and control centers are often constrained.
In addition to domain-specific baselines, we further compare KD-SecBERT with general-purpose compressed language models, namely DistilBERT [24] and TinyBERT [25], which represent widely adopted knowledge distillation and model compression techniques. All models are fine-tuned and evaluated on the same dataset under identical experimental settings. Although DistilBERT and TinyBERT substantially reduce parameters and inference latency compared to BERT-base, their performance remains inferior to KD-SecBERT across all evaluation metrics. This highlights that generic model compression alone is insufficient for complex security semantics, whereas domain-aware multi-teacher distillation enables KD-SecBERT to achieve superior accuracy–efficiency trade-offs.
Qualitative analysis further shows that KD-SecBERT is particularly effective when vulnerability descriptions contain low-frequency security terminology, uncommon abbreviations, or vendor-specific expressions. In such cases, general-purpose models often misclassify vulnerability categories or underestimate severity levels, while KD-SecBERT benefits from the complementary supervision of security-domain and code-semantic teachers, resulting in more robust semantic understanding.

5.4. Ablation Studies

To evaluate the contributions of individual components in KD-SecBERT, we conduct ablation experiments under different model configurations. The results are summarized in Table 3.
Based on the ablation results in Table 3, several key findings can be summarized as follows:
  • Multi-teacher distillation: Removing the multi-teacher distillation mechanism leads to the most significant performance degradation. The F1-score drops from 89.2% to 83.7%, accompanied by a notable decrease in recall from 88.9% to 82.1%. This indicates that multi-teacher distillation plays a critical role in improving the model’s ability to identify diverse and low-frequency vulnerability patterns. By aggregating complementary supervision from general-language, security-domain, and code-semantic teachers, the student model achieves more robust coverage of complex vulnerability descriptions spanning both natural-language and code contexts.
  • Multi-granularity features: Removing the multi-granularity feature module reduces the F1-score by 3.8 percentage points (from 89.2% to 85.4%) and lowers recall from 88.9% to 84.0%, while slightly reducing the parameter count by 1.5 M. These results suggest that although multi-granularity features introduce additional parameters, the enriched hierarchical representations substantially enhance the model’s ability to capture both local token-level cues and global semantic dependencies, which is particularly beneficial for long vulnerability descriptions and multi-sentence advisories.
  • Curriculum learning: Disabling curriculum learning results in a moderate performance decline, with the F1-score decreasing from 89.2% to 86.9% and recall dropping from 88.9% to 86.3%, while the parameter count remains unchanged. This indicates that curriculum learning primarily contributes to training stability and generalization rather than representational capacity. Empirically, curriculum learning helps the model avoid suboptimal local minima during early training and mitigates overfitting to high-frequency patterns, leading to improved robustness on less common vulnerability types.
Overall, the ablation results confirm that each component of the proposed KD-SecBERT framework contributes positively to the final performance, with multi-teacher distillation having the most pronounced impact on recall and overall detection capability. The full configuration achieves the best balance between accuracy, robustness, and efficiency. At the same time, the results suggest that, in deployment scenarios with stricter resource constraints, simplified variants (e.g., removing multi-granularity features) may offer a reasonable trade-off between performance and model complexity.

6. Conclusions

In this paper, we propose KD-SecBERT, a domain-specific semantic bidirectional encoder optimization method based on multi-level knowledge distillation, aimed at addressing the challenges faced by large language models in open-source software supply chain security applications, especially in critical infrastructures such as smart grids. By constructing a multi-teacher knowledge fusion framework that integrates general language understanding, security-domain pretraining, and code semantic analysis, and by designing a lightweight backbone with depthwise separable convolutions and multi-head attention, the proposed model significantly improves domain-specific term recognition while reducing deployment cost.
Extensive experiments demonstrate that, compared with existing methods, KD-SecBERT achieves an accuracy of 91.6%, a recall of 88.9%, and an F1-score of 89.2% on vulnerability classification tasks, indicating strong robustness in recognizing both common and low-frequency security semantics. These results show that KD-SecBERT provides an effective and practical solution for semantic analysis and software supply chain risk assessment, offering strong practical value for enhancing the cybersecurity posture of smart grids and other critical information infrastructures. The ablation studies further verify the effectiveness of the multi-teacher distillation, multi-granularity feature extraction, and curriculum learning components.
Although the experimental evaluation in this work focuses on smart grid software supply chain data, the core components of KD-SecBERT, such as CVE text analysis, code semantic modeling, and dependency-aware contextual representation, are inherently domain-agnostic. These components are shared across other critical infrastructures, including healthcare and finance, which similarly rely on open-source software ecosystems. Therefore, the proposed framework can be adapted to other domains given appropriate domain-specific data.
We note that KD-SecBERT is not designed to predict zero-day vulnerabilities themselves, which by definition lack labeled data at the time of disclosure. Instead, the proposed model focuses on efficient semantic representation and understanding of vulnerability descriptions once such information becomes available. By distilling knowledge from general language, security-domain, and code semantic teachers, the learned representations retain robustness to newly emerging vulnerability patterns.
There are several directions for future work. First, we plan to extend KD-SecBERT to broader multi-modal scenarios, including joint modeling of log data, network traffic, and configuration files, which would enable more comprehensive risk assessment at the cyber–physical system level. Second, we intend to integrate dynamic attack graphs and power-system operational states to further enhance the model’s ability to support real-time risk assessment and decision-making in cyber–physical power systems. Third, we will investigate continual learning and online distillation mechanisms so that KD-SecBERT can keep pace with the rapid evolution of vulnerability landscapes and open-source ecosystems without requiring complete retraining.

Author Contributions

Conceptualization, X.Z. and Q.L.; methodology, X.Z.; software, Q.L.; validation, Q.L., X.Z., and W.L.; formal analysis, Q.L.; investigation, Q.L. and B.Y.; resources, W.L.; data curation, Q.L.; writing—original draft preparation, Q.L.; writing—review and editing, X.Z., T.D., H.Z., and P.W.; visualization, Q.L. and B.Y.; supervision, X.Z. and T.D.; project administration, X.Z.; funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets used in this study are derived from publicly available sources, including NVD/CVE records and open-source software repositories. Preprocessing scripts, data schemas, and experimental configuration files are available from the authors upon reasonable request for academic research purposes.

Conflicts of Interest

Authors Qinman Li, Xixiang Zhang, and Weiming Liao were employed by Guangxi Power Grid Company. Authors Tao Dai and Hongliang Zheng were employed by the Electric Power Research Institute, China Southern Power Grid. Author Beiya Yang was employed by the School of Computer Science, Northwestern Polytechnical University. Author Pengfei Wang was employed by ANWHA (Shanghai) Automation Engineering Co., Ltd. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Yang, Y.; Wu, J.; Gao, Z.; Liang, Z.; Hong, C.; Li, P.; Zhang, Y. Wargaming techniques for risk evolution of power information network security. South. Power Grid Technol. 2025, 19, 51–61. (In Chinese) [Google Scholar]
  2. Lin, F.; Mei, Y.; Zhu, Y.; Chang, D.; Liu, R.; Guo, H. A survey on the full-process impact of cyber attacks on typical power system scenarios. South. Power Grid Technol. 2023, 17, 61–75. (In Chinese) [Google Scholar]
  3. Li, Z.; Guo, C.; Tang, W.; Yang, S.; Wang, X. Risk analysis and countermeasures for software supply chains in critical information infrastructure. Inf. Secur. Res. 2024, 10, 833–839. (In Chinese) [Google Scholar]
  4. Snyk. The Linux Foundation. State of Open Source Security. 2023. Available online: https://snyk.io/reports/open-source-security/ (accessed on 6 June 2023).
  5. Mend.io. Open Source Risk Report. 2023. Available online: https://www.mend.io/resources/research-reports/ (accessed on 1 July 2023).
  6. Zhou, Y.; Zhai, Q.; Xu, Z.; Wu, L.; Guan, X. Multi-stage adaptive stochastic–robust scheduling method with affine decision policies for hydrogen-based multi-energy microgrid. IEEE Trans. Smart Grid 2024, 15, 2738–2750. [Google Scholar] [CrossRef]
  7. Yoo, J.; Johansson, K.H. Event-Triggered Model Predictive Control With a Statistical Learning. IEEE Trans. Syst. Man Cybern. Syst. 2019, 51, 2571–2581. [Google Scholar] [CrossRef]
  8. He, N.; Ma, K.; Li, H.; Li, Y. Resilient Self-Triggered Model Predictive Control of Discrete-Time Nonlinear Cyberphysical Systems Against False Data Injection Attacks. IEEE Intell. Transp. Syst. Mag. 2024, 16, 23–36. [Google Scholar] [CrossRef]
  9. Mo, Y.; Sinopoli, B. Secure Control Against Replay Attacks. IEEE Trans. Autom. Control 2015, 60, 2096–2107. [Google Scholar]
  10. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc.: San Diego, CA, USA, 2017; pp. 5998–6008. [Google Scholar]
  11. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  12. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  13. Synopsys. Black Duck Software Composition Analysis. Available online: https://www.synopsys.com/ (accessed on 1 May 2023).
  14. OpenSSF. SLSA Framework v1.0. Available online: https://slsa.dev/spec/v1.0/ (accessed on 19 April 2023).
  15. Aghaei, E.; Niu, X.; Shadid, W.; Al-Shaer, E. SecureBERT: A domain-specific language model for cybersecurity. In Security and Privacy in Communication Networks, 18th EAI International Conference, SecureComm 2022, Virtual, 17–19 October 2022; Proceedings; Springer: Berlin/Heidelberg, Germany, 2023; pp. 39–56. [Google Scholar]
  16. Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D.; et al. CodeBERT: A pre-trained model for programming and natural languages. arXiv 2020, arXiv:2002.08155. [Google Scholar]
  17. Zhang, C.; Weng, F.; Zhang, Y. Applications, current status and trends of ChatGPT in cybersecurity. Inf. Secur. Res. 2023, 9, 500–509. (In Chinese) [Google Scholar]
  18. Zhou, Y.; Han, Z.; Zhai, Q.; Wu, L.; Cao, X.; Guan, X. A data-and-model-driven acceleration approach for large-scale network-constrained unit commitment problem with uncertainty. IEEE Trans. Sustain. Energy 2025, 16, 2299–2311. [Google Scholar] [CrossRef]
  19. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar] [CrossRef]
  20. Li, Z.; Li, X.; Yang, L.; Zhao, B.; Song, R.; Luo, L.; Li, J.; Yang, J. Curriculum temperature for knowledge distillation. Proc. AAAI Conf. Artif. Intell. 2023, 37, 1504–1512. [Google Scholar] [CrossRef]
  21. Shi, J.; Yang, Z.; Xu, B.; Kang, H.J.; Lo, D. Compressing pre-trained models of code into 3 MB. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering (ASE 2022); ACM: Rochester, MI, USA, 2022; p. 24. [Google Scholar]
  22. Zhao, J.; Zhai, Q.; Zhou, Y.; Cao, X.; Guan, X. Explicit Modeling of Multi-Energy Complementarity Mechanism for Uncertainty Mitigation: A Multi-Stage Robust Optimization Approach for Energy Management of Hydrogen-Based Microgrids. SSRN Working Paper. 2025. Available online: https://ssrn.com/abstract=5325853 (accessed on 20 December 2025).
  23. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  24. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
  25. Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. TinyBERT: Distilling BERT for natural language understanding. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2020. [Google Scholar]
Figure 1. Conceptual architecture of the KD-SecBERT domain-specific semantic bidirectional encoder model optimized via multi-level knowledge distillation.
Figure 2. Overall framework of KD-SecBERT for software supply chain semantic analysis.
Figure 3. Conceptual structure of the lightweight student model in KD-SecBERT.
Figure 4. Conceptual workflow of vulnerability detection using KD-SecBERT.
Table 1. Experimental environment configuration.

Component | Configuration
CPU | Intel Xeon Platinum 8375C (32 cores)
Memory | 1 TB DDR4
GPU | 8 × NVIDIA A100 (80 GB)
Operating system | Ubuntu 22.04 LTS
Framework | Python 3.10 + PyTorch 2.3.0
Table 2. Performance comparison with existing methods.

Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | Parameters (M) | Latency (ms)
BERT-base [12] | 88.9 | 85.3 | 84.1 | 84.7 | 110 | 320
DistilBERT [24] | 89.6 | 86.8 | 85.2 | 86.0 | 66 | 186
TinyBERT (4L) [25] | 88.5 | 85.9 | 84.9 | 85.4 | 14.5 | 40
CodeBERT [16] | 90.0 | 86.9 | 85.8 | 86.3 | 125 | 350
SecureBERT [15] | 90.8 | 88.2 | 87.6 | 87.9 | 134 | 380
KD-SecBERT (ours) | 91.6 | 89.5 | 88.9 | 89.2 | 19.3 | 47
Table 3. Ablation study of KD-SecBERT components.

Model Variant | Accuracy (%) | Recall (%) | F1-Score (%) | Parameters (M)
Full model | 91.6 | 88.9 | 89.2 | 19.3
Without multi-teacher distillation | 87.4 | 82.1 | 83.7 | 19.1
Without multi-granularity features | 88.6 | 84.0 | 85.4 | 17.8
Without curriculum learning | 89.9 | 86.3 | 86.9 | 19.3
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
