Integrating Lightweight Transformers for Cross-Project Bug Severity Classification: An Applied AI Approach in Software Engineering

Zhu, Liangliang; Wiangsamut, Samruan; Polpinij, Jantima

doi:10.3390/app16126026

Open Access Article

by Liangliang Zhu, Samruan Wiangsamut and Jantima Polpinij *

Department of Computer Science, Faculty of Informatics, Mahasarakham University, Kantarawichai District, Maha Sarakham 44150, Thailand

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(12), 6026; https://doi.org/10.3390/app16126026 (registering DOI)

Submission received: 4 May 2026 / Revised: 23 May 2026 / Accepted: 12 June 2026 / Published: 14 June 2026

(This article belongs to the Special Issue Applied Artificial Intelligence and Software Engineering)

Featured Application

The proposed approach can be used in automated bug triage systems in software engineering, supporting severity prediction across projects where labeled data are limited, while offering a favorable trade-off between predictive performance and computational efficiency.

Abstract

Bug severity classification is an important task in software maintenance because it supports bug triage and resource allocation. However, newly created or evolving projects often lack sufficient labeled data, making cross-project severity prediction challenging due to domain shift and class imbalance. In this paper, we investigate cross-project bug severity classification using lightweight transformer models under practical deployment constraints. Specifically, DistilBERT and TinyBERT are employed and evaluated within a unified cross-project learning framework. Experiments are conducted on large-scale Mozilla bug repositories under both single-source and multi-source transfer settings. Macro-averaged F1 is used as the primary evaluation metric to ensure balanced assessment across severity levels. The results indicate that cross-project performance is strongly influenced by source–target pairing, reflecting the impact of domain shift. Multi-source training generally improves performance across several transfer scenarios, particularly for minority severity classes, although the improvements remain moderate. DistilBERT achieves higher overall performance, whereas TinyBERT shows comparable trends with only a small reduction in Macro-F1, suggesting a favorable trade-off between predictive performance and model efficiency. These findings suggest that lightweight transformer models can support practical bug triage processes by providing relatively consistent and computationally efficient severity predictions across projects, particularly in environments with limited labeled data and computational resources.

Keywords: bug severity classification; cross-project learning; lightweight transformer models; software maintenance; applied artificial intelligence; software engineering; bug triage

1. Introduction

Software maintenance is one of the most important and costly phases of the software lifecycle, particularly in large-scale systems that continuously evolve over time [1,2]. During maintenance, developers must handle a large number of bug reports submitted through issue tracking platforms such as Bugzilla, JIRA, and GitHub Issues. These reports contain both structured metadata and unstructured textual descriptions that provide useful information for software analysis tasks [3,4].

Among maintenance activities, bug severity classification plays an important role because severity levels influence resource allocation, scheduling, and release planning [1,4,5,6]. However, manually assigning severity labels is time-consuming and often inconsistent, especially when repositories contain thousands of reports. As a result, many studies have explored automated approaches for bug severity classification using machine learning and artificial intelligence techniques [4,5,6,7,8,9,10].

Early studies mainly relied on traditional machine learning models with statistical text representations such as Bag-of-Words and TF-IDF [7,8]. Later research introduced distributed word embeddings and deep learning models, including Convolutional Neural Networks and recurrent architectures, to capture richer semantic information from bug descriptions [4]. More recently, transformer-based language models have shown promising results in several software engineering tasks, including bug severity prediction [9,10].

Despite these advances, most existing studies evaluate models using bug reports collected from the same software project [11,12]. Although this setting is useful for controlled experiments, it may not fully reflect real software maintenance environments. In practice, many projects contain only limited labeled data, making it necessary to transfer models across repositories [13,14]. However, differences in vocabulary, reporting style, software structure, and severity distributions can substantially reduce classification performance when models are applied to unseen projects [15,16].

Transformer-based models are also affected by this problem. Models fine-tuned on one repository may not generalize well to another repository with different characteristics [17,18,19]. For this reason, domain adaptation techniques have attracted increasing attention in related software engineering tasks, particularly cross-project defect prediction and bug triage [15,16]. These approaches aim to reduce distributional differences between source and target domains by learning more transferable feature representations.

However, only a few studies have systematically investigated transformer-based models together with domain adaptation techniques for cross-project bug severity classification [20,21]. Existing work often focuses primarily on predictive performance while giving less attention to transfer stability and computational efficiency in practical deployment scenarios. In addition, empirical evaluations using heterogeneous real-world software repositories remain relatively limited.

To address these issues, this study investigates lightweight transformer-based models for cross-project bug severity classification using Mozilla bug repositories. The study examines both single-source and multi-source transfer settings and evaluates whether representation-level domain adaptation can improve classification performance across projects with different reporting characteristics. In addition to predictive performance, the study also considers computational efficiency to better reflect practical software maintenance conditions.

Rather than proposing a new transformer architecture, this work provides an empirical analysis of how lightweight transformer models behave under cross-project transfer settings. The goal is to better understand their generalization capability, transfer stability, and practical applicability across heterogeneous software repositories.

Based on these objectives, the study investigates the following research questions:

RQ1: How effectively can lightweight transformer models perform bug severity classification under single-source transfer settings?

RQ2: Does multi-source training improve cross-project classification performance compared with single-source learning?

RQ3: To what extent can representation-level domain adaptation reduce the impact of distributional differences across software projects?

RQ4: How do lightweight transformer models balance predictive performance and computational efficiency under practical software maintenance conditions?

Collectively, these research questions aim to provide a clearer empirical understanding of transformer-based bug severity classification in cross-project software maintenance environments.

The remainder of this paper is organized as follows. Section 2 introduces the datasets, and Section 3 presents the cross-project experimental design. Section 4 describes the proposed method. Section 5 presents the experimental settings and evaluation results, followed by discussion of the findings. Finally, Section 6 concludes the study.

2. Dataset Description

To answer the research questions, this study uses an open-source Mozilla bug report dataset that has been widely used in previous studies on automated bug severity classification [5,6]. The dataset combines bug reports from several Mozilla-related projects, including Core, Firefox, Thunderbird, and Bugzilla. Each bug report contains both structured metadata and unstructured textual content.

In this study, the textual input is derived from the summary field (or title), which provides a short description of the reported defect. Previous studies have shown that bug report summaries contain useful information for severity prediction tasks [5,7,8].

The decision to use only the summary/title field is motivated by both practical and methodological considerations. In real software maintenance environments, summaries are usually available earlier and are generally more concise and standardized than full bug descriptions. Prior studies on bug severity classification and bug triage have also used summary-based input because titles often contain the core defect information while introducing less noisy and less redundant text than full descriptions [5,7,8].

By contrast, full bug descriptions may include lengthy discussions, stack traces, environmental settings, reproduction steps, or developer interactions that are not always directly related to severity prediction. Such information may increase textual variability across projects. In addition, shorter textual input reduces sequence length and computational cost during model training, which is beneficial for lightweight transformer models.

Nevertheless, restricting the input to summaries may reduce the amount of contextual information available to the classifier. Full bug descriptions may still contain additional severity-related cues in some cases. Therefore, this study focuses on title-based severity classification under transfer settings rather than attempting to maximize predictive performance using all available textual fields. Future work may further compare title-only and combined title-and-description representations.

Bug severities are categorized into five levels: “trivial”, “minor”, “major”, “critical”, and “blocker”, following Mozilla’s standard severity scale. These labels represent different levels of impact on system functionality, ranging from minor cosmetic issues to severe defects that substantially affect system operation.

The complete dataset contains 49,354 bug reports. Unlike some previous studies that report the Mozilla dataset only at an aggregated level, this study explicitly parses project identifiers from the original Bugzilla records to compute per-project and per-severity statistics. Therefore, the distributions reported in Table 1 reflect the actual data used in all experiments.

Firefox and Core contribute the largest number of bug reports because both projects are relatively large and complex. Thunderbird contains a moderate number of reports, while Bugzilla contributes comparatively fewer instances. The distribution of bug reports across the four projects is summarized in Table 1.

From the preprocessing perspective, this study adopts a moderate cleaning strategy rather than aggressively removing all potentially noisy textual elements. Text is converted to lowercase, and punctuation symbols are removed to reduce unnecessary variability. At the same time, domain-specific terms and informative tokens are retained because previous software engineering studies have suggested that excessive preprocessing may remove useful semantic information.

No re-sampling techniques are applied to address class imbalance, because these methods may distort the original severity distribution. Instead, imbalance is handled through stratified data partitioning and class-sensitive evaluation metrics, particularly weighted F1-score and macro-averaged F1-score. This approach supports balanced evaluation across both majority and minority severity classes while preserving the original class distribution.

Although all datasets originate from the Mozilla Bugzilla ecosystem, the projects represent independently developed software systems with different reporting styles, development practices, and severity distributions. These differences create meaningful transfer scenarios for empirical evaluation across projects.

The Mozilla repositories are therefore suitable for evaluating transfer performance under heterogeneous reporting conditions, which are commonly discussed challenges in previous cross-project software engineering studies [15,16].

Notably, the Bugzilla project differs from the other repositories in both severity distribution and reporting style, which makes transfer learning more challenging.

As shown in Table 1, Bugzilla contains substantially fewer critical and blocker reports than the other projects. This imbalance increases the difficulty of learning stable decision boundaries for high-severity classes, particularly when source and target projects have different severity distributions.

The observed differences in severity distribution and reporting characteristics also suggest a distribution mismatch between projects, which may contribute to lower transfer performance. As a result, Bugzilla becomes a more difficult target domain compared with the other repositories.

Although explicit measures such as vocabulary divergence or embedding-space distance are not computed in this study, the observed differences across projects still provide practical evidence of domain variation. Future work may incorporate more explicit quantitative measures of domain divergence to further investigate these observations. Figure 1 shows an example of a bug report.

3. Cross-Project Experimental Design

To evaluate bug severity classification across different software projects, this study adopts a cross-project experimental design. The model is trained using bug reports from one or multiple source projects and evaluated on a different target project. The experiments use Mozilla Bugzilla repositories collected from four projects: Core, Firefox, Thunderbird, and Bugzilla.

Each project exhibits different reporting styles, vocabulary usage patterns, and severity label distributions. This setting reflects practical software maintenance scenarios in which a classifier trained on a data-rich project may need to be transferred to another project with limited labeled data or different reporting characteristics.

To examine model performance under different transfer conditions, two evaluation protocols are used.

The first protocol follows a single-source to single-target transfer setting. For each project pair (S → T), where S ≠ T, the model is trained using the training set of source project S and evaluated on the held-out test set of target project T. Using four projects results in twelve directed transfer settings.

The transfer process is directional rather than symmetrical. In other words, transferring from project A to project B may produce different results from transferring from project B to project A. For example, a model trained on Firefox and evaluated on Bugzilla may behave differently from a model trained on Bugzilla and evaluated on Firefox. These differences are associated with variations in dataset size, reporting style, vocabulary characteristics, and severity distributions across projects.

The second protocol follows a multi-source to single-target transfer setting. For a target project T, the model is trained using the combined data from the remaining three projects, P \ {T}, where P denotes the set of all four projects, and evaluated on T. This results in four leave-one-project-out experimental settings.

This protocol allows the study to examine whether combining multiple source projects improves transfer performance across different repositories. Exposure to more diverse training data may also help reduce overfitting to project-specific vocabulary patterns.
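The two protocols can be enumerated mechanically. The following is a minimal sketch, assuming only the four project names given above; the function names are hypothetical and are not taken from the authors' implementation.

```python
from itertools import permutations

PROJECTS = ["Core", "Firefox", "Thunderbird", "Bugzilla"]


def single_source_settings(projects):
    """All directed (source, target) pairs with source != target."""
    return list(permutations(projects, 2))


def multi_source_settings(projects):
    """Leave-one-project-out: train on every other project, test on the target."""
    return [([p for p in projects if p != target], target) for target in projects]


single = single_source_settings(PROJECTS)
multi = multi_source_settings(PROJECTS)
print(len(single), "single-source settings;", len(multi), "multi-source settings")
```

Because the pairs are ordered, (Firefox, Bugzilla) and (Bugzilla, Firefox) are counted as separate settings, which matches the directional nature of the transfer.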

To reduce evaluation bias caused by class imbalance, each project is divided into training and testing sets using stratified sampling based on severity labels. This process preserves the relative class distributions across all partitions. An 80:20 train–test split is applied consistently across all experiments to support direct model comparison.
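A stratified 80:20 split can be sketched as follows. This is an illustrative standard-library version under stated assumptions: the label counts are made up for the example and do not reflect the Mozilla distribution, and the per-class rounding rule is one possible choice. In practice, a library routine such as scikit-learn's `train_test_split` with its `stratify` argument is a common alternative.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class at about the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels, invented for illustration only.
labels = (["major"] * 50 + ["critical"] * 30 + ["minor"] * 10
          + ["trivial"] * 5 + ["blocker"] * 5)
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), dict(test_counts))
```

Each class contributes roughly 20% of its reports to the test set, so rare classes such as blocker remain represented in both partitions.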

No re-sampling methods, such as random over-sampling or under-sampling, are applied because these techniques may distort the original severity distribution. Instead, imbalance is handled through class-sensitive training and evaluation procedures. During training, class weighting is used to reduce the dominance of majority classes. During evaluation, both minority-class and overall classification performance are considered.

Model performance is primarily evaluated using macro-averaged F1-score because it assigns equal importance to all severity classes, including minority categories such as blocker and trivial. Weighted F1-score and accuracy are also reported as complementary metrics. Weighted F1-score reflects performance under the original class distribution, while accuracy provides an overall performance indication.
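The difference between the two averaging schemes can be illustrated with a small example. The sketch below uses invented labels and a hypothetical classifier output; it is meant only to show why Macro-F1 is more sensitive to minority classes than weighted F1.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, classes):
    """F1 per class; a class with no true positives receives F1 = 0."""
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1: every class counts equally."""
    f1 = per_class_f1(y_true, y_pred, classes)
    return sum(f1.values()) / len(classes)


def weighted_f1(y_true, y_pred, classes):
    """Per-class F1 averaged with weights proportional to class support."""
    f1 = per_class_f1(y_true, y_pred, classes)
    support = Counter(y_true)
    return sum(f1[c] * support[c] for c in classes) / len(y_true)


# Toy data: 8 "major" and 2 "blocker" reports; one blocker is missed.
classes = ["major", "blocker"]
y_true = ["major"] * 8 + ["blocker"] * 2
y_pred = ["major"] * 8 + ["major", "blocker"]
print(round(macro_f1(y_true, y_pred, classes), 3),
      round(weighted_f1(y_true, y_pred, classes), 3))
```

In this toy case the missed blocker report lowers Macro-F1 more than weighted F1, because the minority class carries half of the macro average but only 20% of the weighted one.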

In addition, per-class precision, recall, and F1-score are reported to examine whether performance improvements are distributed across different severity categories or concentrated mainly on dominant classes such as major and critical. This analysis is important because rare but high-severity bugs may still have substantial operational impact in software maintenance environments.

For consistent comparison, results across multiple random seeds are reported as mean ± standard deviation using stratified data splits for all experimental settings. Statistical significance between transformer-based domain adaptation models and benchmark methods is evaluated using the Wilcoxon signed-rank test.
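As an illustration of the paired test, the sketch below computes the Wilcoxon signed-rank sums for two sets of paired scores. The scores are placeholder integers invented for the example and are not results from this study; the p-value computation is omitted, and a statistical library such as SciPy (`scipy.stats.wilcoxon`) would normally be used for it.

```python
def wilcoxon_signed_rank_sums(a, b):
    """Return (W_plus, W_minus) for paired samples a and b.

    Zero differences are dropped, and tied absolute differences share
    their average rank.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    ordered = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(ordered):
        end = pos
        current = abs(diffs[ordered[pos]])
        while end + 1 < len(ordered) and abs(diffs[ordered[end + 1]]) == current:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # ranks are 1-based
        for k in range(pos, end + 1):
            ranks[ordered[k]] = avg_rank
        pos = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Placeholder paired scores (e.g., Macro-F1 in percent) for six settings.
model_a = [62, 58, 71, 66, 60, 64]
model_b = [60, 59, 67, 66, 57, 61]
w_plus, w_minus = wilcoxon_signed_rank_sums(model_a, model_b)
print(w_plus, w_minus)
```

The smaller of the two rank sums is the usual test statistic; a small value indicates that one method is ranked higher in most paired settings.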

The experimental design is also used to examine transfer behavior across projects rather than serving only as a model comparison framework. Because the Mozilla projects differ in project size and severity distribution, transfer results are interpreted together with descriptive dataset statistics to provide additional insight into challenging transfer settings.

Both single-source and multi-source training configurations are investigated throughout the experiments. All train–test transfer settings are summarized in Table 2.

4. Research Methodology

This section describes the experimental framework used for bug severity classification across different software projects. The proposed approach combines transformer-based representation learning with domain adaptation techniques to improve transfer performance between projects.

The objective is to train a severity classifier using one or more source projects and evaluate its performance on a different target project with limited labeled data. The overall methodology consists of four main steps: text preprocessing, transformer-based representation learning, severity classification, and domain alignment.

4.1. Text Pre-Processing

Each bug report is represented using its summary field (or title), which provides a brief description of the reported defect. Compared with full bug descriptions, summaries contain less contextual detail but are generally more concise and standardized [22,23]. This setting reflects early-stage bug triage scenarios in which detailed diagnostic information may not yet be available.

Using shorter textual input also reduces sequence length and computational cost during model training. This design choice is consistent with recent studies that have examined the application of transformer-based language models in software engineering tasks using bug-report text and other software artifacts [22,23]. Therefore, the study focuses on title-based representations throughout all experiments.

Preprocessing is intentionally kept relatively simple to preserve potentially useful technical information for severity prediction. In all experiments, text is converted to lowercase, punctuation symbols that do not contribute semantic meaning are removed, and redundant whitespace is collapsed. At the same time, software-specific tokens such as error messages, function names, and version information are retained [24].
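The cleaning steps can be sketched as below. The exact set of retained punctuation is not specified in the text, so the character set kept here (dots, colons, parentheses, slashes, hyphens, and underscores, which often appear in identifiers and version strings) is an assumption made for illustration.

```python
import re

# Assumed rule: drop every character that is not a word character,
# whitespace, or one of . : ( ) / -
_DROP = re.compile(r"[^\w\s.:()/\-]")


def normalize_summary(text):
    """Lowercase, remove non-informative punctuation, collapse whitespace."""
    text = _DROP.sub(" ", text.lower())
    # Strip sentence-style punctuation left at token edges, e.g. "again."
    tokens = [tok.strip(".:-/") for tok in text.split()]
    return " ".join(tok for tok in tokens if tok)


print(normalize_summary("Crash in  nsDocShell::LoadURI() on Firefox 3.6.2!!!"))
```

With this rule, identifiers such as `nsdocshell::loaduri()` and version strings such as `3.6.2` survive normalization, while trailing exclamation marks and commas are removed.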

After normalization, each bug report text x is tokenized using the tokenizer associated with the selected transformer model, resulting in a sequence of subword tokens [9,24]:

t = [t_{1}, t_{2}, \dots, t_{L}]

(1)

where L denotes the sequence length. The tokens are subsequently converted into input IDs and attention masks. Padding and truncation are applied to ensure a fixed maximum sequence length.
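Padding and truncation to a fixed length can be illustrated independently of any particular tokenizer. The sketch below uses hypothetical token IDs and a padding ID of 0; real tokenizers also handle special tokens during truncation, which this simplified version does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_length."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding that the encoder should ignore.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# Hypothetical IDs for a four-token summary, padded to length 6.
ids, mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
print(ids, mask)
```

With the Hugging Face Transformers library, the equivalent behavior is typically requested by calling the tokenizer with `padding="max_length"`, `truncation=True`, and `max_length=128`.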

4.2. Transformer-Based Representation Learning

The tokenized input is mapped to contextual embeddings using a pre-trained transformer encoder f_θ(⋅) [9,24]. Given an input sequence x, the encoder returns a sequence of token-level representations [24]:

H = f_{θ} (x) \in ℝ^{L \times d}

(2)

where d denotes the hidden dimension of the transformer [25]. To encode each bug report as a fixed-length representation for classification, a sentence-level embedding h ∈ ℝ^d can be obtained by aggregating token-level embeddings (e.g., mean pooling over valid tokens) [25]:

h = \frac{1}{L} \sum_{i = 1}^{L} H_{i}

(3)

This representation serves as a shared latent feature space for severity classification and domain alignment across bug reports from different software projects.

In this study, the sentence representation of each bug report is obtained using the [CLS] token embedding from the final hidden layer of the transformer encoder. The [CLS] representation is widely used in transformer-based classification tasks because it captures the overall semantic information of the input sequence and is directly optimized during model fine-tuning.

Alternative approaches, such as mean pooling over token embeddings, have also been explored in previous studies. However, the [CLS] representation is adopted in this work to maintain consistency with standard BERT-based classification architectures and to support comparable behavior across DistilBERT and TinyBERT models.

Preliminary experiments comparing [CLS] and mean pooling representations showed that the [CLS]-based representation produced slightly more stable performance across several transfer settings. Therefore, the [CLS] representation is used consistently throughout all experiments.
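The two sentence representations discussed above can be contrasted on a toy example. The sketch below assumes BERT-style inputs in which the first position holds the [CLS] token; the hidden states are invented two-dimensional vectors used only for illustration.

```python
def cls_embedding(hidden_states):
    """Use the first token's vector ([CLS] in BERT-style inputs)."""
    return list(hidden_states[0])


def masked_mean_pooling(hidden_states, attention_mask):
    """Average token vectors where attention_mask == 1, ignoring padding."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    n_valid = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            n_valid += 1
            for j in range(dim):
                total[j] += vector[j]
    if n_valid == 0:
        raise ValueError("attention_mask has no valid tokens")
    return [value / n_valid for value in total]


# Four positions with d = 2; the last position is padding.
H = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
print(cls_embedding(H), masked_mean_pooling(H, mask))
```

Mean pooling depends on every valid token, whereas the [CLS] representation relies on a single position that is trained to summarize the sequence during fine-tuning.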

The experiments are conducted using two lightweight transformer models, DistilBERT and TinyBERT, which are fine-tuned for the bug severity classification task. The maximum sequence length is set to 128 tokens. The models are trained using a batch size of 16 and a learning rate of 2 × 10⁻⁵ with the Adam optimizer, following commonly used transformer fine-tuning settings.

For both models, the [CLS] token embedding from the final hidden layer is used as the input to the classification layer. This design maintains a consistent sentence representation across all experiments and follows standard BERT-based classification frameworks.

4.3. Severity Classification Model

As described in Section 4.2, the sentence-level representation is obtained from the [CLS] token embedding of the final hidden layer of the transformer encoder.

In this study, lightweight transformer encoders are used as backbone models for bug severity classification. Compact transformer architectures are selected because they require lower computational cost and faster inference time, which are important considerations in software engineering environments. A more detailed analysis of deployment efficiency is presented in Section 5.3.6 [26].

DistilBERT is selected as the primary backbone model because it provides a smaller model size while maintaining competitive performance relative to the original BERT architecture [27]. Despite its reduced size, DistilBERT still preserves bidirectional contextual representation learning, making it suitable for text classification tasks involving bug reports.

To further examine model behavior across compact transformer architectures, TinyBERT [28] is also included under the same experimental settings. This comparison allows the study to evaluate differences in model capacity and transfer performance between lightweight transformer variants.

A fixed pre-trained transformer encoder is then used to process the preprocessed bug report text x and generate a sequence of contextualized token embeddings:

H = [h_{1}, h_{2}, \dots, h_{L}]

(4)

where h_i ∈ ℝ^d denotes the embedding of the i-th token, and d is the hidden dimension of the encoder.

For severity prediction, a fixed-length representation is generated for each input bug report. Specifically, the sentence representation is obtained from the [CLS] token embedding of the final hidden layer of the transformer encoder.

The [CLS] embedding captures the overall semantic information of the input sequence and is used as the input to the severity classification layer.

The severity classifier consists of a lightweight linear layer followed by a softmax activation:

z = W h_{CLS} + b

(5)

p(y = k \mid x) = \frac{\exp(z_{k})}{\sum_{j=1}^{K} \exp(z_{j})}

(6)

where K = 5 corresponds to the severity levels trivial, minor, major, critical, and blocker.
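Equations (5) and (6) can be traced on a small example. In the sketch below, the weight matrix, bias, and input vector are hypothetical values with d = 2, chosen only so that the arithmetic is easy to follow; subtracting the maximum logit is a standard numerical-stability step that does not change the softmax output.

```python
import math

SEVERITIES = ["trivial", "minor", "major", "critical", "blocker"]


def linear(W, h, b):
    """z = W h + b for a K x d weight matrix W given as nested lists."""
    return [sum(w * x for w, x in zip(row, h)) + b_k for row, b_k in zip(W, b)]


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical parameters (K = 5 classes, d = 2 features).
h_cls = [1.0, 2.0]
W = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
b = [0.0] * 5

z = linear(W, h_cls, b)
probs = softmax(z)
predicted = SEVERITIES[probs.index(max(probs))]
print(z, predicted)
```

The final line also illustrates how a prediction is read off: the class with the largest probability is selected.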

The classification layer is intentionally kept simple to reduce the influence of classifier complexity on the experimental results. This design allows the evaluation to focus more directly on the quality of the learned transformer representations.

To address class imbalance during training, a weighted cross-entropy loss is applied to the labeled source data:

L_{cls} = - \sum_{k=1}^{K} w_{k} y_{k} \log p(y = k \mid x)

(7)

where w_k is the class-specific weight, which scales inversely with the frequency of severity class k in the training dataset, and y_k is an indicator that equals 1 if k is the ground-truth label and 0 otherwise.
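The effect of the class weights can be illustrated as follows. The exact weighting formula is not given in the text; the sketch assumes w_k = N / (K · n_k), a common inverse-frequency scheme, and uses invented label counts with three classes for brevity. It also assumes every class occurs at least once in the training labels.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels, classes):
    """Assumed scheme: w_k = N / (K * n_k), so rarer classes get larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(classes)
    return {c: n / (k * counts[c]) for c in classes}


def weighted_cross_entropy(probs, true_class, weights):
    """Loss for one example; probs maps each class to its predicted probability."""
    return -weights[true_class] * math.log(probs[true_class])


# Toy training labels, invented for illustration.
classes = ["major", "critical", "blocker"]
labels = ["major"] * 6 + ["critical"] * 3 + ["blocker"] * 1
weights = inverse_frequency_weights(labels, classes)

# The same predicted probability is penalized more for the rare class.
probs = {"major": 0.5, "critical": 0.3, "blocker": 0.5}
loss_major = weighted_cross_entropy(probs, "major", weights)
loss_blocker = weighted_cross_entropy(probs, "blocker", weights)
print(weights, round(loss_blocker / loss_major, 2))
```

Under this scheme, an equally confident mistake on a blocker report contributes six times more loss than one on a major report in the toy data, which counteracts the dominance of the majority class.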

By combining compact transformer encoders with a simple classification layer, the proposed framework supports efficient bug severity classification while maintaining competitive predictive performance.

DistilBERT is used as the primary backbone model because it provides a good balance between model performance and computational cost. TinyBERT is also included for comparative analysis to examine how smaller transformer architectures behave under different transfer settings.

4.4. Representation-Level Domain Adaptation

Representations learned from one project may not generalize well to another project with different reporting characteristics [29]. In transfer settings, the latent feature distributions of the source and target projects may differ substantially [30].

Let h_s and h_t denote the latent representations of the source and target projects, respectively [29]. Their distributions generally differ, and domain adaptation aims to reduce this discrepancy [29]:

P (h_{s}) \neq P (h_{t})

(8)

To reduce the distribution gap between projects, a representation alignment loss is applied between the transformer encoder and the classification layer. Maximum Mean Discrepancy (MMD) is used to align the source and target feature distributions [30]:

L_{align} = \left\| \frac{1}{n_{s}} \sum_{i=1}^{n_{s}} h_{s}^{(i)} - \frac{1}{n_{t}} \sum_{j=1}^{n_{t}} h_{t}^{(j)} \right\|

(9)

where n_s and n_t denote the source and target batch sizes, respectively. This alignment helps the model learn feature representations that are more consistent across projects while still preserving severity-related semantic information.
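Equation (9) compares the mean representations of a source batch and a target batch. The sketch below is a direct reading of that formula under the assumption that the norm is Euclidean, which the text does not state explicitly; the vectors are invented two-dimensional examples.

```python
import math


def mean_vector(batch):
    """Component-wise mean of a batch of equal-length vectors."""
    dim = len(batch[0])
    return [sum(row[j] for row in batch) / len(batch) for j in range(dim)]


def mean_difference_alignment(source_batch, target_batch):
    """Euclidean norm of the gap between source and target batch means."""
    mu_s = mean_vector(source_batch)
    mu_t = mean_vector(target_batch)
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(mu_s, mu_t)))


source = [[0.0, 0.0], [2.0, 0.0]]                  # mean (1, 0)
target = [[4.0, 4.0], [4.0, 4.0], [4.0, 4.0]]      # mean (4, 4)
print(mean_difference_alignment(source, target))
```

The batch sizes n_s and n_t do not need to match, since each side is averaged separately before the difference is taken.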

4.5. Joint Optimization Objective

The integrated training objective combines both severity prediction and domain alignment [13]:

L = L_{cls} + λ L_{align}

(10)

where λ is a hyperparameter that controls the balance between source-domain classification performance and feature alignment across projects [31,32]. This joint objective helps the model learn feature representations that better separate severity classes while reducing differences between projects, which may improve transfer performance.

4.6. Cross-Project Training Procedure

Training is performed using mini-batches that contain labeled bug reports from the source project(s) together with unlabeled bug reports from the target project [21]. During each iteration, the classification loss is computed using only the labeled source data, while the alignment loss is calculated using representations from both the source and target domains.

Model parameters are updated through gradient-based optimization to minimize the joint loss function. In multi-source settings, labeled data from multiple source projects are combined during training to reduce source-specific bias and improve transfer performance across projects.

Throughout training, the target project remains unlabeled. Ground-truth labels from the target domain are used only during the evaluation phase [33,34].
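The batching scheme and the joint objective can be outlined as below. This is a schematic sketch rather than the authors' training loop: the helper names are hypothetical, the encoder and optimizer are omitted, and recycling target batches when the target project is smaller than the source is an assumption. The target list is assumed to be non-empty.

```python
from itertools import cycle


def paired_batches(source_examples, target_texts, batch_size):
    """Yield (labeled source batch, unlabeled target batch) pairs.

    One pass covers the source data once; target batches are recycled
    when the target project has fewer batches than the source.
    """
    def chunks(items):
        return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

    target_cycle = cycle(chunks(target_texts))
    for source_batch in chunks(source_examples):
        yield source_batch, next(target_cycle)


def joint_loss(cls_loss, align_loss, lam):
    """Classification loss plus lambda times the alignment loss."""
    return cls_loss + lam * align_loss


# Source items carry labels; target items are text only (labels are never read).
source = [(f"source summary {i}", "major") for i in range(5)]
target = ["target summary 0", "target summary 1", "target summary 2"]
pairs = list(paired_batches(source, target, batch_size=2))
print(len(pairs), joint_loss(1.0, 0.4, lam=0.5))
```

In each pair, the classification loss would be computed from the labeled source batch only, while the alignment loss would use encoder representations of both batches.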

4.7. Inference on the Target Project

After training, the severity of a target bug report text x is predicted by computing its latent representation h and selecting the class with the highest predicted probability:

\hat{y} = \arg\max_{k} p(y = k \mid x)

(11)

where x denotes a bug report text and ŷ denotes its predicted severity class label.

This process is applied consistently across all experimental settings and supports model transfer across different software projects without requiring labeled target data during training.

5. Experimental Setup and Evaluation

This section presents the experimental setup, evaluation metrics, and empirical results used in the study. The experiments examine model performance under different transfer settings and compare the effectiveness of the proposed approaches across multiple software projects.

5.1. Experimental Setup

This study evaluates lightweight transformer models for bug severity classification using Mozilla bug report datasets collected from four projects: Core, Firefox, Thunderbird, and Bugzilla. The experiments follow the transfer settings described in Section 3, including both single-source to single-target and multi-source to single-target scenarios.

To evaluate transfer performance, models are trained using bug reports from one or more source projects and tested on a different target project. This setup ensures that the training and testing data originate from different software systems.

DistilBERT and TinyBERT are used as the primary transformer models. DistilBERT is selected because it provides a good balance between predictive performance and computational cost [27]. TinyBERT is included as a more compact alternative to examine how smaller transformer architectures behave under the same transfer settings [28]. Both models are fine-tuned using the same training pipeline and differ only in their encoder architectures.

For comparison, several baseline models from both traditional machine learning and deep learning are also included. These baselines consist of TF-IDF features combined with Logistic Regression and Support Vector Machines, together with a Convolutional Neural Network (CNN) model using pre-trained Word2Vec embeddings. The selected baselines provide a consistent reference for evaluating transformer-based approaches under the same experimental conditions.

All experiments use a unified dataset configuration. For each project, the data are partitioned into training and testing sets using stratified sampling with an 80:20 split while preserving the original severity distribution. During training, labeled data from the source project(s) are used for optimization, whereas the target project's labels are withheld and used only during evaluation.

To support fair comparison, all models use the same preprocessing procedures, batch size, learning rate, number of training epochs, and maximum input sequence length. Class imbalance is handled using weighted cross-entropy loss, where class weights are computed inversely proportional to class frequencies in the source training data.

To further analyze the contribution of different components, an ablation study is conducted under three settings:

(1) single-source training without domain adaptation,
(2) multi-source training without domain adaptation, and
(3) multi-source training with representation-level domain adaptation using Maximum Mean Discrepancy (MMD).

These settings allow the study to examine the effects of data diversity and feature alignment on transfer performance.

All experiments are implemented using the Hugging Face Transformers library. DistilBERT-base-uncased is used as the primary backbone model, while TinyBERT-4L-312D is used as a lightweight alternative. The maximum input sequence length is set to 128 tokens. Models are fine-tuned using the AdamW optimizer with a learning rate of 2 × 10⁻⁵, a batch size of 32, and 10 training epochs. Early stopping is applied based on Macro-F1 performance on the source-domain validation set with a patience value of two epochs.
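The early-stopping rule with a patience of two epochs can be sketched as follows. The class name, the strict-improvement criterion, and the validation scores are assumptions for illustration; the scores are made up and do not correspond to any reported run.

```python
class EarlyStopping:
    """Stop when the monitored score has not improved for `patience` epochs."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, score):
        """Record one epoch's validation score; return True if training should stop."""
        if score > self.best_score:  # strict improvement is assumed
            self.best_score = score
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up source-validation Macro-F1 values for a 10-epoch budget
val_macro_f1 = [0.41, 0.47, 0.50, 0.49, 0.50, 0.48, 0.47, 0.46, 0.45, 0.44]
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, score in enumerate(val_macro_f1, start=1):
    if stopper.step(epoch, score):
        stopped_at = epoch
        break
print(f"best epoch {stopper.best_epoch}, stopped after epoch {stopped_at}")
```

In practice the checkpoint from the best epoch, rather than the last one, would be kept for evaluation on the target project.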

For domain adaptation, representation alignment is performed using MMD with a Gaussian kernel. The alignment coefficient λ is fixed at 0.5 across all experiments. The overall training objective combines the classification loss from labeled source data with the alignment loss between source and target representations.
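A minimal sketch of an MMD term with a Gaussian kernel, combined with a classification loss through λ = 0.5, is shown below. It uses the biased estimator of squared MMD on plain Python lists; the kernel bandwidth sigma, the use of the squared form in the objective, and all function names are assumptions, since these details are not given in this section.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)); sigma is an assumed bandwidth."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(source, target, sigma=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""

    def mean_kernel(xs, ys):
        total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(source, source)
        + mean_kernel(target, target)
        - 2.0 * mean_kernel(source, target)
    )


def total_loss(classification_loss, source_feats, target_feats, lam=0.5, sigma=1.0):
    """Classification loss on labeled source data plus the weighted alignment term."""
    return classification_loss + lam * mmd_squared(source_feats, target_feats, sigma)


# Toy 2-D features standing in for encoder representations (made up)
source = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
shifted = [[3.0, 3.0], [4.0, 3.0], [3.0, 4.0]]
print(mmd_squared(source, source))   # identical sets give zero discrepancy
print(mmd_squared(source, shifted))  # shifted set gives a positive discrepancy
print(total_loss(1.2, source, shifted))
```

Only the target features enter the alignment term, so no target labels are required to compute it.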

To reduce variability caused by random initialization, all experiments are repeated using five random seeds {42, 52, 62, 72, 82}. Results are reported as mean ± standard deviation. All experimental settings, including preprocessing procedures, data partitioning, model configurations, and random seeds, are specified explicitly to support reproducibility. The hyperparameter configuration used throughout all experiments is summarized in Table 3.
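The aggregation over seeds can be sketched as follows. The per-seed scores are made up, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since this section does not state which estimator is used.

```python
import statistics

SEEDS = [42, 52, 62, 72, 82]


def summarize(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


# Made-up Macro-F1 values, one per seed (not reported results)
macro_f1_by_seed = {42: 0.52, 52: 0.54, 62: 0.53, 72: 0.51, 82: 0.55}
mean, std = summarize([macro_f1_by_seed[s] for s in SEEDS])
print(f"{mean:.3f} ± {std:.3f}")
```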

5.2. Evaluation Metrics

The Mozilla bug severity dataset is highly imbalanced across severity classes. Most bug reports belong to the major and critical categories, while blocker and trivial reports occur less frequently. These class differences may become more pronounced in transfer settings because severity distributions vary across software projects.

To support balanced evaluation across both majority and minority classes, the primary evaluation metric used in this study is the macro-averaged F1 score (Macro-F1) [35,36]. Macro-F1 computes the F1 score independently for each severity class and then averages the results using equal weights. This metric reduces the influence of dominant classes and provides a more balanced evaluation of minority severity categories such as blocker.

Let P_k and R_k denote the precision and recall of severity class k, respectively. The F1 score for class k is defined as:

F1_{k} = 2 \times \frac{P_{k} \times R_{k}}{P_{k} + R_{k}} \qquad (12)

The Macro-F1 score is then computed as:

\text{Macro-F1} = \frac{1}{K} \sum_{k=1}^{K} F1_{k} \qquad (13)

where K = 5 denotes the number of severity classes.

In addition to Macro-F1, the weighted F1-score (Weighted-F1) is also reported [35,36]. Unlike Macro-F1, Weighted-F1 computes the average F1 score by weighting each class according to its frequency in the test set. Therefore, this metric reflects model performance under the original severity distribution of the dataset and provides a complementary view of overall classification performance.
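Equations (12) and (13), together with the frequency-weighted variant, can be sketched in a few lines of Python. The toy labels use only two classes for brevity, and treating an undefined precision, recall, or F1 as 0.0 is an assumed convention.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, classes):
    """F1 score of each class, following Equation (12)."""
    scores = {}
    for k in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == k and p == k)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != k and p == k)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == k and p != k)
        precision = tp / (tp + fp) if tp + fp else 0.0  # assumed zero-division rule
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[k] = 2 * precision * recall / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores, following Equation (13)."""
    f1 = per_class_f1(y_true, y_pred, classes)
    return sum(f1.values()) / len(classes)


def weighted_f1(y_true, y_pred, classes):
    """Per-class F1 scores averaged with weights equal to class support."""
    f1 = per_class_f1(y_true, y_pred, classes)
    support = Counter(y_true)
    return sum(f1[k] * support[k] for k in classes) / len(y_true)


# Toy example: the majority class is always right, the minority class half the time
classes = ["major", "blocker"]
y_true = ["major"] * 8 + ["blocker"] * 2
y_pred = ["major"] * 8 + ["blocker", "major"]
print(f"Macro-F1    = {macro_f1(y_true, y_pred, classes):.4f}")
print(f"Weighted-F1 = {weighted_f1(y_true, y_pred, classes):.4f}")
```

In this toy case Macro-F1 is lower than Weighted-F1 because the weaker minority-class score counts equally with the majority class, which is why Macro-F1 is the more sensitive indicator of minority-class performance.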

Statistical significance is evaluated using the Wilcoxon signed-rank test at a significance level of α = 0.05, so that differences with p < 0.05 are treated as significant. The results indicate that the transformer-based models generally outperform both traditional machine learning and deep learning baselines across the evaluated transfer settings.
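For small numbers of paired observations, the Wilcoxon signed-rank test can be computed exactly by enumerating sign patterns, as sketched below. Which observations are paired (for example, per transfer setting) is not stated in this section, so the pairing, the scores, and the function name are assumptions; the two-sided p-value is defined here on the statistic min(W+, W-).

```python
from itertools import product


def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Zero differences are dropped and tied absolute differences receive
    average ranks. Returns (W, p) with W = min(W+, W-). All 2**n sign
    patterns are enumerated, so this is only practical for small n.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda idx: abs(diffs[idx]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # average of ranks i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w_obs = min(w_plus, w_minus)
    total = sum(ranks)
    hits = 0
    for signs in product((0, 1), repeat=n):
        w_pos = sum(r for r, s in zip(ranks, signs) if s)
        if min(w_pos, total - w_pos) <= w_obs:
            hits += 1
    return w_obs, hits / 2 ** n


# Made-up Macro-F1 scores for six paired transfer settings (not reported results)
transformer = [0.52, 0.49, 0.55, 0.47, 0.51, 0.44]
baseline = [0.45, 0.44, 0.46, 0.41, 0.48, 0.42]
w, p = wilcoxon_signed_rank(transformer, baseline)
print(f"W = {w}, p = {p}")
```

With this exact formulation, the smallest attainable two-sided p-value for n non-zero differences is 2 / 2^n, so at least six paired observations are needed before p < 0.05 is possible.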

5.3. Experimental Results and Discussion

This section presents the experimental results of the transformer-based models under different transfer settings. The analysis examines model behavior across multiple software projects and evaluates performance under both single-source and multi-source transfer scenarios.

The results are organized according to the proposed research questions. Section 5.3.1 and Section 5.3.3 focus on single-source transfer settings related to RQ1, while Section 5.3.2 and Section 5.3.4 examine the effect of multi-source training in relation to RQ2. Section 5.3.5 investigates the contribution of representation-level domain adaptation associated with RQ3. Section 5.3.6 discusses computational efficiency and deployment-related considerations relevant to RQ4.

Overall, the transformer-based models show relatively stable performance across most transfer settings. However, the degree of improvement varies depending on the similarity between source and target projects.

It is also important to note that all experiments in this study use only the summary/title field of each bug report. Although this setting supports efficient evaluation, it may not capture all contextual information available in longer bug descriptions. Therefore, the reported results should be interpreted within the scope of title-based severity classification.

The baseline models were selected to support controlled comparison under consistent experimental conditions. More computationally intensive models and large-scale LLM-based approaches are not included because they may introduce additional variability related to model size, training data, and computational requirements.

Performance degradation is observed in some transfer settings where substantial differences exist between source and target projects. In particular, transfer settings involving Bugzilla as the target domain generally produce lower performance than other project pairs. This behavior may be associated with its more imbalanced severity distribution and distinct reporting characteristics.

In addition, improvements are less consistent for minority severity categories such as critical and blocker, where limited training samples may make classification more difficult. These observations suggest that transfer performance is influenced by both class distribution and project similarity across domains.

5.3.1. Baseline Performance Under Single-Source Transfer

Table 4 presents the performance of representative traditional machine learning and deep learning baselines under the single-source to single-target transfer setting. These results provide baseline performance comparisons for evaluating bug severity classification across different software projects.

The results also illustrate the difficulty of transfer learning between source–target project pairs and serve as a reference for comparison with the transformer-based models.

The baseline results provide several observations regarding bug severity classification under single-source transfer settings.

In general, the TF-IDF-based classifiers show lower and less stable performance across different projects. This behavior suggests that surface-level lexical features are sensitive to differences in project-specific terminology, reporting styles, and severity labeling practices.

The CNN-based baseline consistently performs better than the TF-IDF-based methods. This result indicates that distributed word embeddings and contextual phrase patterns provide advantages over simple lexical matching. By using pre-trained word embeddings, the CNN model can capture severity-related patterns expressed through short contextual sequences.

However, performance degradation is still observed across all baseline models, especially when transfer involves smaller or more heterogeneous projects. In contrast, transfer settings that use larger or more diverse source projects generally produce more stable results. A broader range of defect types and reporting styles may provide more informative training signals, even without explicit domain adaptation.

Overall, these findings highlight the difficulty of bug severity classification across different software projects when domain adaptation is not applied. The baseline results also provide a reference point for comparison with the transformer-based models discussed in the following sections.

5.3.2. Baseline Performance Under Multi-Source Transfer

This section presents the performance of traditional machine learning and deep learning baselines under the multi-source to single-target transfer setting. In this setting, bug reports from multiple source projects are combined during training before evaluation on a separate target project. Table 5 summarizes the baseline results under the multi-source transfer configuration.

Compared with the single-source setting, multi-source training generally improves the performance of both traditional machine learning and deep learning baselines. Exposure to more diverse source projects provides broader coverage of reporting styles, defect patterns, and severity distributions, which helps improve transfer performance across projects.

However, the improvements remain limited for the TF-IDF-based classifiers. Because these models rely mainly on surface-level lexical features, they are still sensitive to project-specific vocabulary differences. The observed improvements are therefore more likely related to increased term coverage rather than deeper cross-project representation learning.

The CNN-based baseline shows larger improvements under the multi-source setting. This result suggests that distributed word embeddings help capture semantic patterns shared across different projects. Training with more diverse source data also helps the model learn more stable local contextual features related to bug severity.

However, the CNN model still has limitations when target projects differ substantially in reporting structure or annotation style. Fixed convolutional windows may reduce the model’s ability to capture broader contextual information across bug reports.

Performance gaps also remain between the multi-source baselines and the transformer-based models. These results suggest that data aggregation alone may not be sufficient to fully address differences between software projects. More context-aware representation learning methods may still be necessary to improve transfer performance across heterogeneous repositories.

5.3.3. Analysis of Single-Source Transfer Performance

The results in Table 6 show noticeable performance differences across source–target project pairs, indicating that transfer performance is influenced by differences between software projects. In general, transfer settings that use larger and more diverse source projects, such as Firefox and Core, tend to produce higher Macro-F1 scores. This result suggests that richer source data may help the models learn more transferable severity-related patterns.

All results are reported as mean ± standard deviation across five random seeds. Statistical significance is evaluated using the Wilcoxon signed-rank test at a significance level of α = 0.05.

Compared with the traditional machine learning and deep learning baselines, the transformer-based models generally achieve better performance across most transfer settings. These results indicate that contextualized transformer representations are more effective in handling project-specific variations than surface-level lexical features.

Both DistilBERT and TinyBERT also show relatively low standard deviations across random seeds, indicating stable training behavior under different transfer settings. In addition, the Wilcoxon signed-rank test suggests that the observed improvements over the strongest baseline are statistically significant rather than caused by random variation.

Differences in transfer performance across source–target pairs highlight the importance of source data diversity. Larger and more diverse projects generally support better transfer performance because they provide broader coverage of defect types, reporting styles, and severity-related patterns.

In contrast, transfer settings that use Bugzilla as the source project remain more challenging. This behavior may be associated with its smaller dataset size, more imbalanced severity distribution, and project-specific reporting characteristics.

The results also show noticeable asymmetry between several transfer directions. For example, transfer from Firefox to Bugzilla generally produces better performance than transfer from Bugzilla to Firefox. One possible explanation is that Firefox contains a larger and more diverse collection of bug reports, which exposes the model to a wider range of severity-related vocabulary and reporting patterns during training.

By comparison, Bugzilla contains fewer reports and relatively fewer high-severity instances, which may limit the diversity of learned representations. Similar asymmetric behavior can also be observed in other project pairs. These findings suggest that transfer performance is influenced not only by semantic similarity between projects but also by differences in dataset size, reporting style, and severity distribution.

Between the two transformer models, DistilBERT consistently achieves better performance than TinyBERT. This result suggests that larger representational capacity helps capture more nuanced semantic information from bug reports. Nevertheless, TinyBERT maintains competitive performance with relatively low variance across experiments, indicating that compact transformer models can provide stable transfer performance at lower computational cost.

Overall, the transformer-based models provide more stable performance than the baseline approaches under single-source transfer settings. Although performance differences across projects remain, the transformer models generally reduce the degradation observed in the traditional machine learning and CNN-based baselines.

The results also indicate that lightweight transformer models can maintain relatively consistent transfer performance across different project pairs. However, model effectiveness still depends on the similarity between source and target projects.

To further examine model behavior, several qualitative examples were analyzed. In successful cases, the transformer-based models correctly identified high-severity bug reports when the summaries contained explicit severity-related expressions or clear descriptions of operational impact. Bug reports related to system crashes, security failures, or major functional disruption were more consistently classified into critical or blocker categories across projects.

However, several failure cases were also observed. Misclassification commonly occurred when bug summaries were short, ambiguous, or highly project-specific. In some cases, reports containing implicit severity cues or domain-specific terminology were incorrectly assigned to neighboring severity categories, such as major instead of critical.

These observations indicate that severity prediction across software projects remains sensitive to vocabulary differences and contextual ambiguity, particularly when source and target projects differ substantially in reporting characteristics. Table 7 provides qualitative examples illustrating both successful predictions and common misclassification cases observed in the cross-project setting.

5.3.4. Analysis of Multi-Source Transfer Performance

Table 8 shows that multi-source training improves performance in most transfer settings, although the improvements are generally moderate. Macro-F1 scores increase across all target projects, indicating better performance on minority severity classes.

These results suggest that combining labeled data from multiple projects helps the models learn from a broader range of reporting styles, severity distributions, and lexical patterns. Exposure to more diverse source data may therefore improve transfer performance across projects.

The results are reported using the same evaluation protocol as in Table 6. Compared with the single-source setting, the transformer-based models also show lower variance across random seeds under multi-source training. This behavior suggests more stable optimization and reduced sensitivity to project-specific patterns during training.

Multi-source transfer appears to be particularly beneficial when the target project differs substantially from any individual source project. In these cases, combining multiple training projects provides broader coverage of defect types and reporting patterns, which helps improve transfer performance.

DistilBERT generally benefits more from multi-source training than TinyBERT. This result suggests that models with larger representational capacity are better able to utilize diverse training signals across projects. Nevertheless, TinyBERT still maintains competitive and relatively stable performance, indicating that compact transformer models can also perform effectively under multi-source settings.

The performance improvements obtained through multi-source transfer are also statistically significant compared with the strongest non-transformer baseline, as indicated by the Wilcoxon signed-rank test. This result suggests that the observed improvements are consistent across experiments rather than caused mainly by random initialization effects.

Overall, the findings indicate that multi-source training improves transfer performance by exposing the models to more diverse reporting patterns and severity distributions across projects.

5.3.5. Ablation Study: Impact of Domain Alignment and Multi-Source Training

To examine the effects of multi-source training and domain alignment, an ablation study is conducted using three configurations:

(1) single-source transformer models,
(2) multi-source transformers without domain alignment, and
(3) multi-source transformers with MMD-based domain adaptation.

The results show that multi-source training generally improves performance by exposing the models to more diverse reporting styles and severity distributions. Additional improvements from domain alignment are also observed, although the gains are generally moderate and vary across transfer settings. Table 9 presents the ablation results comparing single-source training, multi-source training, and MMD-based domain adaptation.

In particular, MMD-based domain alignment further improves performance for both transformer models. For DistilBERT, the Macro-F1 score increases from 0.52 to 0.56 after domain alignment is applied, while for TinyBERT it increases from 0.49 to 0.53. Although the improvements are moderate, the trend is consistent across both models.

The ablation results suggest that the observed performance gains are not caused only by increased training data diversity. Instead, multi-source training and domain alignment appear to provide complementary benefits. Multi-source training exposes the models to more diverse reporting patterns, while domain alignment helps reduce differences between source and target feature distributions.

To further examine the reliability of the observed improvements, paired t-tests are conducted on Macro-F1 scores across the transfer settings. Comparisons are performed between single-source, multi-source, and domain-adapted configurations of the same backbone models under identical experimental conditions. Statistical significance is evaluated at the 0.05 level.

The results indicate that some performance improvements are statistically significant, although several gains remain relatively small in magnitude. Therefore, the practical impact of the improvements should be interpreted together with the observed effect sizes rather than statistical significance alone.
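To illustrate how a paired comparison and an accompanying effect size can be reported together, the sketch below computes the paired t statistic and a standardized mean difference of the paired differences (often written d_z). The choice of d_z as the effect-size measure, the function name, and the scores are assumptions; this section does not state which effect size was used.

```python
import math
import statistics


def paired_t_and_effect_size(a, b):
    """Return (t, degrees of freedom, d_z) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation of differences
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    d_z = mean_diff / sd_diff  # effect size of the paired differences
    return t_stat, n - 1, d_z


# Made-up Macro-F1 scores for four paired transfer settings (not reported results)
multi_source = [0.54, 0.50, 0.57, 0.49]
single_source = [0.52, 0.47, 0.53, 0.48]
t_stat, df, d_z = paired_t_and_effect_size(multi_source, single_source)
print(f"t = {t_stat:.3f}, df = {df}, d_z = {d_z:.3f}")
```

The p-value would then be obtained from a Student t distribution with the returned degrees of freedom, while the raw mean difference in Macro-F1 indicates the practical size of the gain. The sketch assumes the differences are not all identical, since the standard deviation would otherwise be zero.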

5.3.6. Deployment Efficiency Analysis

In addition to predictive performance, this study also evaluates deployment efficiency. Lightweight transformer models are particularly suitable for software engineering environments with limited computational resources and response-time constraints.

Table 10 summarizes the model size and approximate inference time of the evaluated transformer architectures. All efficiency measurements were conducted under the same experimental environment to support consistent comparison across models.

Inference time was measured on the Google Colab cloud computing platform using a maximum input sequence length of 128 tokens. The experiments were implemented using Python 3.10, PyTorch 2.1, and the Hugging Face Transformers library (version 4.38). For each model, inference time was averaged across the target test samples under identical batch-processing conditions. These measurements are intended to provide relative efficiency comparisons between lightweight transformer architectures rather than hardware-independent latency benchmarks.
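A timing procedure of this kind can be sketched as follows. The function dummy_predict is a hypothetical stand-in for tokenization plus a model forward pass, and the batch size of 32 and the single untimed warm-up batch are assumptions, since the inference batch size is not stated in this section.

```python
import time


def mean_latency_ms(predict_batch, samples, batch_size=32, warmup_batches=1):
    """Average per-sample wall-clock latency in milliseconds."""
    batches = [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]
    for batch in batches[:warmup_batches]:
        predict_batch(batch)  # warm-up calls are not timed
    start = time.perf_counter()
    for batch in batches:
        predict_batch(batch)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(samples)


def dummy_predict(batch):
    """Hypothetical stand-in for tokenization and a model forward pass."""
    return [len(text.split()) % 5 for text in batch]


# Made-up bug titles standing in for the target test samples
titles = [f"Crash when opening tab {i}" for i in range(100)]
print(f"{mean_latency_ms(dummy_predict, titles):.4f} ms per sample")
```

Running every model through the same function on the same samples gives relative comparisons; absolute values depend on the hardware assigned to the session.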

DistilBERT substantially reduces the number of parameters compared with the original BERT architecture, while TinyBERT provides further model compression. As a result, both models achieve lower memory usage and faster inference time, which may be beneficial for automated bug triage systems.

Although TinyBERT achieves slightly lower Macro-F1 scores than DistilBERT, its substantially smaller model size makes it attractive for environments with limited computational resources. Therefore, the choice between DistilBERT and TinyBERT depends on the balance between predictive performance and deployment efficiency.

As shown in Table 10, DistilBERT reduces the parameter count by approximately 40% compared with the standard BERT architecture, while TinyBERT further compresses the model to approximately 14 million parameters and achieves the fastest inference time among the evaluated models.
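The compression figures can be checked with simple arithmetic. The parameter counts below are the nominal values commonly cited for these architectures (about 110 million for BERT-base, 66 million for DistilBERT, and 14.5 million for TinyBERT-4L-312D); they are assumptions for illustration and are not read from Table 10. The storage estimate assumes 32-bit weights.

```python
# Nominal parameter counts in millions (assumed; not taken from Table 10)
PARAMS_MILLIONS = {"BERT-base": 110.0, "DistilBERT": 66.0, "TinyBERT-4L-312D": 14.5}


def reduction_vs(reference, model, params=PARAMS_MILLIONS):
    """Fraction of parameters removed relative to the reference model."""
    return 1.0 - params[model] / params[reference]


def fp32_size_mb(millions_of_params):
    """Approximate weight storage at 4 bytes per parameter (1 MB = 10**6 bytes)."""
    return millions_of_params * 4.0


for name, millions in PARAMS_MILLIONS.items():
    print(
        f"{name:>17}: {millions:6.1f}M params, "
        f"{reduction_vs('BERT-base', name):5.1%} fewer than BERT-base, "
        f"about {fp32_size_mb(millions):.0f} MB in fp32"
    )
```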

Overall, the results suggest that lightweight transformer models can provide a reasonable balance between predictive performance and computational efficiency for automated bug triage applications.

5.3.7. Limitations of the Study

Several limitations should be considered when interpreting the findings of this study.

First, the experiments are conducted only on Mozilla-related projects, including Core, Firefox, Thunderbird, and Bugzilla. Although these projects differ in reporting styles, severity distributions, and project characteristics, they still belong to the same Bugzilla ecosystem. Therefore, the findings may not fully generalize to software repositories that use different issue-tracking platforms or maintenance practices. Additional evaluation on external repositories, such as Eclipse- or Apache-related projects, would help provide broader evidence of generalizability.

Second, the study uses only the summary/title field of each bug report as textual input. This setting supports efficient evaluation and reflects early-stage bug triage scenarios. However, full bug reports may contain additional contextual information, including reproduction steps, stack traces, environment configurations, and developer discussions, which could further improve severity prediction performance.

Third, the experiments are limited to four software projects under single-source and multi-source transfer settings. As a result, the observed transfer behavior may still be influenced by project-specific severity distributions, vocabulary patterns, and reporting characteristics. Additional validation using larger and more diverse software repositories would help further examine the stability of the proposed framework across different environments.

Finally, the severity labels used in this study originate from manually assigned labels in the original bug tracking repositories. Although these labels reflect realistic software maintenance practices, severity assessment may still involve some degree of subjectivity across projects and developers. This characteristic may introduce additional variability into severity prediction performance.

5.4. Discussion

The experimental results provide several insights into the factors affecting bug severity classification across different software projects. In general, the findings suggest that transfer performance is influenced by both representation quality and data diversity.

The single-source baseline results in Section 5.3.1 show that traditional models based mainly on surface-level lexical features still experience substantial performance degradation during transfer. This behavior suggests that differences between projects involve not only vocabulary variation but also changes in reporting style and severity annotation practices. In many cases, baseline performance appears to depend heavily on vocabulary overlap between source and target projects.

The multi-source baseline results in Section 5.3.2 indicate that combining data from multiple projects improves transfer performance to some extent. Exposure to more diverse defect types and reporting patterns helps increase feature coverage across projects. However, the remaining performance gap suggests that data diversity alone is not sufficient to fully address differences between software repositories.

The transformer-based results in Section 5.3.3 further demonstrate the advantages of contextual representation learning. Compared with the baseline models, the transformer architectures show better transfer performance across most project pairs. Even under single-source transfer settings, the models are able to capture severity-related contextual information beyond simple token frequency or lexical matching.

At the same time, transfer performance still varies considerably across source–target pairs. Some transfer directions remain substantially more difficult than others, particularly when the projects differ in reporting characteristics or severity distributions. These findings indicate that domain differences remain an important challenge for bug severity prediction across projects.

The multi-source transformer results in Section 5.3.4 suggest that data diversity and contextual representation learning provide complementary benefits. Multi-source training improves model stability by exposing the transformer models to a wider range of reporting patterns and severity distributions. The transformer architectures also appear more capable of generalizing semantic information across projects than the traditional baseline models.

Nevertheless, some target projects remain difficult even under multi-source transfer settings. This observation suggests that the most challenging transfer cases may reflect deeper structural differences between software projects rather than only limited training data availability.

Overall, the findings suggest that improving transfer performance across software projects requires both diverse training data and strong contextual representation learning. Multi-source training helps broaden project coverage, while transformer-based architectures provide more effective semantic representations for handling project-specific variation.

5.5. Threats to Validity

Several threats to validity should be considered when interpreting the results of this study.

First, the experiments are conducted only on bug reports collected from the Mozilla ecosystem. Although this dataset is widely used in software engineering research, the findings may not fully generalize to repositories from other organizations or software domains.

Second, the severity labels are manually assigned by reporters and triagers. As a result, some degree of subjectivity and labeling inconsistency may exist in the dataset. While this reflects realistic software maintenance practice, label noise may still affect classification performance.

Finally, the analysis includes only English-language bug reports. The proposed framework has not been evaluated on multilingual datasets, and its effectiveness in non-English environments remains open for further investigation.

6. Conclusions

This study investigates bug severity classification across different software projects using lightweight transformer models under transfer learning settings. The experiments focus on situations where labeled data in the target project are limited and differences between projects affect transfer performance.

The experimental results on Mozilla bug repositories show that transfer performance is strongly influenced by source–target project characteristics. In general, larger and more diverse source projects provide better transfer performance, while substantial differences in reporting style and severity distribution remain challenging for transfer learning.

Across most experimental settings, multi-source training improves performance compared with single-source transfer, particularly for minority severity classes. These findings suggest that combining data from multiple related projects helps improve transfer stability across software repositories.

The comparison between DistilBERT and TinyBERT also shows that lightweight transformer models can achieve competitive performance under transfer settings. DistilBERT generally produces higher Macro-F1 scores, while TinyBERT provides comparable performance with lower computational cost and faster inference time.

Overall, the findings suggest that lightweight transformer models provide a practical approach for automated bug triage systems, especially in environments with limited computational resources. The results also indicate that combining contextual representation learning with multi-source training can improve transfer performance across software projects.

Future work will extend the evaluation to additional software ecosystems, including Eclipse- and Apache-related repositories, to further examine generalizability across different development environments. Future studies may also investigate the use of full bug descriptions and combined title-and-description representations for transfer-based bug severity classification.

Author Contributions

Conceptualization, L.Z., J.P. and S.W.; Methodology, L.Z., J.P. and S.W.; Software, L.Z. and J.P.; Validation, L.Z., J.P. and S.W.; Formal Analysis, L.Z. and J.P.; Investigation, L.Z., J.P. and S.W.; Data Curation, L.Z. and J.P.; Writing – Original Draft, L.Z. and J.P.; Writing – Review and Editing, J.P.; Visualization, L.Z. and J.P.; Supervision, J.P. and S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Mahasarakham University, grant number 6902196.

Data Availability Statement

The data used in this study are derived from publicly available Mozilla Bugzilla bug reports, which can be accessed through the Bugzilla platform (https://bugzilla.mozilla.org/), accessed on 13 January 2026, under its respective data usage and licensing policies. The dataset can be reconstructed by querying the Bugzilla repository using the project identifiers and severity labels described in this study. Due to licensing and redistribution restrictions, the raw and processed datasets are not re-hosted by the authors. However, the data processing procedures, including severity labeling, project-wise filtering, and stratified 80/20 train–test splitting, are fully described in the manuscript to support reproducibility. Processed datasets and data split configurations may be provided by the corresponding author upon reasonable request for academic research purposes. All implementation details necessary to reproduce the experiments, including model configurations, hyperparameters, random seeds, and domain adaptation settings, are described in the manuscript. Model development is based on publicly available frameworks, including the Hugging Face Transformers library.

Acknowledgments

This work was financially supported by Mahasarakham University.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Arokiam, J.; Bradbury, J.S. Automatically Predicting Bug Severity Early in the Development Process. In Proceedings of the 42nd International Conference on Software Engineering: New Ideas and Emerging Results, Seoul, Republic of Korea; IEEE: New York, NY, USA, 2020; pp. 17–20.
Tao, Z.; Chen, J.; Yang, G.; Lee, B.; Luo, X. Towards More Accurate Severity Prediction and Fixer Recommendation of Software Bugs. J. Syst. Softw. 2016, 117, 166–184.
Hamdy, A.; Ezzat, G. Deep Mining of Open Source Software Bug Repositories. Int. J. Comput. Appl. 2022, 44, 614–622.
Hamdy, A.; El-Laithy, A. Semantic Categorization of Software Bug Repositories for Severity Assignment Automation. In Integrating Research and Practice in Software Engineering; Jarzabek, S., Poniszewska-Marańda, A., Madeyski, L., Eds.; Springer: Cham, Switzerland, 2020; pp. 15–30.
Luaphol, B.; Polpinij, J.; Kaenampornpan, M. Text Mining Approaches for Dependent Bug Report Assembly and Severity Prediction. Int. Arab J. Inf. Technol. 2022, 19, 915–924.
Sarawan, K.; Polpinij, J.; Luaphol, B. Machine Learning-Based Methods for Identifying Bug Severity Level from Bug Reports. In Proceedings of the International Conference on Computing and Information Technology, Bangkok, Thailand; Springer: Berlin/Heidelberg, Germany, 2023; pp. 199–208.
Bahaa, A.; Fathy, E.M.; Eldin, A.S.; Abd-Elmegid, L.A. A Systematic Literature Review of Software Defect Prediction using Deep Learning. J. Comput. Sci. 2021, 17, 490–510.
Long, G.; Gong, J.; Fang, H.; Chen, T. Learning Software Bug Reports: A Systematic Literature Review. ACM Trans. Softw. Eng. Methodol. 2025, 37, 111.
Arshad, A.A.; Riaz, A.; Fatima, R.; Yasin, A. SevPredict: Exploring the Potential of Large Language Models in Software Maintenance. Appl. Inform. 2024, 5, 2739–2760.
Rumman, M.; Roy, E.; Zaman, A.; Bradbury, J.S. A Contrastive Learning Approach to Bug Severity Classification with Large Language Model Embeddings. In Proceedings of the 49th Annual Computers, Software, and Applications Conference (COMPSAC), Toronto, ON, Canada; IEEE Computer Society: Los Alamitos, CA, USA, 2025; pp. 1376–1381.
Zimmermann, T.; Nagappan, N.; Gall, H.; Giger, G.; Murphy, B. Cross-Project Defect Prediction: A Large-Scale Experiment on Data vs. Domain vs. Process. In Proceedings of the Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Singapore; Association for Computing Machinery: New York, NY, USA, 2009; pp. 91–100.
Colavito, G.; Lanubile, F.; Novielli, N.; Arreza, C.; Shi, Y. Issue Classification with LLMs: An Empirical Study of the NASA flight Software Systems. J. Syst. Softw. 2026, 237, 112851.
Agrawal, R.; Goyal, R. Developing bug severity prediction models using word2vec. Int. J. Cogn. Comput. Eng. 2021, 2, 104–115.
Du, X.; Zhou, Z.; Yin, B.; Xiao, G. Cross-Project Bug Type Prediction Based on Transfer Learning. Softw. Qual. J. 2020, 28, 39–57.
Nam, J.; Pan, S.J.; Kim, S. Transfer Defect Learning. In Proceedings of the International Conference on Software Engineering, San Francisco, CA, USA; Association for Computing Machinery: New York, NY, USA, 2013; pp. 382–391.
Sotto-Mayor, B.; Kalech, M. A Survey on Transfer Learning for Cross-Project Defect Prediction. IEEE Access 2024, 12, 93398–93425.
Zirak, A.; Hemmati, H. Improving Automated Program Repair with Domain Adaptation. ACM Trans. Softw. Eng. Methodol. 2022, 33, 43.
Li, Z.; Li, Y.; Li, T.; Du, M.; Wu, B.; Cao, Y.; Xie, X.; Li, Y.; Liu, Y. Unveiling Project-Specific Bias in Neural Code Models. In Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation, Torino, Italy, 20–25 May 2024; pp. 17205–17216.
Mock, M.; Forrer, T.; Russo, B. Cross-Domain Evaluation of Transformer-Based Vulnerability Detection on Open and Industry Data. In Proceedings of the 26th International Conference on Product-Focused Software Process Improvement, Salerno, Italy, 1–3 December 2025; pp. 36–52.
Wei, Y.; Zhang, C.; Ren, T. Improving Bug Severity Prediction with Domain-Specific Representation Learning. IEEE Access 2023, 11, 62829–62839.
Hu, B.; Wang, J. A Weighted Multi-Source Domain Adaptation Approach for Surface Defect Detection. IET Image Process. 2022, 16, 2210–2218.
Xiao, Y.; Zuo, X.; Lu, X.; Dong, J.S.; Cao, X.; Beschastnikh, I. Promises and Perils of Using Transformer-Based Models for SE Research. Neural Netw. 2025, 184, 107067.
von der Mosel, J.; Trautsch, A.; Herbold, S. On the Validity of Pre-Trained Transformers for Natural Language Processing in the Software Engineering Domain. IEEE Trans. Softw. Eng. 2023, 49, 1487–1507.
Wang, M.; Cai, B.; Zou, W.; Zhang, J. Keys4BR: Key Sentences-based Model Fine-Tuning for Better Semantic Representation of Bug Reports. Inf. Softw. Technol. 2026, 189, 107943.
Stankevicius, L.; Lukoševičius, M. Extracting Sentence Embeddings from Pretrained Transformer Models. Appl. Sci. 2024, 14, 8887.
Grishina, A.; Hort, M.; Moonen, L. The EarlyBIRD Catches the Bug: On Exploiting Early Layers of Encoder Models for More Efficient Code Classification. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, San Francisco, CA, USA; Association for Computing Machinery: New York, NY, USA, 2023; pp. 895–907.
Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. In Proceedings of the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing (EMC2-NIPS), Vancouver, BC, Canada, 13 December 2019; pp. 1–5.
Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. TinyBERT: Distilling BERT for Natural Language Understanding. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP, Online, 16–20 November 2020; pp. 4163–4174.
Pan, S.J.; Tsang, I.; Kwok, J.; Yang, Q. Domain Adaptation via Transfer Component Analysis. IEEE Trans. Neural Netw. 2011, 22, 199–210.
Luo, Y.; Ren, J.; Peng, M.; Zhang, J.; Li, J. Unsupervised Domain Adaptation via Discriminative Manifold Propagation and Maximum Mean Discrepancy. Knowl.-Based Syst. 2021, 229, 107286.
Chen, C.; Chen, Z.; Jiang, B.; Jin, X. Joint Domain Alignment and Discriminative Feature Learning for Unsupervised Deep Domain Adaptation. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, Honolulu, HI, USA; AAAI Press: Washington, DC, USA, 2019; pp. 3296–3303.
Jiang, S.; Zhang, J.; Guo, F.; Teng, O.; Li, J. Balanced Adversarial Tight Matching for Cross-Project Defect Prediction. IET Softw. 2024, 2024, 1–19.
Zhao, H.; Zhang, S.; Wu, G.; Costeira, J.P.; Moura, J.M.F.; Gordon, G.J. Adversarial Multiple Source Domain Adaptation. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS), Montréal, QC, Canada; Curran Associates Inc.: New York, NY, USA, 2018; pp. 1–12.
Li, J.; Xu, Z.; Yongkang, W.; Zhao, Q.; Kankanhalli, M. GradMix: Multi-Source Transfer across Domains and Tasks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, Arizona, 6–10 March 2020; pp. 3019–3027.
Lee, M.C.H.; Braet, J.; Springael, J. Performance Metrics for Multilabel Emotion Classification: Comparing Micro, Macro, and Weighted F1-Scores. Appl. Sci. 2024, 14, 9863.
Opitz, J. A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice. Trans. Assoc. Comput. Linguist. 2024, 12, 820–836.

Figure 1. An example screenshot of a bug report.

Table 1. Summary of the datasets used.

Severity	Firefox	Core	Thunderbird	Bugzilla
Trivial	1111	864	740	788
Minor	3332	2591	1555	920
Major	7550	6046	2813	808
Critical	9772	7255	2073	112
Blocker	444	518	222	40
Total	22,209	17,274	7403	2468

Table 2. Summary of cross-project experimental settings on the Mozilla datasets.

Cross-Project Experimental Settings	Source Project (Train)	Target Project (Test)
A. Single-Source → Single-Target Transfer (12 Directed Pairs)	Core	Firefox
	Core	Thunderbird
	Core	Bugzilla
	Firefox	Core
	Firefox	Thunderbird
	Firefox	Bugzilla
	Thunderbird	Core
	Thunderbird	Firefox
	Thunderbird	Bugzilla
	Bugzilla	Core
	Bugzilla	Firefox
	Bugzilla	Thunderbird
B. Multi-Source → Single-Target Transfer (Leave-One-Project-Out, 4 Settings)	Core + Thunderbird + Bugzilla	Firefox
	Firefox + Thunderbird + Bugzilla	Core
	Core + Firefox + Bugzilla	Thunderbird
	Core + Firefox + Thunderbird	Bugzilla
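
The two settings in Table 2 can be enumerated mechanically from the four project names. The following is a minimal sketch, not the authors' code; the function names `single_source_pairs` and `leave_one_project_out` are hypothetical and introduced only for illustration.

```python
from itertools import permutations

PROJECTS = ["Core", "Firefox", "Thunderbird", "Bugzilla"]


def single_source_pairs(projects):
    """Setting A: every ordered (source, target) pair with source != target."""
    return list(permutations(projects, 2))


def leave_one_project_out(projects):
    """Setting B: for each target, train on all remaining projects."""
    return [([p for p in projects if p != target], target) for target in projects]


pairs = single_source_pairs(PROJECTS)
lopo = leave_one_project_out(PROJECTS)

for source, target in pairs:
    print(f"{source} -> {target}")
for sources, target in lopo:
    print(f"{' + '.join(sources)} -> {target}")
```

With four projects this yields 4 × 3 = 12 directed pairs for setting A and 4 leave-one-project-out configurations for setting B, matching the counts stated in the table.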

Table 3. Hyperparameter configuration used in all experiments.

Parameter	Value
Backbone models	DistilBERT-base-uncased, TinyBERT-4L-312D
Optimizer	AdamW
Learning rate	2 × 10⁻⁵
Batch size	32
Epochs	10
Max sequence length	128
Loss function	Weighted Cross-Entropy
Domain alignment	MMD (Gaussian kernel)
Alignment coefficient (λ)	0.5
Random seeds	42, 52, 62, 72, 82
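
To make the "Domain alignment" and "Alignment coefficient (λ)" rows concrete, the sketch below computes a biased estimate of the squared MMD with a Gaussian kernel and adds it to a classification loss with λ = 0.5. This is an illustration under stated assumptions rather than the authors' implementation: the kernel bandwidth `sigma`, the biased estimator, the additive form of the objective, and all function names are assumptions not given in Table 3.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors; sigma is an assumed bandwidth."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(source, target, sigma=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    def mean_kernel(xs, ys):
        total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(source, source)
        + mean_kernel(target, target)
        - 2.0 * mean_kernel(source, target)
    )


def joint_loss(classification_loss, source_feats, target_feats, lam=0.5, sigma=1.0):
    """Assumed objective: classification loss plus lam times the MMD penalty."""
    return classification_loss + lam * mmd_squared(source_feats, target_feats, sigma)


# Toy 2-D features standing in for transformer sentence representations.
source_feats = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1]]
near_target = [[0.0, 0.1], [0.1, 0.1], [0.2, 0.2]]
far_target = [[3.0, 3.0], [3.1, 2.9], [2.9, 3.2]]

print(mmd_squared(source_feats, near_target))
print(mmd_squared(source_feats, far_target))
print(joint_loss(1.0, source_feats, far_target))
```

The penalty is close to zero when source and target features overlap and grows as the two distributions move apart, which is the behaviour the alignment term relies on.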

Table 4. Baseline performance under single-source transfer settings.

Source → Target	LR Macro-F1	LR Weighted-F1	SVM Macro-F1	SVM Weighted-F1	CNN Macro-F1	CNN Weighted-F1
Core → Firefox	0.34	0.58	0.36	0.60	0.39	0.62
Core → Thunderbird	0.32	0.56	0.34	0.58	0.37	0.60
Core → Bugzilla	0.28	0.52	0.30	0.54	0.33	0.56
Firefox → Core	0.36	0.60	0.38	0.62	0.41	0.64
Firefox → Thunderbird	0.33	0.57	0.35	0.59	0.38	0.61
Firefox → Bugzilla	0.29	0.53	0.31	0.55	0.34	0.57
Thunderbird → Core	0.31	0.55	0.33	0.57	0.36	0.59
Thunderbird → Firefox	0.33	0.56	0.35	0.58	0.38	0.60
Thunderbird → Bugzilla	0.27	0.51	0.29	0.53	0.32	0.55
Bugzilla → Core	0.26	0.50	0.28	0.52	0.31	0.54
Bugzilla → Firefox	0.27	0.51	0.29	0.53	0.32	0.55
Bugzilla → Thunderbird	0.25	0.49	0.27	0.51	0.30	0.53
Average	0.30	0.54	0.32	0.56	0.35	0.57

Note: LR = TF-IDF + Logistic Regression; SVM = TF-IDF + Support Vector Machine; CNN = CNN + Word2Vec. Bold values indicate the highest performance for each source–target pair.
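
The gap between the Macro-F1 and Weighted-F1 columns follows from how the two averages treat class imbalance. The sketch below is a plain-Python illustration on invented toy labels (not data from this study); the function names are hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, labels):
    """F1 per class, computed from true positives, false positives, and false negatives."""
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


def weighted_f1(y_true, y_pred, labels):
    """Support-weighted mean of per-class F1: frequent classes dominate."""
    scores = per_class_f1(y_true, y_pred, labels)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in labels) / len(y_true)


# Toy example: a classifier that always predicts the majority class.
labels = ["Critical", "Blocker"]
y_true = ["Critical"] * 8 + ["Blocker"] * 2
y_pred = ["Critical"] * 10

macro = macro_f1(y_true, y_pred, labels)
weighted = weighted_f1(y_true, y_pred, labels)
print(f"Macro-F1 = {macro:.3f}, Weighted-F1 = {weighted:.3f}")
```

In this toy case the minority class is never predicted, so Macro-F1 falls well below Weighted-F1. The same pattern, with Macro-F1 consistently below Weighted-F1, appears in every row of the baseline tables.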

Table 5. Baseline performance under multi-source transfer settings.

Source → Target	LR Macro-F1	LR Weighted-F1	SVM Macro-F1	SVM Weighted-F1	CNN Macro-F1	CNN Weighted-F1
{Firefox, Thunderbird, Bugzilla} → Core	0.34	0.57	0.36	0.59	0.39	0.62
{Core, Thunderbird, Bugzilla} → Firefox	0.35	0.58	0.37	0.60	0.40	0.63
{Core, Firefox, Bugzilla} → Thunderbird	0.33	0.56	0.35	0.58	0.38	0.61
{Core, Firefox, Thunderbird} → Bugzilla	0.31	0.54	0.33	0.56	0.36	0.59
Average	0.33	0.56	0.35	0.58	0.38	0.61

Note: LR = TF-IDF + Logistic Regression; SVM = TF-IDF + Support Vector Machine; CNN = CNN + Word2Vec. Bold values indicate the highest Macro-F1 score in each transfer setting.

Table 6. Cross-project performance under single-source transfer.

Model	Source (Train)	Target (Test)	Macro-F1	Weighted-F1
DistilBERT	Core	Firefox	0.48 ± 0.02 *	0.66 ± 0.01 *
		Thunderbird	0.45 ± 0.02 *	0.63 ± 0.01 *
		Bugzilla	0.39 ± 0.03 *	0.58 ± 0.02 *
	Firefox	Core	0.50 ± 0.02 *	0.68 ± 0.01 *
		Thunderbird	0.47 ± 0.02 *	0.65 ± 0.01 *
		Bugzilla	0.41 ± 0.03 *	0.60 ± 0.02 *
	Thunderbird	Core	0.44 ± 0.02 *	0.62 ± 0.01 *
		Firefox	0.46 ± 0.02 *	0.64 ± 0.01 *
		Bugzilla	0.38 ± 0.03 *	0.57 ± 0.02 *
	Bugzilla	Core	0.36 ± 0.03 *	0.55 ± 0.02 *
		Firefox	0.37 ± 0.03 *	0.56 ± 0.02 *
		Thunderbird	0.35 ± 0.03 *	0.54 ± 0.02 *
Average (DistilBERT)			0.42 ± 0.03 *	0.61 ± 0.02 *
TinyBERT	Core	Firefox	0.45 ± 0.03	0.63 ± 0.02
		Thunderbird	0.42 ± 0.03	0.60 ± 0.02
		Bugzilla	0.36 ± 0.03	0.55 ± 0.02
	Firefox	Core	0.47 ± 0.02	0.65 ± 0.01
		Thunderbird	0.44 ± 0.03	0.62 ± 0.02
		Bugzilla	0.38 ± 0.03	0.57 ± 0.02
	Thunderbird	Core	0.41 ± 0.02	0.59 ± 0.01
		Firefox	0.43 ± 0.02	0.61 ± 0.01
		Bugzilla	0.35 ± 0.03	0.54 ± 0.02
	Bugzilla	Core	0.33 ± 0.03	0.52 ± 0.02
		Firefox	0.34 ± 0.03	0.53 ± 0.02
		Thunderbird	0.32 ± 0.03	0.51 ± 0.02
Average (TinyBERT)			0.39 ± 0.03	0.58 ± 0.02

Note: Bold values indicate the best performance for each source–target transfer setting. * indicates statistically significant improvement compared with the strongest baseline model based on the Wilcoxon signed-rank test (p < 0.05).

Table 7. Qualitative examples of transfer-based bug severity classification.

Bug Summary	True Severity	Predicted Severity	Observation
“Browser crashes when opening encrypted PDF”	Critical	Critical	Explicit failure cue
“UI alignment issue after plugin update”	Minor	Major	Ambiguous wording

Table 8. Cross-project performance under multi-source transfer.

Model	Training Projects	Target (Test)	Macro-F1	Weighted-F1
DistilBERT	Core + Thunderbird + Bugzilla	Bugzilla	0.56 ± 0.02 *	0.72 ± 0.01 *
	Firefox + Thunderbird + Bugzilla	Firefox	0.58 ± 0.02 *	0.74 ± 0.01 *
	Core + Firefox + Bugzilla	Core	0.54 ± 0.02 *	0.70 ± 0.01 *
	Core + Firefox + Thunderbird	Thunderbird	0.49 ± 0.03 *	0.66 ± 0.02 *
Average (DistilBERT)			0.54 ± 0.02 *	0.71 ± 0.01 *
TinyBERT	Core + Thunderbird + Bugzilla	Bugzilla	0.53 ± 0.02	0.69 ± 0.01
	Firefox + Thunderbird + Bugzilla	Firefox	0.55 ± 0.02	0.71 ± 0.01
	Core + Firefox + Bugzilla	Core	0.51 ± 0.02	0.67 ± 0.01
	Core + Firefox + Thunderbird	Thunderbird	0.46 ± 0.03	0.63 ± 0.02
Average (TinyBERT)			0.51 ± 0.02	0.68 ± 0.01

Note: Bold values indicate the best performance for each target project under the multi-source transfer setting. * indicates statistically significant improvement compared with the strongest baseline model based on the Wilcoxon signed-rank test (p < 0.05).

Table 9. Ablation study on domain alignment and multi-source training.

Model	Training Setting	Domain Adaptation	Macro-F1	Weighted-F1
DistilBERT	Single-source	No	0.45 ± 0.02	0.63 ± 0.02
	Multi-source	No	0.52 ± 0.02	0.69 ± 0.01
	Multi-source	MMD	0.56 ± 0.02 *	0.72 ± 0.01 *
Average (DistilBERT)			0.51 ± 0.02	0.68 ± 0.01
TinyBERT	Single-source	No	0.42 ± 0.02	0.60 ± 0.02
	Multi-source	No	0.49 ± 0.02	0.66 ± 0.01
	Multi-source	MMD	0.53 ± 0.02 *	0.69 ± 0.01 *
Average (TinyBERT)			0.48 ± 0.02	0.65 ± 0.01

Note: Bold values indicate the best performance for each transformer backbone under the ablation study setting. * indicates statistically significant improvement compared with the corresponding non-domain-adapted configuration based on the Wilcoxon signed-rank test (p < 0.05).

Table 10. Deployment-oriented metrics.

Model	Parameters	Model Size	Avg Inference Time (ms/Sample)	Memory Usage
BERT-base	110 M	~420 MB	22 ms	High
DistilBERT	66 M	~255 MB	12 ms	Medium
TinyBERT	14 M	~55 MB	6 ms	Low
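
As a rough sanity check on Table 10, model size can be approximated from the parameter count, and relative speed from the reported latencies. The sketch assumes 32-bit floating-point weights (4 bytes per parameter) and ignores any file-format overhead; that storage assumption and the helper name `approx_size_mib` are not taken from the manuscript.

```python
BYTES_PER_FLOAT32 = 4  # assumption: weights stored as 32-bit floats


def approx_size_mib(num_parameters, bytes_per_param=BYTES_PER_FLOAT32):
    """Approximate on-disk size in MiB from the parameter count alone."""
    return num_parameters * bytes_per_param / 2**20


# (parameter count, average inference time in ms per sample) as listed in Table 10
models = {
    "BERT-base": (110e6, 22.0),
    "DistilBERT": (66e6, 12.0),
    "TinyBERT": (14e6, 6.0),
}

baseline_ms = models["BERT-base"][1]
for name, (params, latency_ms) in models.items():
    size = approx_size_mib(params)
    speedup = baseline_ms / latency_ms
    print(f"{name}: ~{size:.0f} MiB, {speedup:.2f}x faster than BERT-base")
```

Under this assumption the estimates land close to the reported sizes (roughly 420, 252, and 53 MiB), and the latencies imply that DistilBERT and TinyBERT run about 1.8 and 3.7 times faster than BERT-base, respectively.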

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

