Cross-Project Defect Prediction Based on Domain Adaptation and LSTM Optimization

Cross-project defect prediction (CPDP) aims to predict software defects in a target project domain by leveraging information from different source project domains, allowing testers to identify defective modules quickly. However, CPDP models often underperform due to different data distributions between source and target domains, class imbalances, and the presence of noisy and irrelevant instances in both source and target projects. Additionally, standard features often fail to capture sufficient semantic and contextual information from the source project, leading to poor prediction performance in the target project. To address these challenges, this research proposes Smote Correlation and Attention Gated recurrent unit based Long Short-Term Memory optimization (SCAG-LSTM), which first employs a novel hybrid technique that extends the synthetic minority over-sampling technique (SMOTE) with edited nearest neighbors (ENN) to rebalance class distributions and mitigate the issues caused by noisy and irrelevant instances in both source and target domains. Furthermore, correlation-based feature selection (CFS) with best-first search (BFS) is utilized to identify and select the most important features, aiming to reduce the differences in data distribution among projects. Additionally, SCAG-LSTM integrates bidirectional gated recurrent unit (Bi-GRU) and bidirectional long short-term memory (Bi-LSTM) networks to enhance the effectiveness of the long short-term memory (LSTM) model. These components efficiently capture semantic and contextual information as well as dependencies within the data, leading to more accurate predictions. Moreover, an attention mechanism is incorporated into the model to focus on key features, further improving prediction performance.
Experiments are conducted on apache_lucene, equinox, eclipse_jdt_core, eclipse_pde_ui, and mylyn (AEEEM) and predictor models in software engineering (PROMISE) datasets and compared with the active learning-based method (ALTRA), the multi-source-based cross-project defect prediction method (MSCPDP), and the two-phase feature importance amplification method (TFIA) on AEEEM, and with the two-phase transfer learning method (TPTL), the domain adaptive kernel twin support vector machines method (DA-KTSVMO), and the generative adversarial long-short term memory neural networks method (GB-CPDP) on PROMISE datasets. The results demonstrate that the proposed SCAG-LSTM model enhances the baseline models by 33.03%, 29.15% and 1.48% in terms of F1-measure and by 16.32%, 34.41% and 3.59% in terms of Area Under the Curve (AUC) on the AEEEM dataset, while on the PROMISE dataset it enhances the baseline models' F1-measure by 42.60%, 32.00% and 25.10% and AUC by 34.90%, 27.80% and 12.96%. These findings suggest that the proposed model exhibits strong predictive performance.


Introduction
People are depending more and more on software systems in their daily lives due to the growth of computer science, software development technology, and digital information technology, which raises the standards for software quality [1][2][3]. Software system failure results from defects in the system, which puts the security of the system in grave danger [4,5]. Software defect prediction (SDP) technology, however, can make it easier for testers and software system developers to find bugs in software. As a result, comprehensive study of SDP technology is becoming increasingly crucial. SDP aims to assist software engineers in allocating scarce resources to enhance the quality of software products by devising an efficient method for predicting the errors in a particular software project [6][7][8]. Over time, numerous methods have been put forth and used in software development, assisting practitioners in allocating limited testing resources to modules that frequently exhibit defects. Early studies concentrated on within-project defect prediction (WPDP), which learned the SDP model from previous data of the same project and then used it to predict defects in upcoming releases [9][10][11]. Preliminary research on SDP models indicates that, given adequate sample data from the same project, a learned model predicts effectively within that project. Numerous software projects with adequate sample data have been maintained through long-term software development and research. As a result, researchers naturally consider using the sample data from well-known software projects to learn a model and apply it to predict defects in other software projects, which is the design principle of the cross-project defect prediction (CPDP) model [12][13][14][15].
Several CPDP models have been presented recently, and many academics have taken an interest in them [16][17][18]. In the realm of software defect prediction, machine learning approaches, such as decision trees and neural networks, have been successful, as evidenced by the literature, which also found that these models performed well [19][20][21]. However, there is still room for improvement in CPDP's predictive accuracy by minimizing distribution differences and class imbalance difficulties. There is a problem of class imbalance if there is a tendency towards considerably fewer modules with defects than modules without defects [22]. Because CPDP models may favor the majority class when classifying, the class imbalance issue may have an impact on their performance [23]. As a result, creating strategies to successfully address the class imbalance issue in software projects is a common area of study in CPDP. Numerous studies [24][25][26] have either suggested or assessed various strategies. When there is imbalance and overlap in the data, the prediction power decreases. However, the existence of noisy and irrelevant instances among source and target projects has a bigger impact on the prediction performance of CPDP models than class imbalance, which is not inherently problematic.
On the other hand, when predicting on the target project, a prediction model constructed using the pertinent data from the source project is unable to produce the best possible prediction performance. The primary cause is the significant difference in data distribution between the target and source projects. The distribution of features between projects and variations between instances account for the majority of the variation in data distribution. In the past, academics believed that the distributions of the source and target projects would be the same. In actuality, the source data for these projects are inherently distributed differently, since projects created by various teams and businesses are invariably distinct in terms of scope, function, and coding standards. In other words, the distribution of data may vary across projects. Thus, the efficacy of CPDP models depends on how well the distribution differences between source and target projects are minimized [27]. When creating a classifier model, including redundant and unnecessary features can make the final model perform worse [28]. Consequently, a feature selection process must be used to remove these superfluous or unnecessary features [29].
In previous CPDP studies, class imbalance and data distribution difference problems are present. Therefore, a number of factors drive our use of a data balancing method to handle the noisy, imbalanced nature of the dataset and a feature selection technique to mitigate distribution differences in software defect prediction:

1. Improve the performance of the model: Unbalanced, noisy datasets and different data distributions can negatively affect the CPDP model's performance by introducing bias, which can cause overfitting and a reduction in generalization, which can in turn lead to incorrect predictions. The synthetic minority oversampling technique with edited nearest neighbors (SMOTE-ENN) can help improve an imbalanced dataset and eliminate noisy and irrelevant instances, and a feature selection approach, such as CFS, can help to minimize data distribution differences and enhance the performance of the model.

2. Better feature representation: Minimizing noise and balancing the dataset to maintain the significant characteristics of the original data and reduce data distribution differences can help to find and choose the most relevant features. This can help the model learn more accurate feature representations and enhance model performance.

3. Reduce overfitting: Imbalanced datasets and different data distributions can lead to overfitting of the model. When data are imbalanced, the model prioritizes the majority class and overlooks the minority class; when data are distributed differently, the prediction of target data becomes ineffective. Balancing data and reducing noise in the dataset can help overcome the overfitting problem, simplifying the model in order to learn from the minority class, and feature selection can help minimize data distribution differences to prevent model overfitting.
This paper presents a unique supervised domain adaptive cross-project defect prediction (CPDP) framework termed SCAG-LSTM, which is based on feature selection and class imbalance techniques, to overcome the aforementioned issues. The fundamental goal of SCAG-LSTM is to lessen the problems of distribution differences, class imbalance, and the existence of noisy and irrelevant instances in order to increase the CPDP's predictive performance. To determine the efficacy of SCAG-LSTM, experiments are carried out on the AEEEM and PROMISE datasets using the widely used evaluation metrics F1-measure and AUC. Based on the experimental data, SCAG-LSTM performs better than the benchmark techniques.
The key contributions of this work are as follows:

1. We propose a novel CPDP model, SCAG-LSTM, that integrates SMOTE-ENN, CFS-BFS, and Bi-LSTM with Bi-GRU and an attention mechanism to construct a cross-project defect prediction model that enhances software defect prediction performance.

2. We demonstrate that the proposed novel domain adaptive framework reduces the effect of data distribution and class imbalance problems.

3. We optimize the LSTM model with Bi-GRU and an attention mechanism to efficiently capture semantic and contextual information and dependencies.

4. To verify the efficiency of the proposed approach, we conducted experiments on the PROMISE and AEEEM datasets to compare the proposed approach with existing CPDP methodologies.
This paper follows the following structure. Section 2 provides an overview of the relevant CPDP work. The presentation of our research methodology follows in Section 3. The experimental setups are shown in Section 4. The experimental results are presented in Section 5. The threats to internal, external, construct and conclusion validity are presented in Section 6, and conclusions and future work are covered in Section 7.

Related Work
In this section, we briefly review related work on cross-project defect prediction and domain adaptation. The key data processing techniques that address domain adaptation include feature selection, data balancing, and removing noisy and irrelevant instances from the dataset.

Cross-Project Defect Prediction
Software defect prediction (SDP) plays an essential role in software engineering as it helps in identifying possible flaws and vulnerabilities in software systems [6]. A large amount of research has been done over time to improve the efficacy of software defect prediction methods. To create reliable models that can spot flaws early in the software development lifecycle, researchers have looked into a number of techniques, such as machine learning, data mining, and statistical analysis. The majority of the earlier SDP models relied on WPDP, in which the data used for the training and evaluation stages come from the same project. Scholars have recently paid greater attention to the CPDP model [27,30]. The goal of cross-project defect prediction models is to leverage training data from other projects, so that defects in newly established projects can be predicted using the prediction models. For cross-project defect prediction, Liu et al. [31] proposed a two-phase transfer learning model (TPTL) that builds two defect predictors based on the two selected projects independently using TCA+ and combines their prediction probabilities to improve performance. The two closest source projects are chosen using a source project estimator. Zhou et al. [32] addressed the issue by performing a large-scale empirical analysis of 40 unsupervised systems. Three types of features and 27 project versions were included in the open-source dataset used in the experiment. According to the experiment's findings, models in the occurrence-violation-value based clustering family significantly outperformed models in the hierarchy, density, grid, sequence, and hybrid-oriented clustering families. A cluster-based feature selection technique was used by Ni et al. [33] to choose the important characteristics from the reference data and enhance its quality. Their experiments on eight datasets demonstrated that their proposed technique outperformed WPDP, standard CPDP, and TCA+ in terms of AUC and F-measure. Abdu et al. [34] presented GB-CPDP, a graph-based feature learning model for CPDP that uses Long Short-Term Memory (LSTM) networks to develop predictive models and Node2Vec to convert CFGs and DDGs into numerical vectors.

Domain Adaptation
In cross-project defect prediction, one typical challenge is the problem of imbalanced datasets and the existence of noisy and irrelevant instances among source and target projects. The issue of class imbalance has been noted in a number of fields [35] and significantly impairs prediction model performance in SDP. A number of methods, including undersampling, oversampling, and the synthetic minority oversampling technique (SMOTE), have been developed by researchers to lessen the effects of data imbalance and enhance prediction ability. Undersampling strategies are widely utilized to address class imbalance for two reasons: first, they are faster [36], and second, they do not experience overfitting as oversampling techniques do [37]. Elyan et al. [38] suggested an undersampling method based on neighborhoods to address class imbalance in classification datasets. The outcomes of the trial verified that it is quick and effectively addresses problems with class overlaps and imbalance. In order to create growing training datasets with balanced data, Gong et al. [39] presented a class-imbalance learning strategy that makes use of the stratification embedded in the nearest neighbor (STr-NN) concept. They first leverage TCA and then apply the STr-NN technique to the data to lessen the data distribution difference between the source and target datasets. In this area, managing the different data distributions across projects is another challenge. Using a variety of strategies, including feature selection, feature extraction, ensemble approaches, data preparation, and sophisticated machine learning algorithms, researchers have achieved great progress in this field. These methods seek to maximize computational efficiency and prediction performance while identifying the most relevant features. Appropriate feature selection can shorten learning times, increase learning efficiency, and simplify learning outcomes. To enhance the effectiveness of cross-project defect prediction, Jin [27] attempted to maximize the similarity between the feature distributions of the source and target projects by employing a kernel twin support vector machine (DA-KTSVM) to learn the domain adaptation model. Assuming that target features were successfully classified by the defect predictor, since they were matched to source instances, the feature generator was trained with the goal of matching distributions between two distinct projects. Kumar et al. [40] proposed a novel feature-selection approach that combines filter and wrapper techniques to select optimal features using Mutual Information with the Sequential Forward Method and 10-fold cross-validation. They carried out dimensionality reduction using a feature selection technique to enhance accuracy.
In the existing CPDP research, although many related issues have been discussed, including class imbalance learning, data distribution differences, feature transformation, etc., the issue of noisy and irrelevant instances in the source and target domains, which can affect CPDP performance, has only been briefly examined [26,27]. Therefore, in order to enhance the effectiveness of cross-project defect prediction, we propose a novel supervised domain adaptive framework called SCAG-LSTM. A summary of the related work discussed above is presented in Table 1, listing the datasets, techniques, and evaluation measures used, along with their advantages and limitations.
Table 1. Summary of the related work discussed above.

| Study | Technique | Datasets | Findings | Limitations |
|---|---|---|---|---|
| Liu et al. [31] | Built and assessed a two-phase CPDP transfer learning model (TPTL). | PROMISE datasets | Discovered that the model effectively lessened the TCA+ instability issue. | The study was an attempt to provide a process for choosing quality source projects; there are no suggestions for feature engineering, preprocessing techniques, or random datasets lacking comparable metrics. |
| Zhou et al. [32] | Large-scale empirical analysis of 40 unsupervised defect prediction models. | Open-source dataset with 27 project versions | The performance of the various clustering-based models varied significantly, and the clustering-based unsupervised systems did not always perform better on defect data when the three types of features were combined. | The feature engineering improvements and time/cost improvements required for the chosen unsupervised DP models were not included in the study. |
| Ni et al. [33] | Cluster-based feature selection to choose important characteristics from the reference data. | Relink and AEEEM | Better outcomes compared to WPDP, conventional CPDP, and TCA+ in terms of AUC and F-measure. | Limited emphasis was placed on feature selection in favor of instance selection in order to minimize the distribution divergence between the target and reference data. |
| Elyan et al. (2019) [38] | Undersampling to eliminate overlapped data points in order to address class imbalance in binary datasets. | Simulated and real-world datasets | Offered four approaches based on neighborhood searching with various criteria to find and remove instances of the majority class. | Processing times are lengthened when the application is limited to one minority class at a time. |
| Jin [27] | KTSVMs with DA functions (DA-KTSVM) employed as the CPDP model. | Open-source datasets | DA-KTSVMO was able to outperform WPDP models in terms of prediction accuracy as well as outperform other CPDP models. | The study recommended that the best use be made of the sufficient data already in existence, with consideration given to the reuse of deficient data; the model's performance feasibility should be assessed to validate the system. |

Methodology
In this section, we first present the framework of our proposed approach, SCAG-LSTM. A sequence of actions, including feature selection, dataset balancing, noisy instance removal, and model building, is then explained as part of our proposed methodology.

Proposed Approach Framework
In this section, we illustrate our proposed approach for CPDP, employing machine learning models (Bi-LSTM, Bi-GRU and an attention mechanism) combined with a feature selection method (CFS with best-first search) and a data sampling method (SMOTE-ENN). The datasets were obtained from AEEEM and PROMISE as source (S) and target (T) projects. In order to select the features that are most relevant to the target class, feature selection is employed. Then a data balancing method is applied to balance the training dataset and remove noisy and irrelevant instances in the source and target domains. The selected balanced source project (Ss) is then used to train the model. Finally, the trained model is utilized to predict the labels of the selected balanced target project (Ts), and the results are compared based on the AUC and F1-measure performance metrics. Figure 1 demonstrates the full workflow of the proposed approach.

Proposed Feature Selection Approach
The prediction performance of CPDP models might be lowered by the presence of redundant or irrelevant features. By focusing on the most relevant features, the model may better capture the underlying patterns in the data and produce more accurate forecasts of defects in new projects [41]. We employ correlation-based feature selection (CFS) in the proposed framework, which chooses a subset of features by taking into consideration each feature's individual predictive capacity, as well as the degree of redundancy among them. The best-first search technique is employed to discover a feature subset in which the features have a strong correlation with the target class labels, while having a low correlation with each other (Figure 2). CFS begins by employing a correlation metric to determine the value of each individual feature F and each feature set f_1, f_2, ..., f_n, as illustrated in Algorithm 1. It initializes the feature set f_i alongside the empty set S. Then, it selects the feature with the highest merit and adds it to the selected subset, while removing it from the remaining features. The process continues until the merit no longer changes considerably. Eventually, the algorithm returns the chosen subset S, which includes the most informative features for CPDP defect prediction.
Algorithm 1. Correlation-based feature selection with best-first search.
1. Initialize an empty set S and a feature set F = {f_1, f_2, ..., f_n}.
2. Compute the merit of each feature f_i in F using a suitable metric (e.g., correlation).
3. Select the feature with the highest merit from F, add it to S, and remove it from F.
4. Compute the merit of the current subset S.
5. Select the feature from F with the highest merit, compute the merit of the updated subset, and compare it with the previous merit.
6. If the merit of the new subset is not better than that of the previous subset:
7.   remove the selected feature from S.
8. Otherwise, update S and remove the selected feature from F.
9. Repeat steps 5-8 until the merit does not change significantly.
10. Return the selected features S.
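As an illustration, the greedy search above can be sketched in plain Python. This is a minimal sketch, not the authors' implementation: the merit function uses the standard CFS formula k·r_cf / sqrt(k + k(k−1)·r_ff), Pearson correlation stands in for the correlation metric, and the toy feature names (loc, loc_copy, noise) are hypothetical.

```python
import math

def pearson(x, y):
    # Pearson correlation of two equal-length numeric sequences
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def merit(subset, r_cf, r_ff):
    # Standard CFS merit: k * mean|feature-class corr| /
    # sqrt(k + k(k-1) * mean|feature-feature corr|)
    k = len(subset)
    avg_cf = sum(abs(r_cf[f]) for f in subset) / k
    if k == 1:
        return avg_cf
    pairs = [(f, g) for i, f in enumerate(subset) for g in subset[i + 1:]]
    avg_ff = sum(abs(r_ff[p]) for p in pairs) / len(pairs)
    return k * avg_cf / math.sqrt(k + k * (k - 1) * avg_ff)

def cfs_forward_select(features, labels):
    names = list(features)
    r_cf = {f: pearson(features[f], labels) for f in names}
    r_ff = {}
    for i, f in enumerate(names):
        for g in names[i + 1:]:
            c = pearson(features[f], features[g])
            r_ff[(f, g)] = r_ff[(g, f)] = c
    selected, best = [], 0.0
    remaining = list(names)
    while remaining:
        # Greedily pick the candidate that maximizes subset merit (steps 3-5)
        cand = max(remaining, key=lambda f: merit(selected + [f], r_cf, r_ff))
        m = merit(selected + [cand], r_cf, r_ff)
        if m <= best:  # merit stopped improving: terminate (steps 6-9)
            break
        selected.append(cand)
        remaining.remove(cand)
        best = m
    return selected

# Toy example: one informative feature, one redundant copy, one noise feature
features = {
    "loc":      [1, 2, 3, 4, 5, 6, 7, 8],
    "loc_copy": [1, 2, 3, 4, 5, 6, 7, 8],
    "noise":    [1, 5, 2, 6, 1, 5, 2, 6],
}
labels = [0, 0, 0, 0, 1, 1, 1, 1]
selected = cfs_forward_select(features, labels)
```

On this toy data, the redundant copy and the noise feature are both rejected, because neither improves the merit of the subset beyond the single informative feature.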

Proposed Imbalanced Learning Approach
The term "class imbalance" refers to circumstances in which there are significantly fewer examples of one class than of others. Models trained on imbalanced datasets tend to exhibit a bias towards the majority class, leading to significant false negative rates, where real defects are neglected or misclassified as non-defective cases. Moreover, the presence of noisy and irrelevant instances among source and target projects also influences the prediction performance of CPDP models. A number of data balancing techniques have been developed to address the issue of imbalanced classes. Data sampling is the most widely used balancing approach. While undersampling approaches lower the number of instances from the majority class [42], oversampling strategies boost the representation of the minority class [43]. Our proposed approach uses the Synthetic Minority Oversampling Technique (SMOTE) [44] with Edited Nearest Neighbor (ENN) [45], which combines the ability of Edited Nearest Neighbor to clean data and the ability of SMOTE to generate synthetic data in order to produce a dataset that is more representative and balanced. As seen in Figure 3, the SMOTE-ENN approach consists of two basic steps: oversampling the minority class using SMOTE and cleaning the generated dataset using Edited Nearest Neighbor. The pseudocode of SMOTE-ENN is demonstrated in Algorithm 2.
The SMOTE algorithm first separates minority (m_i) and majority (m_a) samples based on the class labels y and then determines the k closest neighbors from the same class for each instance in the minority class m_i. It then randomly chooses one of the k nearest neighbors x and computes the difference between the feature vectors of the instance and the selected neighbor (x − x_i). This difference is multiplied by a random value δ between 0 and 1 and added to the feature vector of the instance x_i, which builds a synthetic instance x_new. After oversampling the minority class m_i, the final dataset may still contain noisy and irrelevant instances. This is where the Edited Nearest Neighbor (ENN) algorithm comes into play. The ENN algorithm finds the three nearest neighbors of each instance x_j and removes instances x_j that are misclassified by their nearest neighbors, hence improving the quality of the dataset.

Algorithm 2. SMOTE-ENN.
1. Separate the minority and majority classes as m_i and m_a.
2. Identify m_i and m_a samples based on the class labels in y.
3. For each m_i sample x_i, calculate its k-nearest neighbors.
4. Select one of the nearest neighbors x at random.
5. Generate a new synthetic sample x_new = x_i + δ(x − x_i).
6. Repeat the above steps until the desired balance ratio is achieved.
7. For each instance x_j in the augmented dataset:
8.   find the three nearest neighbors of x_j;
9.   if x_j is misclassified by its three nearest neighbors, delete x_j.
10. Return the resampled dataset.
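A minimal pure-Python sketch of Algorithm 2, assuming binary labels (1 = defective minority, 0 = majority) and Euclidean distance; in practice a library implementation such as imbalanced-learn's SMOTEENN would be used, and the toy clusters below are hypothetical.

```python
import math
import random

def smote_enn(X, y, k=3, seed=0):
    """Oversample the minority class (label 1) with SMOTE, then clean with ENN."""
    rng = random.Random(seed)
    minority = [x for x, lab in zip(X, y) if lab == 1]
    majority = [x for x, lab in zip(X, y) if lab == 0]

    # SMOTE: interpolate each synthetic sample between a minority point
    # and one of its k nearest minority neighbors (steps 3-6)
    synthetic = []
    while len(minority) + len(synthetic) < len(majority):
        x = rng.choice(minority)
        neighbors = sorted((m for m in minority if m is not x),
                           key=lambda m: math.dist(x, m))[:k]
        nb = rng.choice(neighbors)
        delta = rng.random()  # random value in [0, 1)
        synthetic.append(tuple(a + delta * (b - a) for a, b in zip(x, nb)))

    X2 = majority + minority + synthetic
    y2 = [0] * len(majority) + [1] * (len(minority) + len(synthetic))

    # ENN: drop any sample misclassified by its 3 nearest neighbors (steps 7-9)
    kept_X, kept_y = [], []
    for i, (x, lab) in enumerate(zip(X2, y2)):
        order = sorted((j for j in range(len(X2)) if j != i),
                       key=lambda j: math.dist(x, X2[j]))[:3]
        votes = sum(y2[j] for j in order)  # number of minority-class neighbors
        if (votes >= 2) == (lab == 1):     # majority vote agrees with label
            kept_X.append(x)
            kept_y.append(lab)
    return kept_X, kept_y

# Toy data: 10 majority points near (0, 0), 4 minority points near (5, 5)
X = [(0.1 * i, 0.1 * i) for i in range(10)] + \
    [(5 + 0.1 * i, 5 - 0.1 * i) for i in range(4)]
y = [0] * 10 + [1] * 4
X_res, y_res = smote_enn(X, y)
```

Because the two toy clusters are well separated, ENN removes nothing here and the result is perfectly balanced; on overlapping real data, the ENN step prunes borderline and noisy samples from both classes.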

Model Building
Our learning model architecture comprises four layers, starting with the Bi-LSTM layer, which consists of 220 nodes. The second and third layers are Bi-GRU and LSTM, with 220 nodes each. For acquiring important features, we use an attention layer. All dense layers utilize tanh as their activation function; the activation function applied in the last layer is softmax. The term "dropout" describes the possibility that any given node will be eliminated, or dropped out, as seen in Figure 4. Algorithm 3 illustrates our model's detailed processing. In the proposed approach, for each source dataset S[n], the CFS algorithm: (1) calculates the merit of each feature; (2) selects the top k features based on the highest merit and creates the S_selected dataset.

The top k features are selected based on the highest merit values, and the S_selected dataset is created by combining the selected features from all source datasets. The merit of a feature f in a dataset S can be calculated using Equation (1), where Metric(f, S) is a feature selection correlation metric. Then the CFS algorithm selects the features from the target dataset T that correspond to the features selected for the source dataset S[n]. The selected features from all target datasets are combined to create the T_selected dataset, as in Equation (2), where f is a feature that exists in both the target dataset T and the S_selected dataset.
To address class imbalance issues and remove noisy and irrelevant instances in the source and target datasets, the SMOTE-ENN (Synthetic Minority Over-sampling Technique with Edited Nearest Neighbors) technique is applied. SMOTE-ENN works by: (1) Generating synthetic samples of the minority class using the k-nearest neighbors algorithm.
(2) Removing any noisy or redundant samples from the resampled dataset using the Edited Nearest Neighbors (ENN) algorithm.
The resampled datasets are denoted as S_resampled and T_resampled. The resampled source and target datasets are then split into training and testing sets. The split is carried out using an 80:20 ratio, where 80% of the data are used for training and 20% for testing. The training and testing sets are denoted as S_x, S_y (source dataset) and T_x, T_y (target dataset). The algorithm builds a sequential neural network model with the following layers: (1) Bi-LSTM (Bidirectional Long Short-Term Memory) with 220 nodes; (2) Bi-GRU (Bidirectional Gated Recurrent Unit) with 220 nodes; (3) LSTM with 220 nodes; (4) an attention layer. The Bi-LSTM and Bi-GRU layers are combined to optimize the LSTM to capture the sequential and contextual information in the data. The attention layer is added to allow the model to focus on the most relevant parts of the input data when making predictions. Then, the model is trained using the T_resampled dataset. The training process can be represented by Equation (3), where model is the sequential neural network model built in the previous step and T_resampled is the resampled target dataset. The trained model is then used to predict the defects in the T_y (testing) dataset. The prediction can be represented by Equation (4), where Result is the predicted defects for the T_y dataset. The algorithm returns Result as the final output, which represents the predicted defects for the target dataset T_y.
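The split/train/predict flow around Equations (3) and (4) can be sketched as follows. This is a hedged illustration of the interface only: MajorityClassModel is a hypothetical stand-in for the SCAG-LSTM network, and the toy data are invented.

```python
import random

def split_80_20(X, y, seed=0):
    # Shuffle indices and split 80% train / 20% test
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(0.8 * len(idx))
    train, test = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])

class MajorityClassModel:
    """Hypothetical stand-in for the SCAG-LSTM network:
    always predicts the most frequent training label."""
    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label for _ in X]

# Hypothetical resampled target data (T_resampled)
T_x = [[i, i + 1] for i in range(10)]
T_y = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0]
train_x, train_y, test_x, test_y = split_80_20(T_x, T_y)
model = MajorityClassModel().fit(train_x, train_y)  # Equation (3): train the model
result = model.predict(test_x)                      # Equation (4): Result = predictions
```

In the actual framework, the stand-in classifier is replaced by the four-layer Bi-LSTM/Bi-GRU/LSTM/attention network described above; the surrounding split-train-predict scaffolding is unchanged.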
The Bi-directional Long Short-Term Memory (Bi-LSTM) and Gated Recurrent Unit (GRU) Layers: The proposed model first leverages a bi-directional LSTM network to learn the contextual and semantic features. This model has been selected due to its ability to ease the vanishing gradient problem and capture long-term dependencies [46]. Bi-LSTM is an extension of LSTM which comprises two LSTM layers, one processing the sequence in the forward direction and the other in the backward direction. The backward LSTM layer processes the sequence in reverse order, whereas the forward LSTM layer processes the sequence from start to finish. To create the final output, the hidden states from both layers are concatenated [47]. The Bi-LSTM is defined by Equations (5)-(7), where →h_t is the state of the forward LSTM, ←h_t is the state of the backward LSTM, and ⊕ signifies the operation of concatenating two vectors. To generate the final hidden state h_t = [→h_t, ←h_t] at time step t, the forward layer's final output →h_t and the backward layer's reverse output ←h_t are merged. The Bi-directional Gated Recurrent Unit (Bi-GRU) has the ability to learn from prior and subsequent data when dealing with the present data. Two unidirectional GRUs pointing in different directions are used to determine the state of the Bi-GRU model. One GRU starts at the beginning of the data series and moves forward, and the other starts at the end and moves backward. This makes it possible for information from both the past and the future to affect the current states. The Bi-GRU is defined by Equations (8)-(10), where →h_t is the state of the forward GRU, ←h_t is the state of the backward GRU, and ⊕ signifies the operation of concatenating two vectors. The model is able to produce more precise predictions because this bi-directional representation records contextual information from both the past and the future.
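A minimal numeric sketch of the bidirectional scheme: a plain tanh recurrence stands in for the LSTM/GRU gates, and the scalar weights w_x and w_h are hypothetical; the point is the per-time-step concatenation h_t = [→h_t, ←h_t].

```python
import math

def rnn_pass(seq, w_x=0.5, w_h=0.3):
    # Simple recurrent cell: h_t = tanh(w_x * x_t + w_h * h_{t-1})
    h, states = 0.0, []
    for x in seq:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def bidirectional(seq):
    forward = rnn_pass(seq)  # left-to-right states →h_t
    # right-to-left states ←h_t, re-aligned to the original time order
    backward = list(reversed(rnn_pass(list(reversed(seq)))))
    # Concatenate per time step: h_t = [→h_t, ←h_t]
    return [(f, b) for f, b in zip(forward, backward)]

states = bidirectional([1.0, -1.0, 0.5, 2.0])
```

Each output element carries context from both directions: the first component has seen the prefix up to t, the second has seen the suffix from t onward.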
Attention Layer: We embed the attention layer solely to amplify the influence of key features and focus on them, as shown in Equation (11). First, we input the values u_it representing the hidden states to be scaled.
Then, u_n is employed to identify the important properties of the sequence. The model then produces the normalized adaptive weights via a softmax operation by computing the dot product between u_it and u_n, as shown in Equation (12).
Ultimately, the sequence vector is generated by the weighted summation of each node, as shown in Equation (13).
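The scoring, softmax normalization, and weighted summation of Equations (11)-(13) can be sketched as follows; the hidden states and the context vector u_n are plain Python lists here, and all values are illustrative rather than learned:

```python
import math

def attention_pool(hidden_states, context):
    """Score each hidden state u_it against a context vector u_n,
    normalize the scores with softmax, and return the attention
    weights plus the weighted sum (sequence vector)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = [dot(h, context) for h in hidden_states]  # u_it . u_n
    m = max(scores)                        # stabilize the softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]     # normalized adaptive weights
    dim = len(hidden_states[0])
    pooled = [sum(a * h[d] for a, h in zip(alphas, hidden_states))
              for d in range(dim)]         # weighted summation (Eq. 13)
    return alphas, pooled
```

Because the weights come from a softmax, they sum to one, so the sequence vector is a convex combination of the hidden states, emphasizing the most relevant ones.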

Experimental Setups
In this section, we provide a detailed description of our experimental setup, together with the benchmark datasets, evaluation metrics, baseline methods, and research questions.

Benchmark Datasets
The experiments were conducted on 10 open-source software projects: five Java open-source projects from AEEEM and five projects randomly picked from PROMISE. The difficulty of the software development process and the complexity of the software code were considered when the AEEEM database was assembled by D'Ambros [48]. The PROMISE dataset, which contains the most varied project features, was created by Jureczko and Madeyski [49]. Table 2 presents detailed information on the selected projects, including the project names, the numbers of instances, and the percentage of defective modules.

Evaluation Metrics
We analyze our suggested model's performance based on two evaluation metrics, F1-measure and AUC. The F1-measure is a typical metric used to evaluate the performance of a classification model. It is the harmonic mean of precision and recall, balancing the two metrics, as shown in Equation (14). The Area Under the Receiver Operating Characteristic curve (AUC) is used to evaluate the degree of differentiation achieved by the model. Unlike recall and precision, it is insensitive to decision thresholds. All potential classification thresholds are represented by a curve that plots the true positive rate on the y-axis against the false positive rate on the x-axis, as shown in Equation (15). The higher the AUC, the better the prediction, where M and N represent the numbers of positive and negative cases, and Σ_{ins_i ∈ positive class} rank(ins_i) is the sum of the ranks of the positive samples.
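Both metrics can be computed directly from their definitions. The sketch below implements the F1-measure from true-positive, false-positive, and false-negative counts, and the rank-statistic form of AUC implied by Equation (15); tie handling is omitted for simplicity, so this is a sketch rather than a production metric:

```python
def f1_measure(tp, fp, fn):
    """F1 = harmonic mean of precision and recall (Equation (14))."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc_rank(scores, labels):
    """Rank-statistic AUC: with M positives and N negatives,
    AUC = (sum of positive ranks - M*(M+1)/2) / (M*N).
    Ranks are 1-based positions after sorting by score; ties are
    not handled in this sketch."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum = sum(r + 1 for r, i in enumerate(order) if labels[i] == 1)
    m = sum(labels)            # number of positive cases (M)
    n = len(labels) - m        # number of negative cases (N)
    return (rank_sum - m * (m + 1) / 2) / (m * n)
```

With perfectly separated scores the rank formula yields an AUC of 1.0, and with fully inverted scores it yields 0.0, matching the intuition that a higher AUC means better discrimination.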

Baseline Models
To test the efficacy of our SCAG-LSTM method, a comparative analysis is conducted that evaluates its prediction performance against six state-of-the-art CPDP approaches: three for the AEEEM and three for the PROMISE datasets. These approaches are summarized in Table 3.

Research Questions
In this section, we discuss the motivations along with the research questions. This work aims to address the following research questions. RQ1: Does balancing the data and removing noisy instances improve the performance of our proposed model SCAG-LSTM?
This research question investigates the effectiveness of our data-balancing method to improve the performance of the proposed model in CPDP.
RQ2: Does the feature selection approach suggested in this paper have any impact on the performance of the SCAG-LSTM model?
This research question analyzes the usefulness of our feature selection approach to improve the performance of the proposed model in CPDP.
RQ3: How effective is our SCAG-LSTM model? How much improvement can SCAG-LSTM achieve over the related models?
This research question seeks to determine how well the proposed method performs in CPDP in comparison to existing state-of-the-art approaches.
The motivation for the above-mentioned research questions is driven by the use of feature selection and data balancing approaches in CPDP studies. The most recent CPDP research [22–24] indicates that, in order to build a model that can accurately predict defects and prevent the model from being biased toward the majority class, it is crucial to apply data balancing methods to handle the imbalanced nature of the dataset and to remove noisy and irrelevant instances. Rao et al. [42] integrated an undersampling method to balance imbalanced datasets. Sun et al. [51] explored undersampling techniques and showed that undersampling easily handles the imbalanced nature of the data. Moreover, feature selection methods reduce data distribution differences in order to increase the efficiency and efficacy of defect prediction models [52]. Lei et al. [53] introduced a cross-project defect prediction technique based on feature selection and distance-weighted instance transfer, demonstrating the efficiency of feature selection in the suggested method. Zhao et al. [50] suggested a multi-source-based cross-project defect prediction technique, MSCPDP, which can handle the problem of data distribution differences between source and target domains.

Experimental Results
In this section, we present the experimental results and provide the answers to the research questions from Section 4.4.

Research Question-RQ1
In order to address RQ1, Table 4 reports the proposed model's performance on the AEEEM dataset with and without data balancing, and Table 5 reports the prediction model's performance on the PROMISE dataset. On the unbalanced datasets, our model's averages (F1-measure and AUC) are 0.503 and 0.466 for AEEEM and 0.380 and 0.510 for PROMISE. On the balanced datasets, the averages (F1-measure and AUC) are 0.891 and 0.801 for AEEEM and 0.643 and 0.680 for PROMISE. From Tables 4 and 5, it can be observed that the overall performance of the prediction model trained on data processed by data balancing is significantly improved, with an average F1-measure improvement of about 77.34% and AUC improvement of about 71.98% on AEEEM, and an F1-measure improvement of about 69.21% and AUC improvement of about 33.33% on PROMISE, compared to without data balancing. Figures 5 and 6 show box plots of the performance measures (F1-measure and AUC) for the AEEEM and PROMISE datasets, respectively, with and without data balancing. Compared to the data without balancing, the prediction models trained on the balanced data occupy higher numerical intervals in the overall distribution. As a result, we can conclude that the data balancing strategy put forth in this study is effective in enhancing the CPDP model's performance across all datasets.
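The reported gains follow from simple relative-change arithmetic on the averages above; small deviations from the stated AEEEM percentages are presumably due to rounding or per-project averaging in the original tables:

```python
def pct_improvement(before, after):
    """Relative improvement (%) of a metric average after balancing."""
    return (after - before) / before * 100

# PROMISE averages from the text: F1 0.380 -> 0.643, AUC 0.510 -> 0.680
f1_gain = pct_improvement(0.380, 0.643)   # ~69.2%, matching the reported 69.21%
auc_gain = pct_improvement(0.510, 0.680)  # ~33.3%, matching the reported 33.33%
```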

Research Question-RQ2
In order to address RQ2, Tables 6 and 7 present the prediction model's performance on the AEEEM and PROMISE datasets with and without feature selection. On the datasets without feature selection, our model's averages (F1-measure and AUC) are 0.678 and 0.637 on AEEEM and 0.443 and 0.496 on PROMISE. On the feature-selected datasets, the averages (F1-measure and AUC) are 0.891 and 0.801 on AEEEM and 0.643 and 0.680 on PROMISE. From Tables 6 and 7, it can be observed that the overall performance of the prediction model trained on data processed by feature selection is significantly improved in both F1-measure and AUC compared to without feature selection. Box plots of the performance measures (F1-measure and AUC) with and without feature selection are shown in Figure 7 for the AEEEM datasets and in Figure 8 for the PROMISE datasets. We observe that our proposed model shows good results on the feature-selected datasets, indicating that the proposed model performs well and that the feature selection method plays a significant role in enhancing prediction performance.

Research Question-RQ3
In order to address RQ3, we compared our model's results with those of the baseline models based on two metrics: AUC and F1-measure. Tables 8 and 9 compare the results of our model and the baseline models on the AEEEM and PROMISE datasets, respectively.

External Validity
The term "external validity" describes the extent to which the findings of a study can be transferred to other investigations. The primary risk to the external validity of this study arises from our attempt to identify and collect diverse datasets from different AEEEM and PROMISE projects. For our evaluation datasets, we chose five open-source Java projects from PROMISE and five open-source applications from the AEEEM dataset. Additional experiments on different datasets will allow us to confirm the correctness of our strategy. Furthermore, we cannot claim that our findings are broadly applicable. To ensure that the results of this study can be applied to a larger population, future replication is required.

Construct Validity
Construct validity pertains to the design of the study and its ability to accurately represent the true objective of the investigation. To combat these risks to our study design, we used a methodical related-work evaluation technique. We double-checked and made multiple adjustments to the research questions to ensure they were pertinent to the study's objective. The metrics taken into consideration can also threaten our research. We use static code metrics exclusively for fault prediction; as a result, we cannot say whether our findings apply to other parameters. Nevertheless, much earlier research has also commonly used static code metrics, and further research on performance measures is planned. The creation of the ML models is another threat. We considered a number of factors that might have affected the study, such as feature selection, data pre-processing, data balancing techniques, how the models are trained, and which features to evaluate. However, the methods used here are accurate enough to guarantee the validity of the study.

Conclusion Validity
The degree to which the research conclusion is derived in a reasonable manner is referred to as conclusion validity.We conducted numerous experiments on sufficient projects in this study to reduce the risk to the validity of the conclusions.As a result, the outcomes derived from the collected experimental data should be statistically reliable.

Conclusions and Future Work
Defect prediction and quality assurance are critical in today's rapidly changing software development environment to guarantee the stability and dependability of software projects. Accurately and efficiently predicting defects across many software projects is a major challenge in this industry. In order to improve the predictive performance of the cross-project defect prediction (CPDP) model, we combined a variety of data pre-processing, feature selection, data sampling, and modeling approaches in our study to propose a comprehensive approach to addressing the difficulties associated with CPDP. To improve on the existing state-of-the-art approaches to predicting software defects, we proposed a novel domain adaptive approach, which first integrates CFS-BFS and SMOTE-ENN to overcome the domain adaptation problems related to data distribution differences and class imbalance, enhancing the model's performance and making it more robust and capable of handling real-world software defect prediction scenarios. Furthermore, it optimizes LSTM with Bi-GRU and an attention mechanism to capture complex patterns and dependencies in the data, while the attention layer provides insight into which features and instances are most influential in making predictions. We conducted a number of experiments on 10 publicly available projects from the AEEEM (apache_lucene, equinox, eclipse_jdt_core, eclipse_pde_ui, and mylyn) and PROMISE (predictor models in software engineering) datasets in order to assess the efficacy of the suggested model against the baseline models, i.e., the active learning-based method (ALTRA), the multi-source-based cross-project defect prediction method (MSCPDP), the two-phase feature importance amplification method (TFIA), the domain adaptive kernel twin support vector machines method (DA-KTSVMO), the two-phase transfer learning method (TPTL), and the generative adversarial long short-term memory neural networks method (GB-CPDP), and the outcomes were contrasted. According to our findings, the suggested model outperforms the baseline models.

Figure 1 .
Figure 1. Overview of the proposed methodology for CPDP.

Algorithms 2024, 17, 175
Figure 3 .
Figure 3. Flowchart of SMOTE-ENN. The pseudocode of SMOTE-ENN is demonstrated in Algorithm 2. The SMOTE algorithm first separates the minority and majority samples based on the class labels and then determines the k closest neighbors from the same class for each instance x_i in the minority class. It then randomly chooses one of the k nearest neighbors x_nn and computes the difference between the feature vectors of the instance and the selected neighbor (x_nn − x_i). This difference is multiplied by a random value δ between 0 and 1 and added to the original feature vector to generate a new synthetic sample.

Figure 4 .
Figure 4. Architecture of the proposed model.

Figure 5 .
Figure 5. Boxplot of F1-measure and AUC of model with and without Smote-Enn on AEEEM.

Figure 6 .
Figure 6. Boxplot of F1-measure and AUC of model with and without Smote-Enn on PROMISE.

Figure 7 .
Figure 7. Boxplot of F1-measure and AUC of model with and without FS on AEEEM.

Figure 8 .
Figure 8. Boxplot of F1-measure and AUC of model with and without FS on PROMISE.

Table 1 .
Summary of Related Work.
1. Separate the minority and majority classes as D_min and D_maj
2. Identify minority and majority samples based on the class labels in D
3. For each minority sample x_i, calculate its k-nearest neighbors
4. Select one of the k nearest neighbors x_nn at random
5. Generate a new synthetic sample x_new = x_i + δ(x_nn − x_i), where 0 ≤ δ ≤ 1
6. Repeat the above steps until the desired balance ratio is achieved
7. For each instance x in the augmented dataset:
8.    Find the three nearest neighbors of x
9.    If x is misclassified by its three nearest neighbors, delete x
- Source Datasets: {S_1, S_2, ..., S_n} - Target Dataset: T
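The two phases of the SMOTE-ENN procedure above (synthetic oversampling, then ENN cleaning) can be sketched in Python; the function and variable names are illustrative, not taken from the paper's implementation:

```python
import random

def smote_sample(x_i, x_nn, rng=None):
    """SMOTE step 5: x_new = x_i + delta * (x_nn - x_i) with
    0 <= delta <= 1, so the synthetic point lies on the segment
    between the minority instance and its chosen neighbor."""
    rng = rng or random.Random(0)      # illustrative fixed seed
    delta = rng.random()
    return [a + delta * (b - a) for a, b in zip(x_i, x_nn)]

def enn_keep(x, label, neighbours):
    """ENN steps 7-9: keep x only if the majority vote of its three
    nearest neighbours (given as (vector, label) pairs) agrees with
    x's own label; otherwise the instance would be deleted."""
    votes = sum(1 for (_, l) in neighbours if l == label)
    return votes >= 2
```

Neighbor search itself (the k-nearest-neighbors queries in steps 3 and 8) is omitted here; any standard Euclidean k-NN routine would fill that role.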

Table 2 .
Description of the datasets that we have chosen.

Table 3 .
Summary of the Baseline models for comparison.

Table 5 .
F1-measure and AUC of proposed model with and without data balancing method on PROMISE.

Table 6 .
F1-measure and AUC of proposed model with and without FS method on AEEEM.