Feature Selection Using Golden Jackal Optimization for Software Fault Prediction

Abstract: A software defect, or fault, is a bug or mistake in a program that produces unintended results. Software flaws are programming errors caused by mistakes in the requirements, architecture, or source code. Finding and fixing bugs as soon as they arise is a crucial goal of software development, and it can be pursued in various ways. One prime approach is to select a small, optimal subset of features from a dataset, since feature selection can indirectly improve classification performance. A novel approach to feature selection (FS) has been developed that incorporates the Golden Jackal Optimization (GJO) algorithm, a meta-heuristic optimization technique that draws on the hunting tactics of golden jackals. Combined with four classifiers, namely K-Nearest Neighbor, Decision Tree, Quadratic Discriminant Analysis, and Naive Bayes, the algorithm selects a subset of relevant features from software fault prediction datasets. To evaluate the accuracy of this algorithm, we compare its performance with other feature selection methods: FSDE (Differential Evolution), FSPSO (Particle Swarm Optimization), FSGA (Genetic Algorithm), and FSACO (Ant Colony Optimization). FSGJO performs well in almost all cases and yields higher classification accuracy in many of them. The Friedman and Holm tests were used to determine statistical significance, and they verify that the suggested strategy is superior to prior methods in selecting an optimal set of attributes.


Introduction
Flaws in software can harm its reliability and quality, necessitating more maintenance and additional effort to rectify them. While testing results can help software development teams detect bugs, testing complete software modules is costly and time-consuming. As individuals perform various software development tasks, multiple software bugs can emerge over time, ultimately resulting in user dissatisfaction. Therefore, early identification of software flaws is one of the primary research areas of interest. Software fault prediction [1,2] is the process of spotting potential flaws or defects in software before they happen, using data analysis and machine learning methods. This helps developers identify and resolve potential problems effectively, producing software that is of higher quality and contains fewer flaws. There are several approaches to predicting software faults, such as statistical and machine learning methods [3,4]. These techniques analyze data from past software projects to identify patterns and trends that may indicate potential faults. The data used for this analysis may include testing and debugging logs, source code, and other relevant information. Commonly used methods for predicting software faults include DT [5,6], SVM [7], Neural Networks [8], LR [9], and many more. In addition to machine learning techniques, code analysis and testing are employed to predict software faults. These methods involve scrutinizing the software code for potential issues and performing various tests to detect and rectify bugs. Predicting software failures is a vital aspect of software development that can enhance the quality and reliability of software. By effectively identifying and resolving potential issues, developers can reduce the probability of bugs and improve the overall user experience.
In the field of machine learning, one of the critical tasks is Feature Selection (FS) [10,11]. The process entails determining the most significant features that can improve the precision of predictive models. Several methods are available for FS, each with advantages and limitations. FS is a crucial stage in software fault prediction because it helps locate the most important predictors of software faults. Feature selection is required when the available dataset is extensive and includes many features or variables, making it challenging to analyze and interpret the findings accurately. The most crucial characteristics should be chosen to simplify the analysis and increase the precision of the software fault prediction model. Various methods are employed for FS in software fault detection. One family is the filter methods [12], which use statistical properties such as variance or correlation with the target variable to select features. Examples of filter methods include chi-square [12] and ANOVA [12,13]. Wrapper methods [14], on the other hand, evaluate subsets of features by training and testing a predictive model on each subset. Recursive feature elimination [15] and forward/backward selection [16] are examples of wrapper methods. Embedded methods [17,18] combine feature selection with model training. Lasso [19] and ridge regression [20] are two examples of embedded methods used to identify the essential elements for prediction. The ridge penalty shrinks the regression coefficient estimates, but not exactly to zero. For this reason, the inability of ridge regression to perform variable selection has long been a source of criticism. As a result, penalized regression techniques such as elastic-net, adaptive elastic-net, and adaptive-lasso are more beneficial for variable selection. Meanwhile, dimensionality reduction methods aim to reduce the number of features while retaining the most relevant information for prediction. Examples of dimensionality reduction methods include Principal Component Analysis (PCA) [21] and t-distributed Stochastic Neighbor Embedding (t-SNE) [22]. It should be emphasized that the choice of method depends on the specific problem at hand. Therefore, it is often beneficial to try multiple methods and compare their performance to select the best one for a specific situation. Feature selection plays a crucial role in software fault prediction by identifying the most significant variables or features that influence the likelihood of software faults.
Researchers have used many different types of FS algorithms. Evolutionary-based algorithms [23] and swarm-based algorithms [24] have been used for feature selection. This study's primary objective is to enhance classification accuracy while minimizing errors. The Genetic Algorithm (GA) [25] is a computational technique guided by natural selection and genetic evolution in living organisms. This method maintains a group of potential solutions, uses genetic operators such as selection, crossover, and mutation to create new candidate solutions, and assesses their efficacy with a specified fitness function. Particle Swarm Optimization (PSO) [26,27] emulates the movement of a set of particles exploring a multi-dimensional search space, where each particle embodies a prospective solution to the problem. By continuously evaluating each particle's fitness, the algorithm adjusts its position and velocity according to individual and group experience. Its ultimate goal is to find the optimal solution. Differential Evolution (DE) [28] is an optimization algorithm that operates on a population of candidate solutions using a set of operators, including mutation, crossover, and selection, to converge toward the optimal solution. It uses a difference vector to create new solutions and iteratively improves them by comparing their fitness with that of the current population. Ant Colony Optimization (ACO) [29] is a metaheuristic optimization algorithm that simulates the foraging behavior of ants to find the optimal solution in a search space. The algorithm uses pheromone trails deposited by ants to guide the search toward the optimal solution. The performance of any of these algorithms depends on how the user sets its parameter values, which in turn determines its output.
Generally, feature selection is a common task in machine learning. The goal is to identify the subset of features (i.e., input variables) that is most informative for a given prediction task. Feature selection can be posed as an optimization problem that seeks the set of features maximizing the performance of the model while minimizing its complexity. The objective of this paper is to apply effective FS methods to uncover a subset of features that produces a precise and easy-to-interpret model. More broadly, feature selection aims to improve the quality and usability of the machine learning model for real-world applications.
The main contribution of this paper is a new feature selection method called feature selection using Golden Jackal Optimization (FSGJO) [30]. In FS, Golden Jackal Optimization (GJO) helps identify the appropriate set of features. GJO mimics the hunting behavior of golden jackals, which are known for their cooperative hunting strategy and their ability to adapt to changing environments. The algorithm consists of a population of candidate solutions, called jackals, that move around the search space looking for the ideal solution. The advantage of GJO is its ability to handle complex, high-dimensional optimization problems with multiple objectives. GJO is also less likely to get stuck in local optima, which can be a problem for other optimization algorithms: it combines exploratory and exploitative search strategies, which allows it to escape local optima and continue searching for better solutions. It is therefore expected to be an effective method for feature selection. The efficacy of the newly developed algorithm has been evaluated against several other FS algorithms, including FSGA, FSDE, FSPSO, and FSACO. FSGJO and the other FS methods have been applied with the classification models KNN, DT, NB, and QDA to check which model gives the best accuracy. The results of the new algorithm were compared with those of the other FS methods for statistical significance. Software developers can enhance the precision and efficiency of their software fault prediction models by carefully selecting the most appropriate features.
The paper is arranged as follows: Section 2 explores the literature on feature selection algorithms, Section 3 introduces the GJO algorithm, and Section 4 explains the FSGJO approach. Section 5 details the experimental results and analysis, followed by a statistical analysis in Section 6. Finally, the conclusion is presented in Section 7.

Literature Review
Software fault prediction methods involve analyzing software code, metrics, or historical data to detect possible faults that may occur during the development process or after the software has been released. These techniques can help developers proactively identify and fix potential defects before they become significant issues. After software development, software testing [31] is conducted to ensure that the software meets the defined requirements and to detect and resolve any defects or problems that may have been overlooked during the development phase. In summary, software fault prediction is typically performed during development, while software testing is performed after the software has been developed. Both techniques are essential for ensuring that high-quality software meets user requirements. In software fault prediction (SFP), classification [32–37] is a commonly used technique to predict whether a particular module or component of software contains a fault or defect.
Sonali and Divya [38] introduced a model, known as the Linear Twin Support Vector Machine (LSTSVM), to predict defective software modules. The model incorporates feature selection techniques and was evaluated on four datasets: CM1, PC1, KC1, and KC2. The study reported encouraging outcomes on the latter three datasets. Turabieh et al. [39], in 2019, took data from the PROMISE repository and iteratively applied three wrapper feature selection (FS) algorithms (BPSO, BGA, and BACO), obtaining an average result of 0.8358 over all datasets. Ezgi and Selma [40] proposed a hybrid approach using an artificial bee colony and differential evolution, which helps to select a relevant set of features without reducing accuracy. Ibrahim et al. [41] conducted a study in 2017 in which they used the BSA for feature selection and Random Forest as a classifier on PC1, PC2, PC3, and PC4, which resulted in practical outcomes. In the study [42], the authors employed PSO as a feature selection method and the bagging technique as a classifier. They used eleven classifiers and nine samples from the NASA repository. After comparing the findings with their methodology, the performance of all classifiers improved except for SVM, which was left for future work. On some of the well-known NASA datasets, the authors in [43] combined the Centroid Bat Approach and Support Vector Machine (SVM) methods (CBA-SVM), contrasted the outcomes with those of other approaches, and found that their strategy produced promising results. The authors in [44] employed the bagging technique with the GA and PSO metaheuristic methods to enhance performance. They found that the results of the two algorithms were comparable, but the combination with bagging exhibited superior outcomes. The authors in [45] used four datasets, namely PC1, PC2, PC3, and PC4, and tested a correlation-based feature selection (CFS) technique with five classifiers. They found that CFS with RF has the best performance. Many other FS algorithms, such as the electric field algorithm [46], the Jaya Algorithm [47], RHSFOS [48], FSBWO [49], and more, can be tested on software fault prediction datasets. The author in [50] performed feature selection using the firefly algorithm with SVM, KNN, and NB classifiers, achieving better classification accuracy with FS.
FS approaches have various parameters that control the resulting accuracy. Tuning these parameters is therefore necessary, and the appropriate values differ from problem to problem.

Summary of Golden Jackal Optimization Algorithm
The Golden Jackal Optimization (GJO) algorithm is a meta-heuristic optimization technique that draws ideas from the hunting pattern of golden jackals. This algorithm aims to emulate the hunting strategy of these opportunistic predators, known for their adaptability to diverse environments. By doing so, GJO seeks to solve optimization problems efficiently and effectively.
The primary stages of hunting for a golden jackal pair are outlined below:
1. Locating the prey and advancing towards it.
2. Trapping the prey and agitating it.
3. Attacking and capturing the prey.
Like other meta-heuristics, the GJO is a population-based technique that initiates with a randomized distribution of the first solution across the search space, as shown in Equation (1).
where $X_{min}$ is the lower bound, $X_{max}$ is the upper bound, and rand is a function whose value ranges between 0 and 1. The initial prey matrix ($X_{prey}$), generated during initialization, is represented in Equation (2); its two fittest members form the jackal pair.
As shown in Equation (2), the matrix involves p preys and q variables. The position (location) of each prey represents the parameters of a particular solution. As part of the optimization procedure, a fitness function is used to assess each prey. As described in Equation (3), the fitness values of all prey are gathered in the matrix F, where F denotes the objective function and $X_{p,q}$ represents the value of the q-th dimension of the p-th prey. Following the hunting patterns of golden jackals, the fittest prey is taken as the male jackal and the second fittest as the female jackal. The jackal pair acquires the positions of the corresponding prey.
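The initialization described above can be sketched as follows. This is a minimal illustration, assuming a minimization problem and a hypothetical `sphere` objective; the function names are chosen for this example and do not come from the paper.

```python
import random

def init_prey(p, q, x_min, x_max, rng):
    """Equation (1) per coordinate: X0 = X_min + rand * (X_max - X_min)."""
    return [[x_min + rng.random() * (x_max - x_min) for _ in range(q)]
            for _ in range(p)]

def sphere(x):
    """Hypothetical objective (lower is better), used only for illustration."""
    return sum(v * v for v in x)

def pick_jackal_pair(prey, objective):
    """Return (male, female): the fittest and second-fittest prey positions."""
    fitness = [objective(row) for row in prey]   # fitness values, as in Equation (3)
    order = sorted(range(len(prey)), key=fitness.__getitem__)
    return prey[order[0]], prey[order[1]]

rng = random.Random(0)
prey = init_prey(p=6, q=3, x_min=-5.0, x_max=5.0, rng=rng)   # 6 preys, 3 variables
male, female = pick_jackal_pair(prey, sphere)
```

Each row of `prey` is one candidate solution, so the p × q layout matches the prey matrix of Equation (2).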
Jackals are inherently adept at identifying and pursuing prey, but at times the prey proves elusive and manages to evade them. The jackals must then explore alternative prey, which is referred to as the exploration stage. The male jackal leads the hunt, with the female jackal following in pursuit. The updated positions of the male and female jackals are shown in Equations (4) and (5), where i corresponds to the current iteration. The prey's position vector is denoted by $X_{prey}$, $X_{FM}$ represents the location of the female jackal, and $X_M$ represents the location of the male jackal. The revised position of the male jackal with respect to the prey is denoted by $X_1$, and that of the female jackal by $X_2$. The energy the prey uses to evade is represented by e and is determined using Equation (6).
$$e = e_0 \cdot e_1 \tag{6}$$
where $e_0$ denotes the initial energy and $e_1$ indicates the decreasing energy of the prey.
Equations (7) and (8) involve r, a random number in the range (0, 1), and $c_1$, a constant set to 1.5. I signifies the maximum number of iterations, and i denotes the current iteration. The value of $e_1$ decreases linearly from 1.5 to 0 over the course of the iterations.
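The evading energy can be sketched in a few lines. The forms used for $e_0$ and $e_1$ below are assumptions inferred from the text (an initial energy drawn between −1 and 1, and a linear decay from $c_1$ = 1.5 to 0); the paper's Equations (7) and (8) may be written differently.

```python
import random

def evading_energy(i, max_iter, rng, c1=1.5):
    """e = e0 * e1 (Equation (6)), using assumed forms for e0 and e1."""
    e0 = 2.0 * rng.random() - 1.0      # assumed Equation (7): initial energy in [-1, 1)
    e1 = c1 * (1.0 - i / max_iter)     # assumed Equation (8): linear decay from c1 to 0
    return e0 * e1

rng = random.Random(1)
energies = [evading_energy(i, 100, rng) for i in range(101)]
```

The magnitude of every value in `energies` is bounded by the current $e_1$, so the envelope shrinks steadily to zero at the last iteration.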
Equations (4) and (5) involve the calculation of the distance between the jackal and the prey, represented as $X(i) - s1 \cdot X_{prey}(i)$. Depending on the evading energy of the prey, this distance is either added to or subtracted from the current position of the jackal. Both equations use a vector s1 of random numbers that follow the Levy distribution and represent the Levy movement. To simulate the movement of the prey in a Levy fashion, s1 is multiplied by the prey vector; s1 itself is computed as shown in Equation (9).
The Levy flight function, denoted by LF(x), is computed in Equations (10) and (11), where v ranges between 0 and 1 and δ is generally set to 1.5.
Finally, Equation (12) gives the updated positions of the jackals, obtained by averaging the results of Equations (4) and (5).
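Written out under the definitions above, one form of Equations (4), (5), and (12) that is consistent with the description is given below. This is a reconstruction from the surrounding text rather than a verbatim copy of the paper's equations; in particular, taking the absolute value of the distance term is an assumption.

```latex
\begin{align}
X_1(i) &= X_M(i) - e \cdot \left| X_M(i) - s1 \cdot X_{prey}(i) \right| \tag{4} \\
X_2(i) &= X_{FM}(i) - e \cdot \left| X_{FM}(i) - s1 \cdot X_{prey}(i) \right| \tag{5} \\
X(i+1) &= \frac{X_1(i) + X_2(i)}{2} \tag{12}
\end{align}
```

Because e can take either sign, the distance term is effectively added to or subtracted from the jackal's current position, which matches the behavior described for the evading energy.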
In the mathematical model, the cooperative hunting behavior of the male and female jackals is represented in Equations (13) and (14), respectively, where i denotes the current iteration, $X_{prey}$ refers to the position vector of the prey, $X_M(i)$ refers to the location of the male jackal, and $X_{FM}(i)$ refers to the location of the female jackal. $X_1(i)$ and $X_2(i)$ represent the revised positions of the male and female jackals with respect to the prey. Equation (6) is used to compute the evading energy of the prey, and Equation (12) is then used to update the positions of the jackals. To avoid getting stuck in local optima and to encourage exploration, Equations (13) and (14) incorporate the function s1. Computing s1 with Equation (9) is aimed at overcoming any sluggishness towards local optima, especially in the later iterations. This factor is akin to the obstacles that jackals face while pursuing prey in their natural habitat. During the exploitation stage, s1 serves the purpose of addressing these obstacles.
To sum up, the GJO algorithm starts by creating a random prey population as potential solutions. During each iteration of the algorithm, the jackals work together to anticipate the potential location of their prey. Every individual in the population adjusts its distance from the jackal pair according to the specified criterion. The parameter $e_1$ is decreased from 1.5 to 0 over time to balance exploration and exploitation. If the magnitude of e exceeds 1, the golden jackal pair moves farther from the prey to explore. In contrast, if the magnitude of e is less than 1, the pair moves closer to the prey to increase the chances of capturing it.

Feature Selection Using Golden Jackal Optimization
Feature selection refers to picking a smaller, relevant subset of predictor variables from a larger dataset, aiming to enhance the accuracy of machine learning models, decrease computational expenses, and reduce the chances of overfitting. Put differently, it is a method of determining the essential features with the highest impact on the target variable. Selecting features for classification is difficult; therefore, FSGJO uses GJO to select a relevant subset of features, which increases classification accuracy. The phases of FSGJO are explained below.

Initialization
GJO is a population-based approach, similar to various other metaheuristics; the search space is explored uniformly, starting from an initial solution. The initial solution is shown in Equation (15), where $X_{min}$ is the lower bound, $X_{max}$ is the upper bound, and rand() is a function whose value ranges between 0 and 1.
Suppose there are p preys and q variables; then an individual can be represented as shown in Equation (16), where 1 ≤ i ≤ p is the index of each prey. The population of prey is represented directly by a p × q matrix, $X_{prey} = (x_{ij})_{p \times q}$, as shown in Equation (17), where i = 1, 2, 3, ..., p and j = 1, 2, 3, ..., q; a row represents an individual prey, and a column represents a dimension (variable). $X_{prey}$ is the initial matrix of prey generated during initialization, and its two fittest members form the jackal pair (male jackal and female jackal, respectively).
The optimization process involves p preys and q variables. The position of each prey represents the parameters of a particular solution. To assess the performance of each candidate solution during the optimization process, a fitness function (also known as an objective function) is used, and the output values of this function for all solutions are stored in a matrix, as shown in Equation (18), where i = 1, 2, 3, ..., p and j = 1, 2, 3, ..., q. The fitness values of the prey are stored in the matrix $F_{ij}$, which also denotes the objective function, and the notation $X_{p,q}$ refers to the value of the p-th prey on the q-th dimension. The male and female jackals acquire the positions of the fittest and second-fittest prey, respectively, and these are known as the male jackal and female jackal prey positions.

Exploration Phase
In GJO, exploration is achieved by simulating the movement of a golden jackal pack searching for food in an unknown territory. Each jackal (solution) moves randomly within a specific range to explore the search space. This behavior helps prevent the algorithm from being trapped in local optima and facilitates the discovery of new solutions. Although the prey occasionally cannot be grabbed easily and manages to escape, it is in the nature of the jackal to be able to perceive and track it. Thus, if the prey is not easily caught, the jackals enter the exploration stage and search for other potential targets. During hunting, the male jackal takes the lead while the female jackal follows behind. The updated positions of the male and female jackals are shown in Equations (19) and (20), where $X_{prey}$ refers to the location vector of the prey, $X_M$ is the location of the male jackal, $X_{FM}$ is the location of the female jackal, and i represents the current iteration. $X_a$ is the revised position of the male jackal ($X_M$), and $X_b$ is the revised position of the female jackal ($X_{FM}$), both in relation to the prey. The prey's evading energy, $E_p$, is calculated using Equation (21), in which $E_{p0}$ represents the initial energy of the prey and $E_{p1}$ signifies the reduction of its energy.
$E_{p0}$ is calculated using Equation (22), and $E_{p1}$ using Equation (23), where r is a random number between 0 and 1 and $c_1$ is a constant equal to 1.5. The maximum number of iterations is represented by I, while i indicates the current iteration number. During the iterative process, the decreasing energy $E_{p1}$ falls linearly from 1.5 to 0, indicating the gradual depletion of the prey's energy.
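A short worked consequence of these definitions follows. It assumes Equations (22) and (23) take the forms implied by the text, that is, $E_{p0}$ drawn from $[-1, 1]$ and $E_{p1}$ decaying linearly from $c_1 = 1.5$ to 0, and it uses the rule that exploration requires $|E_p| > 1$.

```latex
\begin{align*}
E_{p0} &= 2r - 1, \qquad E_{p1} = c_1\left(1 - \frac{i}{I}\right), \qquad E_p = E_{p0} \cdot E_{p1} \\
|E_p| &\le E_{p1} = 1.5\left(1 - \frac{i}{I}\right) \\
|E_p| > 1 &\;\Rightarrow\; 1.5\left(1 - \frac{i}{I}\right) > 1 \;\Longleftrightarrow\; i < \frac{I}{3}
\end{align*}
```

Under these assumptions, $|E_p|$ can exceed 1 only during the first third of the run. With I = 300, for example, exploration moves are possible only while i < 100, and every later iteration is an exploitation step.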
Equations (19) and (20) are used to calculate the distance between the jackal and its prey as $X(i) - s1 \cdot X_{prey}(i)$. The energy level of the prey controls the jackal's movement, which shifts its location either higher or lower depending on how far it is from the prey. The vector s1 employed in Equations (19) and (20) is a series of random numbers that follows the Levy distribution, a specific type of probability distribution. This distribution is used to emulate the Levy movement, and s1 is multiplied by the prey vector to determine the movement of the prey in a Levy fashion. The calculation of s1 is shown in Equation (24).
The Levy Flight function (LF) is a mathematical function that simulates random movements in a search space. It is commonly used in optimization algorithms, as it is here. The process involves generating random numbers from the Levy distribution and using them to update the position of the search agent. The Levy distribution is a probability distribution with heavy tails, allowing for occasional large movements. This property is helpful in optimization because it enables search agents to explore distant areas of the search space that would be difficult to reach with small, incremental movements. LF can be calculated using Equation (25), where u and v are drawn from normal distributions with standard deviations $\sigma_u$ and $\sigma_v$, such that $u \sim \mathcal{N}(0, \sigma_u^2)$ and $v \sim \mathcal{N}(0, \sigma_v^2)$. $\sigma_u$ is calculated using Equation (26).
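A common way to generate such steps is Mantegna's method, sketched below. The formula for $\sigma_u$, the choice $\sigma_v = 1$, the stability index `beta = 1.5`, and the 0.05 scale applied to s1 are assumptions taken from the usual formulation of Levy flights in metaheuristics; they are not confirmed by the text above.

```python
import math
import random

def levy_sigma_u(beta=1.5):
    """Mantegna's sigma_u for stability index beta (assumed form of Equation (26))."""
    num = math.gamma(1.0 + beta) * math.sin(math.pi * beta / 2.0)
    den = math.gamma((1.0 + beta) / 2.0) * beta * 2.0 ** ((beta - 1.0) / 2.0)
    return (num / den) ** (1.0 / beta)

def levy_step(rng, beta=1.5):
    """One heavy-tailed step: u / |v|^(1/beta), u ~ N(0, sigma_u^2), v ~ N(0, 1)."""
    u = rng.gauss(0.0, levy_sigma_u(beta))
    v = rng.gauss(0.0, 1.0)          # sigma_v = 1 is an assumption
    return u / abs(v) ** (1.0 / beta)

def s1_vector(q, rng, scale=0.05):
    """Vector s1 of q Levy steps; the 0.05 scale is an assumption, not from the text."""
    return [scale * levy_step(rng) for _ in range(q)]

rng = random.Random(2)
steps = s1_vector(5, rng)
```

Most steps produced this way are small, with an occasional large jump when v happens to fall close to zero, which is the heavy-tail behavior the paragraph describes.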
Equation (27) gives the position update of the male and female jackals, which is the average of Equations (19) and (20).

Exploitation Phase
The simulation imitates the hunting behavior of a dominant male golden jackal that takes the lead and guides the pack towards the food source to exploit the prey. The harassment of the prey by the jackals gradually reduces its ability to evade, enabling the male and female jackal pair to surround the prey discovered earlier. Once the prey is contained, the jackals pounce on it and devour it. In the mathematical model, the cooperative hunting behavior of the jackals is represented in Equations (28) and (29), where i indicates the current iteration of the simulation. $X_{prey}$ is the location vector of the prey, while $X_M(i)$ represents the location of the male jackal and $X_{FM}(i)$ the location of the female jackal. $X_a(i)$ represents the revised position of the male jackal, and $X_b(i)$ the revised position of the female jackal, with respect to the prey. Equation (21) is employed to compute the evading energy of the prey, denoted as $E_p$. Equation (27) is then used to revise the positions of the jackals. In the exploitation phase, the function s1 is used in Equations (28) and (29) to promote exploration and prevent the algorithm from becoming trapped in local optima. Equation (24) is used to calculate s1, which helps to overcome sluggishness towards local optima, particularly in the final iterations. This element represents obstacles that hinder the jackals from moving towards the prey, such as those encountered on natural chasing paths. The function of s1 during the exploitation stage is to address these obstacles and facilitate the jackals' movement towards the prey.
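The two phases can be sketched as one update routine. The exploration form follows the distance expression stated in the text; the exploitation form, in which s1 scales the jackal rather than the prey, is an assumption borrowed from the commonly published GJO formulation, and the function names are hypothetical.

```python
def jackal_move(jackal, prey, s1, energy, explore):
    """Revised position of one jackal (male or female) relative to one prey.

    explore=True  : X_j - E * |X_j - s1 * X_prey|   (distance form stated in the text)
    explore=False : X_j - E * |s1 * X_j - X_prey|   (assumed exploitation form)
    """
    out = []
    for xj, xp, s in zip(jackal, prey, s1):
        dist = abs(xj - s * xp) if explore else abs(s * xj - xp)
        out.append(xj - energy * dist)
    return out

def update_prey(male, female, prey, s1, energy):
    """Equation (27): average of the male-based and female-based revised positions."""
    explore = abs(energy) > 1.0          # |E_p| > 1 selects the exploration phase
    xa = jackal_move(male, prey, s1, energy, explore)
    xb = jackal_move(female, prey, s1, energy, explore)
    return [(a + b) / 2.0 for a, b in zip(xa, xb)]

# Exploitation step (|E_p| < 1) and exploration step (|E_p| > 1) on toy values.
exploit_pos = update_prey([1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [0.5, 0.5], 0.5)
explore_pos = update_prey([1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [0.5, 0.5], 2.0)
```

Keeping the phase switch inside `update_prey` means the caller only supplies the current energy, mirroring how the energy level decides between the two behaviors.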

Fitness and Transfer Function
Before the fitness is computed and updated, the continuous values of the position matrix ($X_{prey}$) are converted into binary values using a transfer function. A sigmoid transfer function is used in this study, as shown in Equation (30). This S-shaped transfer function is chosen because it allows a smooth and continuous transition from real-valued positions to binary values, which can help to avoid premature convergence and improve the search performance of the optimization algorithm.
In this equation, X represents a position value in the position matrix ($X_{prey}$) before it is converted to binary. The sigmoid function maps the continuous value of X to a value between 0 and 1, which is then used to determine the corresponding binary value. The purpose of this conversion is to ensure that the position values are binary and can be used to calculate the fitness value of the prey.
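A minimal sketch of this conversion is shown below. It assumes that Equation (30) is the standard logistic sigmoid and that a bit is set when the sigmoid output is at least 0.5; the thresholding rule is not stated in the text and is only one of several common choices (comparing against a random draw is another).

```python
import math

def sigmoid(x):
    """Logistic function mapping any real x into [0, 1], written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def binarize(position, threshold=0.5):
    """Turn a continuous position into a 0/1 feature mask (assumed threshold rule)."""
    return [1 if sigmoid(x) >= threshold else 0 for x in position]

mask = binarize([-4.0, -0.3, 0.0, 0.8, 5.0])
```

A 1 in `mask` means the corresponding feature is kept, and a 0 means it is dropped before the classifier is trained.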
The fitness in this context refers to the prediction error of a machine learning (ML) classifier. It is determined by comparing the actual output of the classifier with its estimated output. To train the classifier, a data split size of 0.2 is used, meaning that 20% of the data is held out for testing while the remaining 80% is used for training. The fitness is calculated using Equation (31), where k ranges from 1 to m (the number of testing observations) and Err(k) is the prediction error for the k-th observation. The summation is divided by m to obtain the average prediction error.
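The fitness computation can be sketched as follows. A tiny 1-nearest-neighbor classifier stands in for the paper's classifiers (KNN, DT, NB, QDA), the toy data are synthetic, and the rule that an empty feature subset receives the worst score is an assumption of this example.

```python
import random

def split_80_20(n, rng):
    """Shuffle the indices 0..n-1 and hold out 20% of them for testing."""
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(0.8 * n)
    return idx[:cut], idx[cut:]

def predict_1nn(train_x, train_y, x):
    """Label of the nearest training row (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_x]
    return train_y[dists.index(min(dists))]

def fitness(mask, rows, labels, train_idx, test_idx):
    """Equation (31): mean of Err(k) over the m held-out observations."""
    keep = [j for j, bit in enumerate(mask) if bit]
    if not keep:                      # assumption: an empty subset gets the worst score
        return 1.0
    train_x = [[rows[i][j] for j in keep] for i in train_idx]
    train_y = [labels[i] for i in train_idx]
    errors = [int(predict_1nn(train_x, train_y, [rows[k][j] for j in keep]) != labels[k])
              for k in test_idx]
    return sum(errors) / len(errors)

# Toy data: feature 0 determines the label, feature 1 is pure noise.
rng = random.Random(4)
labels = [i % 2 for i in range(50)]
rows = [[y + rng.uniform(-0.2, 0.2), rng.uniform(-5.0, 5.0)] for y in labels]
train_idx, test_idx = split_80_20(len(rows), rng)
err_informative = fitness([1, 0], rows, labels, train_idx, test_idx)
err_noise = fitness([0, 1], rows, labels, train_idx, test_idx)
```

Selecting only the informative feature separates the two classes cleanly on this toy data, which is the kind of signal the optimizer relies on when comparing feature masks.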
The algorithm maintains two variables for the updating of fitness; the variables MaleJackalscore and FemaleJackalscore represent the fitness scores of the best male and female jackals found so far during the optimization process.f itness can be assumed as old fitness; MaleJackalscore and FemaleJackalscore can be assumed as new fitness.If the f itness of a jackal is lower than the current MaleJackalscore, it means that the jackal has a better f itness than the current male jackal, and thus, its position and score will replace the current male jackal's position and score.On the other hand, if the f itness of a jackal is higher than the MaleJackalscore but lower than the FemaleJackalscore, it means that the jackal has a better fitness than the current female jackal, but not better than the male jackal.In this case, its position and score will replace the current female jackal's position and score.After the fitness calculation, the fitness stored as shown in Equation (32), where the fitness array is denoted as f i , which consists of p elements f 1 , f 2 , f 3 , . . ., f p .
In each iteration of the algorithm, a random value between −1 and 1 is assigned to the initial energy E p0 .The value of E p0 is an indicator of the prey's physical strength, where a decrease from 0 to −1 indicates a decline in the prey's strength.An elevation from 0 to 1 denotes a boost in the prey's strength, whereas a decrease in E p is observed during the iterative process, as shown in Figure 1.If the magnitude of E p is greater than 1, it means that the jackal pairs are searching for prey in different areas, which suggests that the algorithm is in an exploration phase.On the other hand, if the magnitude of E p is less than 1, the algorithm switches to an exploitation phase and starts attacking the prey (Algorithms 1).
and female jackals found so far during the optimization process. can be assumed as old fitness;  and  can be assumed as new fitness.If the  of a jackal is lower than the current , it means that the jackal has a better  than the current male jackal, and thus, its position and score will replace the current male jackal's position and score.On the other hand, if the  of a jackal is higher than the  but lower than the , it means that the jackal has a better fitness than the current female jackal, but not better than the male jackal.In this case, its position and score will replace the current female jackal's position and score.After the fitness calculation, the fitness stored as shown in Equation (32), where the fitness array is denoted as   , which consists of  elements  1 ,  2 ,  3 , … ,   .
$F = (f_1, f_2, f_3, \ldots, f_n)$ (32)

In each iteration of the algorithm, a random value between −1 and 1 is assigned to the initial energy $E_{p0}$. The value of $E_{p0}$ indicates the prey's physical strength: a decrease from 0 to −1 indicates a decline in the prey's strength, and an increase from 0 to 1 denotes a boost in the prey's strength. The evading energy $E_p$ itself decreases during the iterative process, as shown in Figure 1. If the magnitude of $E_p$ is greater than 1, the jackal pairs are searching for prey in different areas, which means that the algorithm is in the exploration phase. If the magnitude of $E_p$ is less than 1, the algorithm switches to the exploitation phase and starts attacking the prey (Algorithm 1).

3. Let the male jackal position be $X_a$.
4. Let the female jackal position be $X_b$.
5. Determine the prey's fitness values.
…
… Using Equations (27)-(29), update the prey position.
17. Update the jackal position, $X(i) = \frac{X_a + X_b}{2}$.
18. Use the transfer function to convert the continuous values of the position $X_i$ into binary values using Equation (30).

The FSGJO algorithm is a metaheuristic optimization algorithm based on the hunting pattern of golden jackals. After randomly initializing a population of prey, the algorithm searches for the ideal solution through a series of iterations. The algorithm employs the idea of jackals, wherein the male jackal denotes the best solution found thus far, while the female jackal represents the second-best solution. The algorithm updates the position and evading energy of each prey based on the corresponding equations and then performs an exploration or exploitation phase depending on the value of the evading energy. It then updates the jackal position by taking the average of the male and female positions and converts the continuous values of the prey positions into binary values using a transfer function. The algorithm runs for a specified number of iterations and returns the male jackal position, which represents the best solution found. The detailed explanation of the FSGJO algorithm is presented in Algorithm 1, and a flowchart depicting it is shown in Figure 2.
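The building blocks described above (leader update, evading energy, phase selection, position averaging, and binarization) can be sketched in Python. This is only an illustrative sketch under stated assumptions: the linear decay of the energy with a constant `c1 = 1.5` follows the commonly cited GJO formulation and is assumed here, the sigmoid is assumed as a stand-in for the transfer function of Equation (30), and all function names are hypothetical. The position updates of Equations (27)-(29) are not reproduced.

```python
import math
import random


def update_leaders(fitness, position, male, female):
    """Update the best (male) and second-best (female) solutions.

    `male` and `female` are (score, position) tuples; lower fitness is better.
    """
    if fitness < male[0]:
        # Better than the current male: replace the male's score and position.
        return (fitness, list(position)), female
    if male[0] < fitness < female[0]:
        # Between male and female: replace the female's score and position.
        return male, (fitness, list(position))
    return male, female


def evading_energy(t, max_iter, rng, c1=1.5):
    """Evading energy E_p = E_1 * E_p0 (assumed linear decay of E_1)."""
    e_p0 = 2.0 * rng.random() - 1.0      # initial energy, random in [-1, 1)
    e_1 = c1 * (1.0 - t / max_iter)      # decreases as iterations proceed
    return e_1 * e_p0


def phase(energy):
    """Exploitation when |E_p| < 1, exploration otherwise."""
    return "exploitation" if abs(energy) < 1.0 else "exploration"


def average_position(x_a, x_b):
    """Jackal position update: X(i) = (X_a + X_b) / 2."""
    return [(a + b) / 2.0 for a, b in zip(x_a, x_b)]


def binarize(position, rng):
    """Map continuous positions to a 0/1 feature mask (assumed sigmoid transfer)."""
    mask = []
    for x in position:
        x = max(-60.0, min(60.0, x))     # clamp to avoid overflow in exp
        prob = 1.0 / (1.0 + math.exp(-x))
        mask.append(1 if rng.random() < prob else 0)
    return mask


if __name__ == "__main__":
    rng = random.Random(0)
    x_new = average_position([0.8, -1.2, 2.0], [0.4, -0.6, 1.0])
    print(phase(evading_energy(10, 200, rng)), binarize(x_new, rng))
```

A bit equal to 1 in the returned mask means that the corresponding feature is kept when the classifier is trained.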

Results
This section provides information on the datasets utilized in the experiment, the experimental setup, and the analysis of the obtained results.

Datasets
Most of these datasets are publicly available and have been used in various software engineering and machine learning research studies. Some of these are commonly used benchmark datasets, and their sources can be found in multiple academic publications or online repositories. The PROMISE repository provides a collection of datasets for various software engineering tasks, including software fault prediction. These datasets are primarily meant for research purposes and are frequently employed to assess the efficacy of different software fault prediction models. The datasets in the PROMISE repository are sourced from various software projects and programming languages. Each dataset usually contains metrics and features that characterize the analyzed software, along with details on the occurrence or non-occurrence of faults in the software. Some examples of the metrics and features in these datasets are:

• LOC (Lines of Code): This metric measures the number of lines of code in the software being analyzed.

• Cyclomatic Complexity: This metric measures the complexity of the software's control flow and can help identify potential trouble spots.

• Code Churn: This metric measures the software's change over time and can help identify modules or components that may be more prone to faults.

• Code Coverage: This metric measures the extent to which the software's code has been tested and can help identify code areas that may be more likely to contain faults.

• Halstead's Complexity Measures: These metrics measure various aspects of the complexity of the software's code, such as the number of distinct operators and operands, and can help identify potential trouble spots.
Each dataset in the PROMISE repository typically includes a description of the software project, as well as information on the available metrics and features. The datasets may also include information about the presence or absence of faults in the software, such as the number of bugs discovered during testing or the number of incidents reported by users. Researchers can use these datasets to train and test different software fault prediction models. Analyzing the performance of various models on a common dataset makes it possible to assess the efficacy of different fault prediction approaches and to identify scope for further improvement. Here, 12 NASA datasets from the PROMISE repository have been used: KC1, PC5, MC1, JM1, PC1, MW1, PC2, PC3, KC3, PC4, CM1, and MC2. Table 1 provides the specifics of the datasets.

In dimensionality reduction, datasets raise issues such as correlation and collinearity. Correlation is a statistical measure that describes the strength and direction of the relationship between two quantitative variables. Feature selection, which aims to select a subset of the most important features of a dataset, can be used to deal with the correlation problem in data analysis and modeling. By selecting only the most important features, the impact of correlated variables is reduced and the performance of the models can improve. The algorithms used in the manuscript, namely FSGA, FSPSO, FSDE, FSACO, and the proposed FSGJO, can effectively deal with correlation and other complex dependencies between features.

Table 2 shows the correlation data for the different datasets. The correlation coefficient ranges from −1 to +1, where −1 indicates a perfect negative correlation (as one variable increases, the other decreases), +1 indicates a perfect positive correlation (as one variable increases, the other also increases), and 0 indicates no correlation between the variables. For example, the first row of Table 2 shows the correlation data for the PC1 dataset. The average correlation value for this dataset is 0.378607, indicating a moderate positive correlation between the variables. The minimum correlation value is −0.33978, indicating that some variables are negatively correlated, and the maximum correlation value is 1, indicating that at least one pair of variables is perfectly positively correlated. The standard deviation of the correlation values is 0.333734, indicating that the correlation values vary widely within the dataset. Each of the remaining rows reports the same statistics for another dataset. The correlation values help identify patterns in the data, such as strong positive or negative correlations, weak correlations, or no correlation. This information is useful for statistical analysis, for example to identify which variables are most strongly related to each other, and for modeling, where it can inform which features to retain.
Collinearity is a problem that occurs when two or more features in a dataset are highly correlated. In the context of feature selection, collinearity can make it difficult to identify the most important features, because the coefficient estimates for all of the correlated features may be large even if only one of the features is truly independent. There are a number of ways to deal with collinearity. One way is to use a correlation matrix to identify correlated features. The correlation matrices for the MC1 and JM1 datasets are shown in Figures 3 and 4, respectively. Once correlated features have been identified, feature selection can be used to remove them from the dataset. Removing correlated features improves the stability of the coefficient estimates and the accuracy of the model. In this work, the feature selection methods FSGA, FSPSO, FSDE, and FSACO and the newly proposed FSGJO are used for this purpose.
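The correlation statistics and the idea of removing highly correlated features can be illustrated with a short Python sketch. It is a simple filter shown for illustration only and is not the wrapper-based selection used by FSGJO. The function names, the feature names, the 0.9 threshold, and the choice to summarize only off-diagonal feature pairs are all assumptions.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric columns."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: correlation undefined, treated as 0 here
    return cov / (sx * sy)


def correlation_summary(columns):
    """Mean, min, max, and standard deviation over all feature pairs."""
    values = [pearson(columns[i], columns[j])
              for i, j in combinations(range(len(columns)), 2)]
    return {"mean": mean(values), "min": min(values),
            "max": max(values), "std": pstdev(values)}


def drop_correlated(columns, names, threshold=0.9):
    """Greedily keep a feature only if it is not highly correlated
    (in absolute value) with any feature kept so far."""
    kept = []
    for j in range(len(names)):
        if all(abs(pearson(columns[j], columns[i])) < threshold for i in kept):
            kept.append(j)
    return [names[i] for i in kept]


if __name__ == "__main__":
    # Toy columns with hypothetical feature names.
    loc = [1, 2, 3, 4]
    loc_scaled = [2, 4, 6, 8]      # perfectly correlated with loc
    complexity = [2, 1, 4, 3]
    cols = [loc, loc_scaled, complexity]
    print(correlation_summary(cols))
    print(drop_correlated(cols, ["loc", "loc_scaled", "complexity"]))
```

In the toy data, the second column is a scaled copy of the first, so the filter keeps only one of the two.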

Experimental Condition
In this study, the experiments were conducted using VS Code and Python version 3.9.12. The laptop used for the experiments was equipped with an AMD Ryzen 7 5000 series processor with a clock speed of 1.80 GHz, 16 GB of RAM, and AMD Radeon graphics. The experimental parameters were set as follows: β was set to 1.5, the population size was set to 30, and the maximum number of iterations was set to 200.

Experimental Analysis
To predict software faults, the 12 datasets listed in Table 2 were used. The classifiers employed for the experiments are DT, KNN, NB, and QDA. The datasets were randomly split into training and testing sets in a ratio of 80:20. To obtain consistent estimates of each algorithm's performance, the experiments were run 10 times, and the average accuracy results obtained from FSGJO and the other FS models are presented in Table 3. The performance of the novel FSGJO algorithm is evaluated against the FS techniques FSPSO, FSGA, FSACO, and FSDE on the 12 datasets obtained from NASA's open repository. Table 3 reports the average classification accuracy of each classifier on each dataset with and without feature selection, together with the mean number of features chosen by each FS technique.

From the results, FSGJO performs well on most of the datasets, with a few exceptions. For the PC4 dataset, the highest accuracy is achieved by the FSGA model, but the difference between the accuracy of the FSGJO and FSGA models is very minor. For the MC1 and MC2 datasets, the average accuracy of FSGA, FSDE, and FSPSO for the QDA classifier is the same. For the other datasets, the accuracies of the models are close to one another, with small differences in either direction. However, in the majority of cases FSGJO has the greater average accuracy.
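The evaluation protocol described above (a random 80:20 split repeated over 10 runs, with the accuracies averaged) can be sketched as follows. To keep the sketch self-contained, a plain 1-nearest-neighbor classifier written in pure Python stands in for the classifiers used in the study; the function names, the seed, and the toy data are hypothetical.

```python
import random


def split_indices(n, test_ratio, rng):
    """Randomly split range(n) into train and test index lists."""
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = max(1, round(n * test_ratio))
    return idx[n_test:], idx[:n_test]


def knn1_predict(train_x, train_y, x):
    """Label of the nearest training sample (squared Euclidean distance)."""
    nearest = min(range(len(train_x)),
                  key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)))
    return train_y[nearest]


def mean_accuracy(X, y, mask, runs=10, test_ratio=0.2, seed=0):
    """Average test accuracy over repeated random hold-out splits,
    using only the features whose mask bit is 1."""
    cols = [j for j, keep in enumerate(mask) if keep]
    reduced = [[row[j] for j in cols] for row in X]
    rng = random.Random(seed)
    accuracies = []
    for _ in range(runs):
        train, test = split_indices(len(reduced), test_ratio, rng)
        train_x = [reduced[i] for i in train]
        train_y = [y[i] for i in train]
        correct = sum(knn1_predict(train_x, train_y, reduced[i]) == y[i]
                      for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / runs


if __name__ == "__main__":
    # Toy data: feature 0 separates the classes, feature 1 is noise.
    X = [[i * 0.1, (i * 37) % 100] for i in range(10)]
    X += [[10 + i * 0.1, (i * 53) % 100] for i in range(10)]
    y = [0] * 10 + [1] * 10
    print(mean_accuracy(X, y, mask=[1, 0]), mean_accuracy(X, y, mask=[1, 1]))
```

In a wrapper-based method, a value derived from this accuracy (for example, the classification error) would serve as the fitness of a candidate feature mask.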
The fitness error plots of the four classifiers DT, KNN, NB, and QDA are shown in Figures 5-8, respectively. Each figure contains the error plots of every FS model (FSGA, FSPSO, FSDE, FSACO, and FSGJO) for all 12 datasets. The error plots show that the curve for FSGJO is the lowest in many cases, coincides with other FS models in some cases, and lies above them in others. In Figure 5 (DT classifier), the fitness plots of FSGA and FSGJO coincide at 165 iterations in the PC3 dataset, and those of FSACO and FSGJO coincide after 145 iterations in the MW1 dataset. For the KNN classifier, the error plots of FSACO, FSDE, and FSGJO coincide in CM1 and PC2 after 50 and 150 iterations, respectively. For the NB classifier, the error plots coincide in the KC3 dataset between FSACO and FSDE, in the MC2 and MC1 datasets between FSACO and FSGJO, and in the PC3 dataset between FSACO and FSGJO for some number of iterations. Lastly, for the QDA classifier, the error plots coincide in the MC1 dataset for all FS models and in the MC2 dataset between FSACO, FSDE, and FSGJO. For the remaining datasets and classifiers, the error plot of FSGJO is lower than the others, which shows that FSGJO performs better than the other models.

The parameters used in the various FS models are listed in Table 4. The population size is set to 30 for all models, and the maximum number of iterations is 200. The value of β is 1.5 in GJO. For ACO, the values of alpha, beta, and rho are set to 1, 0.1, and 0.2, respectively. In GA, the mutation rate (MR) is 0.01 and the crossover rate (CR) is 0.8. For Differential Evolution (DE), the commonly used values of 0.9 for the crossover rate (CR) and 0.8 for the scaling factor (SF) are used. The weights $W_{min}$ and $W_{max}$ in PSO are set to 0.4 and 0.9, respectively.

Statistical Analysis
Statistical analysis [51] is an important component of machine learning (ML), as it helps to make sense of data by identifying patterns and relationships. By using the same parameters, the proposed model can be compared with other models in terms of performance. Some common statistical techniques used in ML include regression analysis, cluster analysis, principal component analysis (PCA), hypothesis testing, and Bayesian analysis.

In statistical hypothesis testing, there are two main types of tests: parametric and non-parametric. Parametric statistical testing assumes that the data being analyzed follow a particular probability distribution, most commonly the normal distribution. This assumption allows for the use of a range of statistical tests that are more powerful than their non-parametric counterparts. Parametric tests require several assumptions to be met, including normality, homogeneity of variance, and independence of observations. Violations of these assumptions can lead to inaccurate outcomes and invalid conclusions, so it is imperative to verify them beforehand and to consider non-parametric tests if they are not met.

Non-parametric statistical testing does not rely on a particular probability distribution assumption for the data being analyzed. Instead, non-parametric tests are based on the ranks or orderings of the data, making them more robust to violations of assumptions and more widely applicable than parametric tests. Non-parametric tests do not require assumptions of normality or homogeneity of variance, making them more versatile than parametric tests. However, they may be less powerful than parametric tests when the assumptions of the parametric tests are met. Selecting a statistical test that is tailored to the research question and the characteristics of the data being analyzed is crucial. Here, the Friedman test has been used, which is a non-parametric statistical test for comparing three or more related groups.
The Friedman test determines whether there are significant differences between groups by comparing their average ranks. The null hypothesis states that there are no differences between the groups, while the alternative hypothesis states that at least one group differs significantly from the others. The test statistic used in the Friedman test is based on the chi-squared distribution, and significance is determined using the appropriate critical value from the chi-squared distribution table. If the computed test statistic is higher than the critical value, the null hypothesis is rejected, signifying a significant difference between the groups. One important consideration when using the Friedman test is that it is an omnibus test, meaning that it only determines whether there is a significant difference between the groups as a whole. Additional tests, such as post hoc analyses, are necessary to identify the particular groups that differ significantly from one another.
In hypothesis testing, the null hypothesis ($H_0$) postulates that there is no significant difference between the models, whereas the alternative hypothesis ($H_1$) proposes the contrary. When the p-value is less than the significance level, there is a difference between two or more models, and the null hypothesis can be rejected. The Friedman test assigns a rank to each model based on its classification performance in the experiment. The best-performing model receives the highest rank, which is the lowest rank number, and the worst-performing model receives the lowest rank, which is the highest rank number. Table 5 presents the ranks of the models (FSACO, FSDE, FSGA, FSGJO, FSPSO, and Without FS) with the different classifiers (KNN, DT, NB, and QDA), showing the AvgRankModels calculated with Equation (33). Table 5 also presents the AvgRankDatasets, obtained as the mean of all ranks for each model (Without FS, FSGA, FSPSO, FSDE, FSACO, and FSGJO) across all datasets using Equation (34). The computation of AvgRankDatasets involves adding up the ranks of all classification models used and dividing the sum by the total number of models. Table 6 reports the average ranks of the feature selection (FS) models employed in the experiment. To calculate the average rank performance of a group of models and datasets, the mean rank of each one is added up and divided by the total number of models.
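The ranking step can be sketched in Python as follows. The sketch assumes the usual convention for Friedman ranking, in which the model with the highest accuracy on a dataset receives rank 1 and tied models share the average of the ranks they span; the function names and the accuracy values are hypothetical and are not taken from Table 3 or Table 5.

```python
def rank_row(accuracies):
    """Ranks for one dataset: rank 1 goes to the highest accuracy,
    and tied models share the average of the ranks they occupy."""
    order = sorted(range(len(accuracies)), key=lambda i: -accuracies[i])
    ranks = [0.0] * len(accuracies)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next model has the same accuracy.
        while (end + 1 < len(order)
               and accuracies[order[end + 1]] == accuracies[order[pos]]):
            end += 1
        shared = (pos + end) / 2.0 + 1.0   # average of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def average_ranks(accuracy_table):
    """Average rank of each model (column) over all datasets (rows)."""
    rows = [rank_row(row) for row in accuracy_table]
    n_datasets = len(rows)
    return [sum(row[j] for row in rows) / n_datasets
            for j in range(len(rows[0]))]


if __name__ == "__main__":
    # Hypothetical accuracies: rows are datasets, columns are models.
    table = [[0.80, 0.90, 0.85],
             [0.90, 0.90, 0.80]]
    print(average_ranks(table))
```

A lower average rank therefore indicates a model that performs better across the datasets.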
$\text{AvgRankDatasets} = \frac{\text{AvgRankModels}}{\text{total no. of datasets}}$ (34)

The value of $\chi^2_F$ is computed from the AvgRankModels using Equation (35) and is determined to be 15.93. The variables M and N represent the number of datasets and models in the experiment, respectively. The resulting value of the Friedman statistic, $F_F$, is calculated as 3.98 based on Equation (36). The analysis in this instance is conducted using 12 datasets and 6 models. The critical value is calculated as 2.449 with (6 − 1) and (6 − 1) × (12 − 1) degrees of freedom, and the significance level α is 0.05.

The density plot in Figure 9, with (5, 55) degrees of freedom, shows the critical value of 2.269. Because the Friedman statistic ($F_F$ = 3.98) is higher than the critical value, the null hypothesis ($H_0$) is rejected. This implies that there is a significant difference between at least two models.

Upon rejecting the null hypothesis in the Friedman test, which implies the presence of differences among multiple models, it is standard practice to conduct a post hoc test to identify the specific models that exhibit significant differences. Several post hoc tests can be used, such as the Wilcoxon signed-rank test, the Holm-Bonferroni method, and the Nemenyi test. The choice of the post hoc test depends on the specific research question and the characteristics of the data. The main goal of the post hoc test is to provide more detailed information about the differences between the models and to identify which models are significantly better than others.

The Holm procedure [52-56] is a multiple comparison procedure that can be used as a post hoc test after conducting a statistical test like the Friedman test. It has been shown to have more statistical power than other methods while maintaining an acceptable Type I error rate. Controlling the Type I error rate is important in statistical inference because it helps ensure that the conclusions drawn from the data are accurate and reliable. The procedure uses the z-value and p-value to evaluate the performance of each individual model. The calculation of z is performed using Equation (37), and the corresponding p-value is then obtained from the normal distribution table.
The value of z in this context refers to the z-score, which is calculated using Equation (37). In the formula, N and M represent the number of models and datasets used in the experiment, respectively. The average ranks of the x-th and y-th models are denoted by $AR_x$ and $AR_y$, respectively. Table 7 shows the comparison of all the models using the z-value, the p-value, and α/(N − i), where the significance level α is 0.05. Table 7 displays the outcomes of the Holm test for five pairwise comparisons: FSGJO:WFS, FSGJO:FSGA, FSGJO:FSPSO, FSGJO:DE, and FSGJO:ACO. The adjusted alpha level α/(N − i) is calculated from the rank of each comparison and the number of models. It is the significance level adjusted for the multiple comparisons in the test and is used to determine whether a result is statistically significant after accounting for the number of tests conducted. The table indicates that the p-values are lower than or equal to the adjusted alpha level α/(N − i) for all comparisons except FSGJO:FSGA. These findings indicate that FSGJO is superior to the other models with statistical significance, except for FSGA, for which the difference in performance is not statistically significant.
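The quantities used in this section can be checked with a short Python sketch. Equations (35)-(37) are not reproduced in the text above, so the sketch assumes the standard forms: the Friedman chi-square computed from average ranks, its F-distributed (Iman-Davenport) form, and the z-score $z = (AR_x - AR_y)/\sqrt{N(N+1)/(6M)}$ with N models and M datasets. The two-sided p-value and all function names are also assumptions, and the p-values in the usage example are hypothetical.

```python
import math


def friedman_chi2(avg_ranks, n_datasets):
    """Friedman chi-square from the average ranks of the models."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12.0 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)


def friedman_f(chi2, n_datasets, n_models):
    """F-distributed form with (N-1) and (N-1)(M-1) degrees of freedom."""
    return (n_datasets - 1) * chi2 / (n_datasets * (n_models - 1) - chi2)


def z_score(ar_x, ar_y, n_models, n_datasets):
    """z-value for the difference between two average ranks."""
    std_err = math.sqrt(n_models * (n_models + 1) / (6.0 * n_datasets))
    return (ar_x - ar_y) / std_err


def two_sided_p(z):
    """Two-sided p-value of z under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def holm(p_values, alpha=0.05):
    """Holm step-down: compare the i-th smallest p-value with
    alpha / (m - i + 1), where m is the number of comparisons,
    and stop at the first comparison that is not rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for step, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - step):
            rejected[idx] = True
        else:
            break
    return rejected


if __name__ == "__main__":
    # With the reported chi-square of 15.93, 12 datasets, and 6 models:
    print(round(friedman_f(15.93, 12, 6), 2))   # 3.98
    # Hypothetical p-values for five pairwise comparisons.
    print(holm([0.001, 0.04, 0.02, 0.30, 0.011]))
```

With 6 models there are 5 comparisons against FSGJO, so the thresholds α/5, α/4, …, α/1 in the sketch correspond to α/(N − i) for i = 1, …, 5.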

Conclusions
In this study, a novel method for feature selection called FSGJO is introduced, which employs metaheuristic optimization using the GJO algorithm to efficiently identify the optimal set of features. The FSGJO feature selection technique strives to choose the most significant features within the solution space, aiming to exclude redundant and irrelevant ones. This research assesses the efficiency of the FSGJO method on 12 different datasets using four classifiers (DT, KNN, NB, and QDA). The primary objective is to compare FSGJO's performance with that of existing feature selection models such as FSPSO, FSGA, FSDE, and FSACO, which have different benchmark dimensions. Statistical analysis using the Friedman test indicated that at least two models differed significantly from one another, so the null hypothesis was rejected, which led to the Holm test. Based on the results, FSGJO displayed a better performance than the other feature selection methods, both in accurately classifying data and in efficiently eliminating features that were not useful. The advantage of GJO is its ability to handle high-dimensional optimization problems with multiple objectives and to avoid local optima by combining exploratory and exploitative search strategies. The only limitation of the proposed model is that its parameters must be adjusted according to the given problem. The proposed method can be applied to other fields such as medical data and gene data.

Mathematics 2023, 11

Figure 5. Fitness Error Plot for DT.

Figure 6. Fitness Error Plot for KNN.

Figure 7. Fitness Error Plot for NB.

Table 1. Details of the datasets.

Table 2. Correlation data for different datasets.

Table 3. Comparison between different FS algorithms.

Table 4. Parameters used in different FS models.

Table 5. FS algorithm ranks for 12 datasets using the Friedman test.

Table 7. Test results of the Holm method.