Code Smell Detection Using Ensemble Machine Learning Algorithms

Abstract: Code smells result from not following software engineering principles during software development, especially in the design and coding phases, and they lead to low maintainability. Code smell detection can therefore help to evaluate the quality and maintainability of software. Many machine learning algorithms have been used to detect code smells. In this study, we applied five ensemble machine learning and two deep learning algorithms to detect code smells. Four code smell datasets were analyzed: the Data class, God class, Feature-envy, and Long-method datasets. In previous works, machine learning and stacking ensemble learning algorithms were applied to these datasets and the results were acceptable, but there is scope for improvement. A class balancing technique (SMOTE) was applied to handle the class imbalance problem in the datasets, and the Chi-square feature extraction technique was applied to select the most relevant features in each dataset. All five ensemble algorithms obtained the highest accuracy, 100%, for the Long-method dataset with different selected sets of metrics, while the poorest accuracy, 91.45%, was achieved by the Max voting method for the Feature-envy dataset with the selected set of twelve metrics.


Introduction
Code smells indicate poor design and implementation choices that may reduce code understandability and increase change-proneness and fault-proneness [1]. Thus, a code smell is a characteristic in a program's source code that indicates a bigger issue. Code smells occur when code is not written according to essential principles [2]. Detecting code smells is a difficult task for software engineers. W. Kessentini et al. [3] and Fontana et al. [4,5] discussed various techniques and tools to identify code smells. Every approach has a different outcome [6,7].
Because of the large size and high complexity of modern software, its quality is declining [8]. Designers must follow the development cycle, as well as functional and non-functional requirements, to ensure software reliability [9]. Commonly, developers focus only on functional requirements, whereas non-functional requirements, such as conciseness, reliability, progression, manageability, and renewability, are ignored [10]. The absence of non-functional requirements leads to degradation of software reliability, which increases the maintenance cost and the software's complexity. Fowler et al. [11] describe how to transform poorly built code into a well-designed implementation using a refactoring paradigm.
Various experiments have been carried out to investigate the effects of code smells on software, and unwanted consequences have been found [12-14]. Software restructuring is generally required to eliminate them. Olbrich et al. [15], Khomh et al. [16], and Deligiannis et al. [17] analyzed the impact of code smells on software design by studying the number of modifications required within the software in the future. They also checked whether classes affected by code smells need to be altered more frequently and require more maintenance. Li, W. et al. [18] considered the influence of bad smells on the future failure probability of classes. Their research found that software modules affected by code smells have a greater rate of failure than other modules. Castillo et al. [19] investigated the negative impact of the God class (GC) on utilization and discovered that removing the GC reduces the cyclomatic complexity of the software. Guggulothu et al. [20] worked on multi-label code smell datasets. Tomasz et al. [21] carried out a systematic literature review to see how far research on code smell detection is reproducible.
This research focuses on detecting code smells using ensemble machine learning and deep learning approaches with software metrics. Metrics play an important role in code smell detection by characterizing the source code.

Motivation
In the literature [7,20-22], many machine learning techniques (MLTs) and feature selection approaches (FSAs) have been applied to code smell datasets for detecting code smells [7]. Moreover, the results of most of these techniques are good [20,22,23]. However, in most of these papers [7,20-22] the authors do not examine the effect of different subsets of metrics on the performance accuracy of code smell detection. To fill this gap, in this paper we extracted different subsets of metrics (eight, nine, ten, eleven, and twelve metrics, and the whole set of metrics) with the help of the Chi-square FSA; the ensemble machine learning and deep learning algorithms were then applied to each set of metrics to find their effects on the models' accuracy.

Contributions
This study introduces a code smell detection technique based on ensemble machine learning and deep learning approaches. We considered four code smell datasets from Fontana et al. [7]: God Class (GC), Data Class (DC), Long-method (LM), and Feature-envy (FE). The GC and DC datasets are class-level datasets, while the FE and LM datasets are method-level datasets. Five ensemble learning algorithms (AdaBoost, Bagging, Max voting, Gradient boosting, and XGBoost) and two deep learning algorithms (Artificial neural network and Convolutional neural network) were implemented on these datasets.
Seven performance measures were calculated to evaluate the performance of the ensemble methods: sensitivity, accuracy, positive predictive value (PPV), F-measure, area under the receiver operating characteristic curve (AUC_ROC_score), Matthews correlation coefficient (MCC), and the Cohen_Kappa_score.

Research Questions
This study has the following research questions.
RQ1. Which ensemble or deep learning algorithm is best for detecting code smells?
Motivation. Alazba, A. et al. [22] applied machine learning and stacking ensemble algorithms, and Tushar Sharma et al. [24] applied deep learning for code smell detection. They found that the stacking ensemble and deep learning algorithms obtained better performance accuracy than the MLTs. For that reason, we examine the impact of other ensemble learning and deep learning algorithms on code smell detection.
RQ2. Does a set of metrics chosen by the Chi-square FSA improve the performance of code smell detection?
Motivation. Mohammad Y. Mhawish et al. [25,26], Pushpalatha M.N. [27], and Dewangan et al. [23] presented the impact of different FSAs on the performance measurements. They found that using an FSA improved the accuracy, although these authors did not examine the effect of various subsets of metrics on the algorithms' performance. Therefore, in this study, the Chi-square FSA is applied to improve the algorithms' performance and to identify which subset of metrics plays the larger role in the code smell detection procedure.

RQ3. Does the SMOTE class balancing technique improve the performance of code smell detection?
Motivation. Sushant Kumar Pandey et al. [28] applied a random sampling method to solve the class imbalance issue and found that it improved their results. Sofien Boutaib et al. [29] applied an ADIODE method to identify code smells with class labels and also obtained good results. This motivated us to apply the SMOTE method to study the impact of the class imbalance problem in our work.
The outline of the paper is as follows: Section 2 describes the literature review. Section 3 describes the datasets used and the research framework. Section 4 depicts the implementation work. Section 5 discusses the result analysis and threats to validity. Section 6 concludes the study.

Literature Review
Various approaches have been introduced in the literature for code smell detection. Fontana et al. [30] proposed an MLT to classify code smell severity; this method can assist developers in ordering classes or functions, and the code smell severity is classified using multinomial classification and regression methods. M. N. Pushpalatha et al. [27] proposed bug-severity-report prediction for closed-source datasets. For this purpose, they took the PITS dataset for NASA projects from the PROMISE repository. They applied ensemble approaches and two dimensionality reduction techniques (Chi-square and information gain) to improve the accuracy. They found that the bagging approach performs better than the other ensemble algorithms.
Aladdin et al. [31] presented eight MLTs to calculate the severity level of software bug reports in closed-source projects. These bug reports are associated with various closed-source projects developed by the INTIX company, based in Amman, Jordan. They built their dataset from the JIRA bug-tracking system. They found that the decision tree algorithm achieved better performance than the other MLTs.
Pushpalatha et al. [32] presented ensemble algorithms using supervised and unsupervised classification for bug severity reports on closed-source datasets. They used the information gain and Chi-square FSAs to select the appropriate features from the severity dataset. They obtained accuracies ranging from 79.85% to 89.80% for the Pits C dataset.
As we have seen, most of the articles above mainly describe code smell recognition with MLTs. Most of the previous studies examined only a few systems and applied MLTs to them. Some authors applied parameter optimization approaches and various kinds of FSAs. Dewangan et al. [23] applied six MLTs, a grid-search-based tuning optimization approach, and wrapper-based and Chi-square FSAs to select the appropriate features from each dataset; they obtained 100% accuracy with the logistic regression model on the LM dataset, but the accuracy on the other datasets, i.e., DC, GC, and FE, was not as good.
Table 1 summarizes the different tools and methods used to detect code smells. It covers two types of code smell datasets: simple code smell and severity code smell datasets. In Table 1, the works [5,7,20,25,26,36,37] used the same datasets as Fontana et al. [7], that is, Data class, God class, Feature envy, and Long method.

Proposed Research Framework
In this work, we build a model for detecting code smells using ensemble methods.
The steps of this framework are shown in Figure 1. First, we selected the code smell datasets [7]. Then we applied the min-max normalization technique for feature scaling, followed by the SMOTE class balancing technique. Next, we applied the Chi-square FSA to extract the best features from the datasets, and the ensemble and deep learning methods were applied to them. To improve the performance of the ensemble and deep learning methods, we applied ten-fold cross-validation. Finally, we computed the performance measures.

Dataset Choice and Illustration
The previous literature [20,22,36] used the code smell datasets from Fontana et al. [7] and obtained the best accuracy. They also examined systems from the Qualitas Corpus [42], release 20120401r, one of the most comprehensive compiled benchmark datasets to date, explicitly created for empirical software engineering research. Therefore, to conduct the experiment, we used four code smell datasets: DC, GC, FE, and LM [7]. Fontana et al. [7] selected 74 systems out of 111 of various dimensions and computed a large set of object-oriented metrics. For the 74 software systems, they calculated 61 metrics for the class-level code smells (DC and GC) and 82 metrics for the method-level code smells (FE and LM). They used various tools and approaches to detect code smells; Table 2 lists the automatic detection tools and techniques they used. Each dataset they created has 140 smelly and 280 non-smelly instances [23].

Table 2. Automatic detection tools and detection rules [7].

Code Smell   Reference, Tool/Detection Rules
GC           iPlasma (GC, Brain Class), PMD [43]
DC           iPlasma, Fluid Tool [44], Anti-Pattern Scanner [15]
FE           iPlasma, Fluid Tool [44]
LM           iPlasma (Brain Method), PMD, Marinescu detection rule [45]

Table 3 shows the class-level and method-level code smell datasets. Sixty-one metrics are computed for the class-level code smells (DC and GC), and eighty-two metrics are computed for the method-level code smells (FE and LM). These datasets can be downloaded from http://essere.disco.unimib.it/reverse/MLCSD.html (accessed on 2 August 2022).
Table 3. Class-level and method-level code smell datasets [23]. The code smell datasets are defined below.
DC: Classes that do not have enough functionality are called data classes. The term refers to classes that hold data with simple functionality and on which other classes strongly depend. Such a class exposes its data through accessor methods [30].

GC: The term refers to classes that have many functionalities. A GC can be described as a huge class with a large number of lines; it causes problems connected with big code size, coupling, and complexity [30].
FE: These methods use a lot of data from other classes rather than their own. They prefer to use the features of other classes, taking into account features accessed via accessor methods [30].
LM: These methods are the result of the human tendency to write new code instead of reading existing code. An LM has an excessive amount of code, is complex and hard to understand, and makes extensive use of data from other classes [30].

Dataset Normalization
These datasets have different feature ranges, so it is better to normalize the features before applying MLTs. In this paper, we applied the min-max feature scaling technique to rescale the feature values of the datasets to the range 0-1 [46]. Equation (1) shows the min-max formula, X' = (X - Xmin)/(Xmax - Xmin), where X is the original value and X' is the normalized value. The Xmin value of the feature is mapped to 0, the Xmax value is mapped to 1, and every other value is mapped to a decimal between 0 and 1.
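As an illustration, Equation (1) can be implemented in a few lines of pure Python; the metric values below are hypothetical, not taken from the datasets:

```python
def min_max_scale(values):
    """Rescale a list of metric values to the range [0, 1] (Equation (1))."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # A constant feature carries no information; map it to all zeros.
        return [0.0 for _ in values]
    return [(x - x_min) / (x_max - x_min) for x in values]

# Hypothetical lines-of-code values for three classes:
print(min_max_scale([10, 55, 100]))  # -> [0.0, 0.5, 1.0]
```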

Class Balancing Technique
In this study, we applied the SMOTE class balancing technique to balance the classes of each dataset. SMOTE is a well-known oversampling approach that was introduced to improve on random oversampling: instead of duplicating minority samples, it synthesizes new minority samples by interpolating between existing ones.
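The core idea of SMOTE can be sketched in pure Python: a synthetic minority sample is placed at a random point on the line segment between an existing minority sample and one of its k nearest minority neighbours. This is a simplified illustration, not the exact implementation used in our experiments:

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Sketch of SMOTE: synthesize n_new minority samples by interpolating
    between an existing sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # k nearest neighbours of a among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not a),
            key=lambda p: sum((x - y) ** 2 for x, y in zip(a, p)),
        )[:k]
        b = rng.choice(neighbours)
        gap = rng.random()  # random point on the segment between a and b
        synthetic.append(tuple(x + gap * (y - x) for x, y in zip(a, b)))
    return synthetic

# Three hypothetical minority (smelly) instances with two metric values each:
smelly = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0)]
print(len(smote(smelly, 4)))  # -> 4
```

Every synthetic point lies between two existing minority points, so the oversampled class stays inside the region the minority class already occupies.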

Feature Selection Approach (FSA)
An FSA is used to find the most significant features on which the response is highly dependent. Feature selection is one of the most important pre-processing steps in machine learning and is applied before a classification algorithm so that its performance can be improved. We used a Chi-square-based FSA to extract the best metrics for building our ensemble learning models.
The Chi-square FSA is generally applied to categorical data. It helps in selecting the best features by testing the relationship between each feature and the response. The Chi-square formula is shown in Equation (2): χ² = Σ (O - E)² / E, where, for the response and each independent variable, O is the observed frequency (the number of observations of the feature) and E is the expected frequency (the number of observations expected if the feature and the response were independent). The Chi-square statistic measures how far these two values deviate from each other: the greater the deviation, the stronger the dependence between the feature and the response.
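The statistic in Equation (2) is straightforward to compute; in the sketch below the contingency counts are hypothetical, chosen only to illustrate the calculation:

```python
def chi_square_score(observed, expected):
    """Chi-square statistic (Equation (2)): sum over cells of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Flattened 2x2 contingency table for one metric vs. the smell label:
# [smelly & metric-high, smelly & metric-low, clean & metric-high, clean & metric-low]
observed = [30, 10, 20, 40]
expected = [20, 20, 30, 30]  # counts expected under independence
print(round(chi_square_score(observed, expected), 2))  # -> 16.67
```

Ranking the metrics by this score and keeping the top k gives the subsets of 8-12 metrics used in our experiments.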
We extracted from each dataset the best metrics, i.e., those with which we obtained the highest accuracy. Sets of metrics (8, 9, 10, 11, 12, and all features) were extracted for each model and each dataset, keeping only the highest-scoring features according to the Chi-square FSA. Table 4 shows the metrics extracted by the Chi-square FSA, where the first feature has the highest score and the last feature the lowest. A detailed description of all selected metrics is given in Appendix A.

Proposed Ensemble and Deep Learning Algorithms
AdaBoost: Introduced by Yoav Freund and Robert Schapire, AdaBoost was the first popular boosting method for binary classification. A boosting method combines multiple "weak classifiers" into a single "strong classifier" [47].
Bagging: Bagging, also known as bootstrap aggregation, is a type of ensemble MLT that makes it easier to improve MLT performance and accuracy. It reduces the error of a prediction model mainly by reducing variance through averaging, exploiting the bias-variance trade-off. Bagging is used in regression and classification models to prevent over-fitting [48].
Max Voting: The Max voting method is an ensemble MLT that combines a set of base classifiers and outputs the class with the highest number of votes. It collects the outcome from each classifier submitted to a voting classifier and predicts the result class based on the majority of votes. Instead of building distinct single models and judging their performance separately, a single combined model is developed that trains numerous models and predicts outputs based on the collective number of votes for every output class [49].
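Hard majority voting, the core of the Max voting method, reduces to counting the base classifiers' votes per instance. A minimal sketch, using hypothetical predictions where 1 = smelly and 0 = clean:

```python
from collections import Counter

def max_vote(predictions):
    """Return the class predicted by the majority of base classifiers."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical votes from three base classifiers on two instances:
print(max_vote([1, 0, 1]))  # -> 1 (two of three classifiers say "smelly")
print(max_vote([0, 0, 1]))  # -> 0
```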
Gradient boosting: The gradient boosting (GB) algorithm is one of the most effective ensemble MLTs. Bias error and variance error are the two most common forms of error in MLTs, and GB is a boosting method that can be used to reduce the algorithm's bias error. The GB method is employed not just for continuous target variables, as in regression, but also for categorical target variables, as in classification. The mean square error (MSE) is the cost function when it is used for regression, while the log loss is the cost function when it is used for classification [50].
XGBoost: XGBoost, the extreme gradient boosting algorithm, is a tree-based MLT with better performance and speed. XGBoost was created by Tianqi Chen and is maintained mainly by the DMLC (Distributed Machine Learning Community) group. It has gained popularity by yielding desirable results on structured and tabular data.
Artificial neural network (ANN): The ANN, also known as a neural network (NN), is a mathematical model that draws on features of biological neural networks, including their structure and functionality. A neural network processes data through a network of interconnected artificial neurons [51]. The ANN has three layers: an input layer, a hidden layer, and an output layer, as shown in Figure 2.
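The three-layer structure can be illustrated with a single forward pass in pure Python; the weights below are hypothetical and, for brevity, biases are omitted:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Forward pass: input layer -> hidden layer (sigmoid) -> output neuron (sigmoid)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)))

# Two input metrics, two hidden neurons, one output score (hypothetical weights):
score = forward([0.5, 0.8], w_hidden=[[0.4, -0.6], [0.3, 0.9]], w_out=[1.2, -0.7])
assert 0.0 < score < 1.0  # the sigmoid output can be read as a smell probability
```

Training adjusts the weights by backpropagation; the forward pass above is only the prediction step.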

Convolutional neural network (CNN):
One of the best-known and most frequently used deep learning methods is the convolutional neural network (CNN). The CNN's key benefit over its forerunners is that it recognizes important features without human intervention, which has made it popular [53].

Evaluation Methodology
As the datasets are small, we used ten-fold cross-validation to obtain better estimates of model performance. Figure 3 shows the process of ten-fold cross-validation.
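The index bookkeeping behind k-fold cross-validation can be sketched as follows (a simplified, unshuffled and unstratified variant):

```python
def k_fold_splits(n_samples, k=10):
    """Partition sample indices into k folds; each fold is the test set exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

# With 420 instances (as in each Fontana et al. dataset) and k = 10,
# every split trains on 378 instances and tests on the remaining 42:
train, test = k_fold_splits(420, k=10)[0]
print(len(train), len(test))  # -> 378 42
```

The reported score is the mean over the ten test folds, so every instance contributes to evaluation exactly once.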

Key Measurements of Performance
In this study, we evaluated the performance of the various experiments. The confusion matrix (CM) was calculated; it stores the actual and predicted information produced by the code smell detection classifiers. From the CM, the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) were calculated. These are defined below:

• TP represents the instances where the algorithm correctly predicts the positive class.
• TN represents the instances where the algorithm correctly predicts the negative class.
• FP represents the instances where the algorithm incorrectly predicts the positive class.
• FN represents the instances where the algorithm incorrectly predicts the negative class.
The definitions and formulas of the evaluation metrics used to evaluate the models' performance (PPV, sensitivity, F-measure, AUC_ROC_score, accuracy, MCC, and Cohen_Kappa_score) are given below.
Positive predictive value (PPV): The positive predictive value measures the number of code smell instances correctly identified by the machine learning methods. PPV is also known as precision [55]. Formula (3), PPV = TP / (TP + FP), was used to calculate the positive predictive value: the number of TP divided by the total number of TP and FP.

Sensitivity: Sensitivity measures the number of code smell occurrences recognized by the machine learning methods. Sensitivity is also known as the true positive rate (TPR) or recall [55]. Formula (4), Sensitivity = TP / (TP + FN), was used to calculate the sensitivity: the number of TP divided by the total number of TP and FN.
F-measure: The F-measure is the harmonic mean of the positive predictive value (PPV) and sensitivity, and it stands for a balance between their values [55]. Formula (5), F-measure = 2 × PPV × Sensitivity / (PPV + Sensitivity), was used to calculate the F-measure.
AUC_ROC_Score: The AUC_ROC_score is used to observe the performance of a classification model based on its rates of correct and incorrect classifications. The ROC represents a probability curve, and the AUC measures the degree of separability [55]: it says how well the model is able to distinguish between classes. An outstanding model has an AUC near 1, indicating a good measure of separation. A bad model has an AUC near 0, indicating the worst measure of separation; in effect, it predicts 0s as 1s and 1s as 0s. When the AUC is 0.5, the model has no class separation ability at all.
Accuracy: Accuracy displays the percentage of positive and negative instances that were correctly classified [55]. Formula (6), Accuracy = (TP + TN) / (TP + TN + FP + FN), was used to calculate the accuracy: the total number of TP and TN divided by the total number of TP, TN, FP, and FN.
Matthews correlation coefficient (MCC): The MCC is used in MLT to determine the quality of a two-class, or binary, classification. It takes true and false positives and negatives into account and is generally regarded as a balanced measure that can be used even when the classes are of significantly different sizes [56]. The formula for calculating the MCC is given in Equation (7).
Cohen_Kappa_Score: Cohen's Kappa is a metric used to evaluate the agreement between two raters. It can also be used to evaluate the performance of a classification model [57]. The formula for calculating Cohen's Kappa is given in Equation (8).
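All of the threshold-based measures above (Equations (3)-(8), excluding the AUC) follow directly from the four confusion-matrix counts. A compact sketch, checked on a hypothetical confusion matrix:

```python
import math

def evaluate(tp, tn, fp, fn):
    """Compute PPV, sensitivity, F-measure, accuracy, MCC, and Cohen's Kappa."""
    n = tp + tn + fp + fn
    ppv = tp / (tp + fp)                          # Equation (3), precision
    sens = tp / (tp + fn)                         # Equation (4), recall / TPR
    f1 = 2 * ppv * sens / (ppv + sens)            # Equation (5), harmonic mean
    acc = (tp + tn) / n                           # Equation (6)
    mcc = (tp * tn - fp * fn) / math.sqrt(        # Equation (7)
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (acc - p_e) / (1 - p_e)               # Equation (8), chance-corrected
    return ppv, sens, f1, acc, mcc, kappa

# Hypothetical confusion matrix: 40 TP, 45 TN, 5 FP, 10 FN
ppv, sens, f1, acc, mcc, kappa = evaluate(40, 45, 5, 10)
print(round(acc, 2), round(kappa, 2))  # -> 0.85 0.7
```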

The Outcome of Proposed Algorithms
To answer RQ1, we implemented five ensemble and two deep learning algorithms and measured the performance accuracy of each. Additionally, the Chi-square FSA was applied to select the best metrics from each dataset; the metrics chosen are shown in Table 4. All experimental findings for each ensemble and deep learning method are presented in Tables 5-11. In each experiment table, we report seven performance measurements: PPV, Sensitivity, F-measure, AUC_ROC_score, Accuracy, MCC, and Cohen_Kappa_score. The performance comparison of all five ensemble and two deep learning algorithms is shown in Table 12. Note: F = F-score, AUC = AUC_ROC_Score, A = Accuracy.

Performance Comparison between Five Ensemble and Two Deep Learning Methods
This section compares the outcomes of all five ensemble and two deep learning techniques applied to the ten features selected by the Chi-square FSA. Table 12 shows the comparative performance of all applied techniques in terms of the AUC_ROC_score, F-measure, and accuracy. From Table 12 it is clear that the AdaBoost approach obtains its highest accuracy, 100%, for the FE and LM datasets, and its worst accuracy, 97.62%, for the GC dataset. The Bagging algorithm obtains its highest accuracy, 100%, for the FE dataset and its worst accuracy, 97.62%, for the GC dataset. The Max voting approach achieved its best accuracy, 98.81%, for the DC dataset and its worst accuracy, 97.62%, for the GC dataset. The Gradient boosting algorithm obtained its highest accuracy, 100%, for the LM dataset and its worst accuracy, 95.74%, for the FE dataset. The XGBoost approach obtained its highest accuracy, 100%, for the FE and LM datasets, and its worst accuracy, 97.62%, for the GC dataset. The ANN approach obtained its highest accuracy, 98.25%, for the LM dataset and its worst accuracy, 97.23%, for the GC dataset. The CNN approach obtained its highest accuracy, 99.26%, for the LM dataset and its worst accuracy, 97.23%, for the GC dataset.

Effect of Subset of Features Selected by Chi-Square FSA on Model Accuracy
The Chi-square FSA was used to answer RQ2. This experiment was performed to see how the Chi-square FSA identifies the software features that are important for recognizing code smells. Table 13 shows how the performance accuracy of the ensemble and deep learning approaches is affected when the number of selected metrics is increased by one at each step. The comparison indicates that feature extraction helps improve the accuracy of nearly all ensemble methods on all datasets, with slightly different effects on each model and each dataset. For some models, such as the Bagging, Gradient boosting, and XGBoost algorithms, the ANN, and the CNN, the FSA increases accuracy greatly; for the AdaBoost and Max voting algorithms, feature extraction had no significant effect.

Effect of Class Balancing Technique (SMOTE) on Model Accuracy
The SMOTE class balancing technique was used to answer RQ3. This experiment was performed to see the effect of SMOTE on the accuracy of code smell detection. Table 14 shows how the SMOTE class balancing technique affects the performance accuracy of each ensemble and deep learning method on each code smell dataset. The comparison indicates that SMOTE helps improve the accuracy of some models, such as the AdaBoost, GB, XGBoost, and ANN models, on the DC dataset. Likewise, the AdaBoost, Max voting, GB, and ANN models improve in accuracy on the GC dataset.

Result Comparison of Our Approach with Others' Correlated Work
A few other authors [22,30,31,33,42] also worked on the same code smell datasets, using machine learning and stacked ensemble learning algorithms. In this subsection, we compare the outcomes of our approach with these previous related works; the comparison is shown in Table 15. On the DC dataset, Dewangan et al. [23] achieved 99.74% accuracy with the RF technique employing all features. Fontana et al.'s [7] approach applied human-understandable detection rules for the J48 and JRip algorithms and found the highest accuracy, 99.02%, with the B-J48 Pruned approach. Mhawish et al. [25] used the GA-based FSA and found the highest accuracy, 99.70%, using the RF approach. Nucci et al. [36] applied the gain ratio FSA and found around 83% accuracy with the RF and J48 approaches. Alazba et al. [22] used the gain FSA and found the highest accuracy, 98.92%, using the Stack-LR algorithm, whereas our approach's accuracy was 100% using the Max voting algorithm with nine features.
On the GC dataset, Dewangan et al. [23] achieved 98.21% accuracy using the RF method with the Chi-square FSA. Fontana et al. [7] applied human-understandable detection rules for the J48 and JRip algorithms and found the highest accuracy, 97.55%, with the Naive Bayes algorithm. Mhawish et al. [25] used the GA-based FSA and found the highest accuracy, 98.48%, using the GBT model. Nucci et al. [36] applied the gain ratio FSA and found around 83% accuracy with the RF and J48 approaches. Alazba et al. [22] used the gain FSA and found the highest accuracy, 97%, using the Stack-SVM algorithm, whereas our experiment's accuracy was 99.24% with the Bagging approach using 12 features.
On the FE dataset, Dewangan et al. [23] used the Decision tree algorithm with all features and achieved 98.60% accuracy. Fontana et al.'s [7] approach applied human-understandable detection rules for the J48 and JRip algorithms and found the greatest accuracy, 96.64%, with the B-JRip approach. Mhawish et al. [25] used the GA-based FSA and found the greatest accuracy, 97.97%, with the Decision tree approach. Nucci et al. [36] applied the gain ratio FSA and obtained around 84% accuracy with the RF and J48 approaches. Alazba et al. [22] used the gain FSA and found the greatest accuracy, 95.38%, using the Stack-LR algorithm. Guggulothu et al. [20] converted the dataset into a multi-label dataset and found the greatest accuracy, 99.10%, with the B-J48 Pruned approach. In contrast, our approach's accuracy was 100% using all five algorithms (where the AdaBoost model used eight, nine, 10, 12, and all features; the Bagging model used 10 and 12 features; the Max voting model used 11 and all features; the GB model used eight and nine features; and the XGB model used 10 features).
On the LM dataset, Dewangan et al. [23] used the Logistic regression approach with all features and achieved 100% accuracy. Fontana et al.'s [7] approach applied human-understandable detection rules for the J48 and JRip algorithms and found the greatest accuracy, 99.43%, using the B-J48 Pruned approach. Mhawish et al. [25] used the GA-based FSA and found the greatest accuracy, 95.97%, with the RF approach. Nucci et al. [36] applied the gain ratio FSA and found 82% accuracy with the J48 and RF approaches. Guggulothu et al. [20] converted the dataset into a multi-label dataset and found the greatest accuracy, 95.90%, with the RF algorithm. Alazba et al. [22] used the gain FSA and found the greatest accuracy, 99.24%, using the Stack-SVM algorithm. In contrast, our approach's accuracy was 100% using all five algorithms (where the AdaBoost model used eight, nine, 10, 12, and all features; the Bagging model used 11, 12, and all features; the Max voting model used nine and all features; the GB model used eight, nine, 10, 11, 12, and all features; and the XGB model used eight, nine, and 10 features).

Analysis of Our Work
In this paper, we have mainly focused on ensemble and deep learning algorithms, the SMOTE balancing technique, and the Chi-square FSA. In this experiment, we established that the ensemble algorithms give the best results compared to our previous work [23], in which six MLTs were applied. The outcomes obtained from the five ensemble and two deep learning algorithms are shown in Tables 5-11, and all comparisons of our outcomes with previous works are shown in Table 15. We used seven performance measurements for evaluating the models' performance: PPV, Sensitivity, F-measure, AUC_ROC_score, Accuracy, MCC, and Cohen_Kappa_score. All proposed ensemble algorithms produced excellent results for the DC, FE, and LM datasets. This research work answers the research questions discussed in the introduction.
To answer RQ1, five ensemble and two deep learning models were applied; the ensemble approaches proved to be quite good at predicting code smells. To answer RQ2, the Chi-square FSA was applied to improve the performance accuracy. The best metrics identified by the Chi-square FSA are shown in Table 4. The results indicate that it improves the accuracy of all ensemble methods for all datasets, although the improvement differs for each model and dataset combination: the Bagging, Gradient boosting, and XGBoost algorithms give the best accuracy, but feature extraction has no significant effect on the Max voting algorithm. Our implemented code and datasets are available at https://github.com/seemadewangan/AdaBoost-Model-with-Chisquare-.git (accessed on 4 September 2022). To answer RQ3, we applied the SMOTE balancing technique.

Result and All Model Comparison of Our Approach with Other Correlated Works
This subsection presents the models applied by various authors and the greatest accuracy they obtained on the same datasets (Data class, God class, Feature envy, and Long method). The previous literature [7,20,22,36,37] proposed various machine learning and ensemble learning algorithms on the same datasets (Fontana et al. [7]), and each author found different results. Table 16 lists all the models applied in the previous literature.
Various authors applied machine learning and ensemble learning algorithms to the data class dataset. First, Fontana et al. [7] created this dataset and applied machine learning to it, finding the greatest accuracy, 99.02%, using the B-J48 Pruned algorithm. After that, Mhawish et al. [25] obtained the greatest accuracy, 99.70%, using the RF model; they applied deep learning and five other machine learning algorithms with a genetic-algorithm-based FSA and grid-search-based parameter optimization techniques. Dewangan et al. [23] also applied six machine learning algorithms with the Chi-square and wrapper-based FSAs and grid search parameter optimization and obtained the greatest accuracy, 99.74%, using RF. However, in the earlier literature, the authors neither handled class imbalance nor studied the performance of boosting and bagging ensemble learning algorithms. Therefore, in this work, we applied five ensemble learning and two deep learning algorithms with the Chi-square FSA and the SMOTE class balancing technique, and we obtained 100% accuracy using the Max voting algorithm. In this way, we found that ensemble learning is the best approach for detecting code smells in the data class dataset.
For the god class dataset, Fontana et al. [7] first created the dataset and applied sixteen machine learning models to it, finding the greatest accuracy, 97.55%, using the Naïve Bayes algorithm. After that, Mhawish et al. [25] obtained the greatest accuracy, 98.48%, using the GBT model; they applied deep learning and five other machine learning algorithms with a genetic-algorithm-based FSA and grid-search-based parameter optimization techniques. Dewangan et al. [23] also applied six machine learning algorithms with the Chi-square and wrapper-based FSAs and grid search parameter optimization and obtained the greatest accuracy, 98.21%, using RF, which is not a good result compared to the previous literature. Again, the earlier literature neither handled class imbalance nor studied the performance of boosting and bagging ensemble learning algorithms. Therefore, in this work, we applied five ensemble learning and two deep learning algorithms with the Chi-square FSA and the SMOTE class balancing technique, and we obtained 99.24% accuracy using the Bagging algorithm. In this way, we found that ensemble learning is the best approach for detecting code smells in the god class dataset.
Various authors applied machine learning and ensemble learning algorithms to the Feature envy dataset. Fontana et al. [7], who created this dataset, first applied machine learning to it and obtained the highest accuracy of 96.64% using the B-JRip algorithm. Mahvish et al. [25] then obtained the highest accuracy of 97.97% using the DT model; they applied deep learning and five other machine learning algorithms with a genetic algorithm-based FSA and Grid search-based parameter optimization. Guggulothu et al. [20] applied five machine learning algorithms and obtained the highest accuracy of 99.10% using the B-J48 Pruned algorithm. Dewangan et al. [23] applied six machine learning algorithms with Chi-square and Wrapper-based FSAs and Grid search parameter optimization and obtained the highest accuracy of 98.60% using the DT model, which did not improve on the previous literature. However, the earlier literature neither handled class imbalance nor studied the performance of boosting and bagging ensemble learning algorithms. Therefore, in this work, we applied five ensemble learning and two deep learning algorithms with the Chi-square FSA and the SMOTE class balancing technique, and obtained 100% accuracy using all five ensemble models (AdaBoost, Bagging, Max voting, GB, XGBoost). In this way, we found that ensemble learning is the best approach for detecting code smells in the Feature envy dataset.
For the Long method dataset, too, various authors applied machine learning and ensemble learning algorithms. Fontana et al. [7], who created this dataset, first applied machine learning to it and obtained the highest accuracy of 99.43% using the B-J48 Pruned algorithm. Alazba et al. [22] then obtained the highest accuracy of 99.24% using the Stack-SVM model; they applied 17 machine learning algorithms with the gain FSA. Dewangan et al. [23] applied six machine learning algorithms with Chi-square and Wrapper-based FSAs and Grid search parameter optimization and obtained the highest accuracy of 100% using LR. However, the earlier literature neither handled class imbalance nor studied the performance of boosting and bagging ensemble learning algorithms. Therefore, in this work, we applied five ensemble learning and two deep learning algorithms with the Chi-square FSA and the SMOTE class balancing technique, and obtained 100% accuracy using all five ensemble models (AdaBoost, Bagging, Max voting, GB, XGBoost). In this way, we found that ensemble learning is the best approach for detecting code smells in the Long method dataset.
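The workflow described above is the same for all four datasets: balance the classes, select metrics with the Chi-square test, then train ensemble learners. The sketch below illustrates this pipeline on synthetic data with scikit-learn only; the random-oversampling step is a simplified stand-in for SMOTE (which interpolates new minority samples rather than duplicating existing ones), and the dataset shape and feature count are illustrative, not the actual Fontana et al. metrics.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, VotingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a code smell dataset (420 instances, imbalanced).
X, y = make_classification(n_samples=420, n_features=20,
                           weights=[0.67, 0.33], random_state=0)
X = np.abs(X)  # the chi2 score function requires non-negative features

# Step 1: naive random oversampling of the minority class -- a simplified
# placeholder for SMOTE (imblearn.over_sampling.SMOTE in practice).
minority = np.where(y == 1)[0]
n_extra = (y == 0).sum() - minority.size
extra = np.random.default_rng(0).choice(minority, size=n_extra)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

# Step 2: Chi-square feature selection keeps the k most relevant metrics.
X_sel = SelectKBest(chi2, k=12).fit_transform(X_bal, y_bal)

# Step 3: train ensemble learners combined by (hard) max voting.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_sel, y_bal, test_size=0.3, random_state=0, stratify=y_bal)
voting = VotingClassifier([
    ("ada", AdaBoostClassifier(random_state=0)),
    ("bag", BaggingClassifier(random_state=0)),
    ("gb", GradientBoostingClassifier(random_state=0)),
])
voting.fit(X_tr, y_tr)
acc = accuracy_score(y_te, voting.predict(X_te))
print(f"max-voting accuracy on synthetic data: {acc:.3f}")
```

Hard voting, as used here, takes the majority class label across the base learners; the accuracy printed refers only to the synthetic data, not to the results reported in this paper.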

Statistical Analysis
We used a paired t-test to determine whether there was a statistically significant difference between two classifiers, so that only the better one need be employed. A paired t-test requires N different test sets on which each classifier is evaluated; we used ten-fold cross-validation to obtain N = 10 test sets. The accuracy of each classifier on each code smell dataset is shown in Table 17. We performed the statistical analysis with ten-fold cross-validation on the F-measure using a paired t-test. We observed that for the Data class dataset, the Gradient Boosting and AdaBoost algorithms scored highest. For the God class dataset, the Max voting algorithm scored highest. For the Feature envy dataset, the Max voting and XGBoost algorithms scored highest. For the Long method dataset, the Max voting algorithm scored highest. Therefore, the Max voting classifier is the best choice for code smell detection.
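The comparison above can be sketched with `scipy.stats.ttest_rel`, which implements exactly this paired test. The per-fold F-measures below are illustrative values, not the numbers from Table 17; the pairing is valid only because both classifiers are scored on the identical ten folds.

```python
import numpy as np
from scipy.stats import ttest_rel

# Per-fold F-measures of two classifiers over the same N = 10 CV test sets
# (illustrative values, not the results from Table 17).
f1_max_voting = np.array([0.99, 0.98, 1.00, 0.99, 0.97,
                          1.00, 0.99, 0.98, 1.00, 0.99])
f1_adaboost   = np.array([0.97, 0.96, 0.99, 0.98, 0.95,
                          0.99, 0.97, 0.96, 0.98, 0.97])

# Paired t-test on the fold-wise differences.
t_stat, p_value = ttest_rel(f1_max_voting, f1_adaboost)
print(f"t = {t_stat:.3f}, p = {p_value:.5f}")
significant = p_value < 0.05
print("significant at alpha = 0.05:", significant)
```

A p-value below the chosen significance level (commonly 0.05) indicates that the fold-wise difference between the two classifiers is unlikely to be due to chance.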

Threats to Validity
This subsection discusses threats to internal validity, external validity, and conclusion validity. One threat to internal validity is the dataset itself, the most serious internal threat to our experiment. As mentioned above, Fontana et al. [7] developed the dataset that we used in this study. They built it by employing code smell consultants to choose candidates from a large collection of 74 diverse software applications (the Qualitas Corpus) and carefully labeling 420 instances for each code smell. Many metrics (features) were computed to create this collection of datasets, and each of these metrics may or may not affect the outcomes of the applied models. The second threat to internal validity is the feature selection technique we used, the Chi-square FSA. Chi-square is sensitive to small frequencies in the features considered: in general, when an expected value in a feature is less than five, the Chi-square test can lead to erroneous conclusions. To address this threat, we analyzed the datasets and found that this condition does not occur in our data.
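The expected-frequency condition mentioned above can be checked directly before trusting a Chi-square result. The sketch below uses `scipy.stats.chi2_contingency` on a contingency table of a discretized metric against the smell label; the counts are illustrative, not taken from the actual datasets.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table: rows = bins of a discretized metric,
# columns = smelly / non-smelly label (illustrative counts).
table = np.array([[30, 12],
                  [25, 18],
                  [45, 20]])

chi2_stat, p_value, dof, expected = chi2_contingency(table)

# Rule of thumb: the Chi-square approximation becomes unreliable
# when any expected cell count falls below 5.
reliable = bool((expected >= 5).all())
print(f"chi2 = {chi2_stat:.2f}, p = {p_value:.4f}, "
      f"all expected >= 5: {reliable}")
```

If `reliable` were False for some feature, that feature's Chi-square score should be treated with caution or the small categories merged before testing.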
The threats to external validity in our study are as follows. The first is that the dataset covers only two kinds of code smells: class-level and method-level smells. The second concerns the software applications used to produce the dataset, which are written entirely in Java. As a result, our technique may not transfer to programs written in C or C++.
The threats to conclusion validity relate to evaluating model performance: the metrics we used may not suffice. We tried to manage this threat by using multiple evaluation metrics, namely PPV, Sensitivity, F-measure, AUC_ROC_score, Accuracy, MCC, and Cohen_Kappa_score, and further strengthened the evaluation with ten-fold cross-validation.

Conclusions and Future Scope
This paper proposed ensemble and deep learning methods to detect code smells. Four code smell datasets, DC, GC, FE, and LM, created by Fontana et al. [7] from 74 open-source systems, were used. The Chi-square FSA was applied to select the best metrics from each dataset to improve accuracy.
Five ensemble MLTs (AdaBoost, Bagging, Max Voting, Gradient Boosting, XGBoost) and two deep learning models (ANN and CNN) were applied to detect the code smells. This research work proceeded in two stages: (i) in the first, ensemble approaches were applied to detect the code smells, and (ii) in the second, seven performance measurements (PPV, Sensitivity, F-measure, AUC_ROC_score, Accuracy, MCC, and Cohen_Kappa_score) were computed to compare the ensemble MLTs. The Chi-square FSA with a ten-fold cross-validation approach was used to improve accuracy.
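The seven performance measurements named above all have direct scikit-learn equivalents, as the sketch below shows on synthetic data under ten-fold cross-validation (the dataset and the Gradient Boosting choice are illustrative, not the paper's actual configuration).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, accuracy_score,
                             matthews_corrcoef, cohen_kappa_score)

# Illustrative data standing in for one of the code smell datasets.
X, y = make_classification(n_samples=420, n_features=20, random_state=0)

clf = GradientBoostingClassifier(random_state=0)
# Out-of-fold predictions under ten-fold cross-validation.
y_pred = cross_val_predict(clf, X, y, cv=10)
y_prob = cross_val_predict(clf, X, y, cv=10, method="predict_proba")[:, 1]

scores = {
    "PPV (precision)": precision_score(y, y_pred),
    "Sensitivity (recall)": recall_score(y, y_pred),
    "F-measure": f1_score(y, y_pred),
    "AUC_ROC_score": roc_auc_score(y, y_prob),  # needs probabilities
    "Accuracy": accuracy_score(y, y_pred),
    "MCC": matthews_corrcoef(y, y_pred),
    "Cohen_Kappa_score": cohen_kappa_score(y, y_pred),
}
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Note that AUC_ROC is computed from predicted probabilities rather than hard labels, while the remaining six measures operate on the out-of-fold class predictions.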
The AdaBoost algorithm achieved the highest accuracy, 100%, for the FE and LM datasets when the number of selected features was eight, nine, 10, or 12, and with the whole set of metrics, while its worst accuracy, 95.23%, occurred for the GC dataset with eight selected features.
The Bagging algorithm achieved the highest accuracy, 100%, for the FE dataset (with 10 and 12 selected features) and the LM dataset (with 11 and 12 selected features and the whole set of metrics). Its worst accuracy, 97.62%, occurred for the DC dataset (with nine, 11, and 12 selected features) and the GC dataset (with nine and 10 selected features).
The Max Voting algorithm obtained the highest accuracy, 100%, for the DC dataset (with nine selected features), the FE dataset (with 11 selected features and the whole set of metrics), and the LM dataset (with nine selected features and the whole set of metrics). Its worst accuracy, 91.45%, occurred for the FE dataset with 12 selected features.
The Gradient Boosting algorithm obtained the highest accuracy, 100%, for the FE dataset (with eight and nine selected features) and the LM dataset (with eight, nine, 10, 11, and 12 selected features and the whole set of metrics), while its worst accuracy, 95.74%, occurred for the FE dataset with 10 and 11 selected features.
The XGBoost algorithm obtained the highest accuracy, 100%, for the FE dataset (with 10 selected features) and the LM dataset (with eight, nine, and 10 selected features), while its worst accuracy, 96.42%, occurred for the GC dataset with eight selected features.
The ANN algorithm obtained its highest accuracy, 99.12%, for the DC dataset (with all features selected), while its worst accuracy, 97.12%, occurred for the GC dataset with eight selected features.
The CNN algorithm obtained its highest accuracy, 99.56%, for the FE dataset (with 12 selected features), while its worst accuracy, 97.12%, occurred for the GC dataset with eight selected features.
We considered two kinds of smells in this paper, class-level and method-level smells, with a limited set of features. The paper presents several experiments that should interest software developers as well as research practitioners working in this or a similar domain.
In future work, we plan to improve the results by applying data augmentation techniques. Other learning algorithms and feature selection techniques should also be explored to find the best techniques for code smell detection.

Table 1 .
Previous work on code smells detection.

Table 2 .
Automatic Detector Tools and Techniques (ADVISORS)

Table 5 .
Results of AdaBoost algorithm.

Table 6 .
Results of Bagging algorithm.

Table 7 .
Results of Max voting algorithm.

Table 8 .
Results of Gradient boosting algorithm.

Table 10 .
Results of ANN algorithm.

Table 11 .
Results of CNN algorithm.

Table 12 .
Performance comparison between the five ensemble and two deep learning methods on the code smell datasets.

Table 13 .
Outcomes of Ensemble algorithms for different sets of selected features.

Table 14 .
Outcomes of Ensemble algorithms with and without applied SMOTE.

Table 15 .
Result comparison of our approach with other correlated works.

Table 16 .
Result and all model comparison of our approach with other correlated works.