Empirical Assessment of Machine Learning Techniques for Software Requirements Risk Prediction

Abstract: Software risk prediction is among the most sensitive and crucial activities of the Software Development Life Cycle (SDLC), as it can determine the success or failure of a project. Risks should be predicted early to make a software project successful. A model is proposed for the prediction of software requirement risks using a requirement risk dataset and machine learning techniques. In addition, a comparison is made among multiple classifiers, namely K-Nearest Neighbour (KNN), Average One Dependency Estimator (A1DE), Naïve Bayes (NB), Composite Hypercube on Iterated Random Projection (CHIRP), Decision Table (DT), Decision Table/Naïve Bayes Hybrid Classifier (DTNB), Credal Decision Trees (CDT), Cost-Sensitive Decision Forest (CS-Forest), J48 Decision Tree (J48), and Random Forest (RF), to find the technique best suited to the model given the nature of the dataset. These techniques are evaluated using various evaluation metrics including Correctly Classified Instances (CCI), Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Relative Absolute Error (RAE), Root Relative Squared Error (RRSE), precision, recall, F-measure, Matthew's Correlation Coefficient (MCC), Receiver Operating Characteristic area (ROC area), Precision-Recall Curve area (PRC area), and accuracy. The overall outcome of this study shows that, in terms of reducing error rates, CDT outperforms the other techniques, achieving 0.013 for MAE, 0.089 for RMSE, 4.498% for RAE, and 23.741% for RRSE. However, in terms of increasing accuracy, DT, DTNB, and CDT achieve the best results.


Introduction
The development of software consistently faces uncertain events that may have a negative impact on the success of software development; such events are called software risks [1]. These risks have a vital impact on software requirements and can become the cause of damage to the software or its stakeholders [1][2][3]. The Standish Group research illustrates that an astounding 31.1% of projects terminate before completion, while merely 16.2% of software projects are completed on time and within budget. In bigger organizations the result is even worse: only 9% of projects are deployed on time and within budget. Moreover, many of the completed projects are no more than a shadow of their original specification: projects completed by the largest American companies retain only approximately 42% of the originally proposed functionality [4]. Risk assessment basically has two strategies, proactive and reactive. According to the literature, the reactive strategy is not a mature strategy for assessing risk, since it increases the budget, schedule, and resources while degrading the quality and success of the project. Therefore, a proactive strategy has been employed, using machine learning techniques to reduce the chances of project failure [5]. Prediction of risks at this early stage is more beneficial and raises the productivity of the software; it also helps reduce the probability of software project failure when risks are managed properly during the requirements gathering phase [1].
In our preliminary paper [1], a risk dataset was proposed that comprises software requirements, risk attributes, and the relations between them, which are also necessary for the prediction of risks in new software projects. The dataset can be used to predict risks over the Software Requirements Specification (SRS) of a project using machine learning techniques.
In this study, the main focus is on the empirical analysis of ten Machine Learning (ML) techniques. These include: K-Nearest Neighbours (KNN), Average One Dependency Estimator (A1DE), Naïve Bayes (NB), Composite Hypercube on Iterated Random Projection (CHIRP), Decision Table (DT), Decision Table/Naïve Bayes Hybrid Classifier (DTNB), Credal Decision Tree (CDT), CS-Forest, J48, and Random Forest (RF). All these techniques are explored for the first time for risk prediction and are employed on our previously proposed dataset [6].
It is relevant to note that the 10 ML techniques were chosen on the basis of a pre-selection process, in which these classifiers were compared with respect to their effectiveness, such as fast training/build time and accuracy on different datasets, as shown in Table 1. Moreover, these classifiers were tested and observed on the basis of their CCI on the risk dataset [1], using a 10-fold cross-validation test, for selection into the further study. It is necessary to mention that Neural Network (NN)-based classifiers were not considered in this paper. NN-based techniques have some disadvantages: they usually require a large amount of input data, produce unexpected behaviour because they act as black boxes, may take a long time to learn with no guarantee of success, make it difficult to determine the proper network structure, and depend on the format of the input [16,17]. NNs typically need thousands or millions of labeled samples [17,18], whereas the risk dataset contains only 299 instances with which to train an NN. In fact, in our preliminary selection process the CCI obtained was not good enough to keep NNs in this study, therefore we left them out. The pre-selection comparison is presented in Table 2.

Overview of Risk Prediction
This research aims to resolve the predicament of late identification of risks and their effect on the quality, schedule, and budget of an ongoing software project. Most recent risk prediction methodologies measure software risks in later stages, typically from the software design phase or the code of the software life cycle; these methodologies can identify risks but have limited ability to prevent those risks from occurring [1]. Risks are caused by several factors during the SDLC and can lead to the failure of the software project [2][3][4]. These factors are considered to be among the main causes of software failure; they result from weak application of software engineering principles and techniques. They should be resolved as soon as possible to lessen the chance of unexpected failure of the software project.
In our previous paper, we concluded that no existing dataset contains the essential attributes for software requirement risks. Therefore, in that paper we introduced a dataset containing attributes of, and relations between, software requirements and risks [1]. Moreover, a model is needed to predict risk from the risk dataset using machine learning classification techniques.
The core aspects of the software project which need to be improved, by predicting risk earlier, are the quality of the overall project, the budget, and the schedule. The risk prediction model helps to identify the risk level of an instance (software requirement) of a new project using the risk dataset. The project/risk manager is then able to customize and manage the overall risk prediction process. The basic model of risk prediction using ML techniques contains four main components, as demonstrated in Figure 1 and briefly discussed below.
• Risk Identification: The first stage of the Software Risk Prediction Model is Risk Identification, where the risk manager/project manager identifies the hazards that may disrupt the schedule, resources, or costs of the project. A hazard is an unfortunate event that, if it occurs, is detrimental to the successful completion of the project. Identification is performed using a "checklist": the requirements from the SRS that carry risk threats are marked for further analysis, and the checklist is then passed to the next stage [5].
• Risk Analysis: The identified risks are converted into decision-making information. Through risk analysis, the probability and the significance of each identified risk is assessed [5]: each risk is measured and a decision is made about its likelihood and seriousness. The attribute "Risk Level" in the risk dataset helps to classify the requirements among five risk levels [1].
• Risk Prioritization: This is the output stage of the model, where the analysed risks are prioritized. Requirements with a high risk level move up the list, while requirements with a low risk level drop to the bottom.
• Requirement Risk Dataset: The dataset contains risk measures for software requirements and is available on Zenodo [6]. It comprises attributes related to the risks and requirements of a software project [1]. In the proposed model, the requirement set is input to the model to produce the risk level of the requirements, by which the project manager/domain expert can easily plan and mitigate risks at earlier stages.
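As an illustration of the prioritization step above, the following sketch sorts requirements by a predicted risk level so that the highest-risk items surface first. The requirement texts and the numeric ordering of the five levels are invented placeholders, not taken from the dataset.

```python
# Hypothetical sketch of risk prioritization: requirements tagged with a
# predicted risk level are sorted so the highest-risk items come first.
LEVELS = {"very high": 5, "high": 4, "medium": 3, "low": 2, "very low": 1}

requirements = [
    ("The system shall export reports", "low"),
    ("The system shall process payments", "very high"),
    ("The system shall log user actions", "medium"),
]

# Sort descending by the numeric rank of each requirement's risk level.
prioritized = sorted(requirements, key=lambda r: LEVELS[r[1]], reverse=True)
for text, level in prioritized:
    print(f"[{level}] {text}")
```

The project manager would then address the items at the top of the list first.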

The Research Framework
The risk dataset comprises attributes that are related to the risks and requirements of a software project [1]. These risks have a negative impact on the success of software development. In the proposed model, the requirement set is input to the model to produce the risk level of the requirements, by which the project manager/domain expert can easily plan and mitigate risks at earlier stages.
The research has two main phases, i.e., a risk prediction model for software requirements, and model evaluation and comparison. The construction and filtration of the risk-oriented dataset, a prerequisite for the proposed model, was completed in our preliminary work [1]; the dataset is openly available for further exploration and improvement. The proposed research framework is demonstrated in Figure 2. The first step of the research is preparation of the model for requirement risk prediction; here we chose the classifiers on the basis of Table 1. The second step is model evaluation and comparison, in which we evaluated the model on the chosen classifiers using the risk dataset in order to select the most suitable classifier. In the third step the proposed model is validated using a new set of requirements. The last step contains the results of the model validation.

Model Evaluation and Comparison
In this phase, KNN, A1DE, NB, CHIRP, DT, DTNB, CDT, CS-Forest, J48, and RF are employed on the risk dataset. We trained these models in the WEKA (Waikato Environment for Knowledge Analysis) tool, tuned the parameters of every algorithm to obtain the best output, validated them using 10-fold cross-validation on the discrete dataset [1], and supplied a sample set of 299 instances to the model. The 10-fold cross-validation test is applied to every classifier to assess its performance. The provided learning set is split into 10 distinct subsets of similar size; the "fold" refers to the number of subsets created. In each iteration one subset is held out for testing while the others are used for training, and this loop is repeated until every subset has been used as the test set [19]. We applied smaller and larger numbers of folds to the dataset, but these mostly decreased efficiency owing to the changed proportion of training and testing data, whereas 10-fold cross-validation performed better than any other choice of k. Therefore, the 10-fold cross-validation method was selected to validate the models and to avoid overfitting and underfitting during the training process [19]. The evaluation is done on the basis of CCI, MAE [20,21], RMSE [20][21][22], RAE [23], RRSE [23], precision [21,24], recall [21], F-measure [21], MCC [25,26], ROC area [27], PRC area [28], and accuracy [27,29,30]. The results are presented in Section 5.
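The fold mechanics described above can be sketched as follows. This is a minimal illustration, not the WEKA procedure itself: the labels are invented stand-ins for the 299-instance risk dataset, and a trivial majority-class learner stands in for the real classifiers.

```python
# Minimal sketch of the 10-fold cross-validation loop: split the data into
# 10 folds, hold each fold out for testing once, train on the rest.
import random
from collections import Counter

random.seed(42)
labels = ["low"] * 150 + ["high"] * 149   # invented stand-in for 299 instances
random.shuffle(labels)

def ten_fold_indices(n, k=10):
    # Split indices 0..n-1 into k contiguous folds of near-equal size.
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(range(start, start + size))
        start += size
    return folds

accuracies = []
for test_idx in ten_fold_indices(len(labels)):
    test_set = set(test_idx)
    train = [labels[i] for i in range(len(labels)) if i not in test_set]
    majority = Counter(train).most_common(1)[0][0]   # "train" a trivial model
    correct = sum(labels[i] == majority for i in test_idx)
    accuracies.append(correct / len(test_idx))

print(f"mean accuracy over {len(accuracies)} folds: "
      f"{sum(accuracies) / len(accuracies):.3f}")
```

Each instance is tested exactly once, which is why the per-fold accuracies can be averaged into a single estimate.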

Requirement Risk Dataset
The dataset is taken from [1]. It contains 299 instances (requirements) collectively, as mentioned in Table 3. The data types of the attributes were assigned according to the nature and value set of the data, which were obtained from Boehm's risk management [1,31,32], as shown in Table 4.

K-Nearest Neighbour
KNN classifiers are instance-based learning techniques in which the training feature vectors are used to predict the class of an unseen test instance. Classification of a test instance is based on the majority class among its neighbours (training examples): the test instance is assigned the class label that occurs most frequently among the training examples surrounding it [3,7,33]. Tables 5 and 6, respectively, present the error rate and accuracy outcomes achieved by KNN for several values of k, with the best results among them reported. In Table 6, the first row presents the accuracy measures while rows 2-5 present the individual outcomes of each assessment measure. The percentage of CCI is taken as the accuracy outcome.
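A minimal sketch of the majority-vote mechanism described above, with invented two-dimensional feature vectors and risk labels:

```python
# k-nearest-neighbour classification by majority vote: rank the training
# points by distance to the query, then vote among the k closest.
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    # Sort (distance, label) pairs by Euclidean distance to the query.
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Majority vote among the k nearest neighbours.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train_X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9), (5.1, 5.3)]
train_y = ["low-risk", "low-risk", "high-risk", "high-risk", "high-risk"]
print(knn_predict(train_X, train_y, (1.1, 0.9), k=3))  # → low-risk
```

With k=3, the two nearby low-risk examples outvote the single closest high-risk one.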

Average One Dependency Estimator
A1DE is a probabilistic technique used for standard classification problems. It achieves highly accurate classification by averaging over a small space of different Naïve Bayes (NB)-like models that make weaker independence assumptions than NB. A1DE was essentially designed to address the attribute-independence problem of the standard NB classifier [34]. Table 7 presents the error rate details with CCI, while Table 8 presents the overall accuracy outcomes achieved via A1DE.
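A simplified, illustrative sketch of the averaging-over-one-dependence idea follows: one NB-like model is built per "super-parent" attribute and their scores are summed. The Laplace-style smoothing and the tiny categorical dataset are invented for demonstration and do not reproduce the exact estimator of [34].

```python
# Illustrative averaged one-dependence estimator for categorical data:
# for each class, average the scores of one super-parent model per attribute.
from collections import Counter

def aode_fit(X, y):
    counts = Counter()  # (class, parent_i, parent_val, child_j, child_val)
    pair = Counter()    # (class, feature_i, value)
    for xi, yi in zip(X, y):
        for i, vi in enumerate(xi):
            pair[(yi, i, vi)] += 1
            for j, vj in enumerate(xi):
                counts[(yi, i, vi, j, vj)] += 1
    return counts, pair, sorted(set(y)), len(X)

def aode_predict(model, x):
    counts, pair, classes, n = model
    scores = {}
    for c in classes:
        score = 0.0
        for i, vi in enumerate(x):  # one super-parent model per attribute
            p = (pair[(c, i, vi)] + 1) / (n + 2)  # smoothed P(c, x_i)
            for j, vj in enumerate(x):  # P(x_j | c, x_i), smoothed
                p *= (counts[(c, i, vi, j, vj)] + 1) / (pair[(c, i, vi)] + 2)
            score += p  # sum (i.e., unnormalised average) over models
        scores[c] = score
    return max(scores, key=scores.get)

X = [("high", "new"), ("high", "new"), ("low", "known"), ("low", "known")]
y = ["risky", "risky", "safe", "safe"]
model = aode_fit(X, y)
print(aode_predict(model, ("high", "new")))  # → risky
```

Because every attribute takes a turn as the parent, no single (possibly wrong) independence assumption dominates the prediction.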

Naïve Bayes
The NB classifier is a probabilistic statistical classifier. The term "naive" indicates a restrictive independence assumption among attributes or features. This naive assumption reduces the computational complexity to a simple multiplication of likelihoods. One main advantage of the NB classifier is its speed of use: it is among the simplest of the classification techniques, and as a result of this simplicity it can promptly deal with a dataset with many attributes [35,36]. The CCI and error rate achieved by NB are presented in Table 9, while all the accuracy outcomes are listed in Table 10.
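A toy illustration of how the independence assumption reduces classification to a product of likelihoods. The prior and per-feature likelihood values are invented, not estimated from the risk dataset.

```python
# Under the naive assumption, P(class | features) is proportional to the
# class prior multiplied by the per-feature likelihoods.
def nb_score(prior, likelihoods):
    score = prior
    for p in likelihoods:
        score *= p
    return score

# Invented values: P(risky)=0.4 with feature likelihoods 0.7 and 0.9;
# P(safe)=0.6 with feature likelihoods 0.2 and 0.3.
risky = nb_score(0.4, [0.7, 0.9])   # 0.4 * 0.7 * 0.9 = 0.252
safe = nb_score(0.6, [0.2, 0.3])    # 0.6 * 0.2 * 0.3 = 0.036
print("risky" if risky > safe else "safe")  # → risky
```

No joint distribution over feature combinations is ever needed, which is exactly why NB scales so cheaply with the number of attributes.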

Composite Hypercube on Iterated Random Projection
CHIRP is a classifier designed to address the curse of dimensionality: it iteratively projects the data onto low-dimensional random subspaces, bins the projections into two-dimensional histograms, and covers class-dense regions with composite hypercubes that serve as description regions for classification [9]. Tables 11 and 12, respectively, show the error rate and accuracy details accomplished via CHIRP.

Decision Table
Decision tables are used to model complicated logic. They make it easy to see that all possible combinations of conditions have been considered and, when conditions are missed, to spot them. In a DT model, a categorized structure containing two features at each level is constructed. Moreover, it contains three components: condition rows, action rows, and rules [11,37]. The best performance of DT in terms of reducing the error rate is listed in Table 13, while accuracy details are presented in Table 14.

Decision Table/Naïve Bayes Hybrid Classifier
DTNB is a class for building and using a Decision Table/Naïve Bayes hybrid classifier. At each point in the search, the algorithm assesses the quality of partitioning the attributes into two disjoint subsets: one for the DT and the other for NB. A forward selection search is used: initially all attributes are modeled by the DT, and at each step selected attributes are modeled by NB while the remainder stay with the DT. At each step, the algorithm also considers dropping an attribute entirely from the model [11]. In this research, the outcomes in terms of reducing the error rate, along with CCI, are presented in Table 15, while the accuracy details are presented in Table 16.

Credal Decision Trees
CDT is a technique for designing classifiers based on imprecise probabilities and uncertainty measures. During the construction of a CDT, to avoid producing an overly complex decision tree, a new criterion was introduced: stop splitting once the total uncertainty would increase as a result of splitting the decision tree. The function used to measure total uncertainty can be expressed briefly [38,39]. Table 17 shows the error rate details along with the CCI results, while Table 18 presents the accuracy details achieved via CDT.

Cost-Sensitive Decision Forest
CS-Forest performs cost-sensitive pruning as a substitute for the pruning used by C4.5. C4.5 prunes a tree if the expected number of misclassifications of future records does not increase significantly due to the pruning; CS-Forest instead prunes a tree if the expected classification cost of future records does not increase significantly due to the pruning. Moreover, unlike the Cost-Sensitive Decision Tree (CS-Tree), CS-Forest allows a tree to first grow fully and only then be pruned [40]. Tables 19 and 20, respectively, present the error rate details along with CCI outcomes and the accuracy details achieved by CS-Forest.
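The cost-sensitive idea can be illustrated with a toy expected-cost computation: rather than predicting the majority class at a leaf, choose the class that minimises the expected misclassification cost. The class probabilities and the cost matrix below are invented, and this sketch is only the leaf-level decision rule, not the CS-Forest algorithm itself.

```python
# Cost-sensitive leaf decision: pick the class minimising expected cost.
leaf_probs = {"risky": 0.3, "safe": 0.7}      # class distribution at a leaf
cost = {("risky", "safe"): 10.0,              # predicting safe when risky
        ("safe", "risky"): 1.0,               # predicting risky when safe
        ("risky", "risky"): 0.0, ("safe", "safe"): 0.0}

def expected_cost(predicted):
    # Sum cost over the possible actual classes, weighted by probability.
    return sum(p * cost[(actual, predicted)] for actual, p in leaf_probs.items())

best = min(leaf_probs, key=expected_cost)
print(f"predict {best} (expected cost {expected_cost(best):.1f})")
```

A plain majority rule would predict "safe" here; the asymmetric cost of missing a risky requirement flips the decision to "risky".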

J48 Decision Tree
J48 is an open-source implementation of the C4.5 algorithm. It uses a divide-and-conquer approach and applies pruning while constructing the tree. It is a common method based on information gain (entropy) measures. The result is a tree structure with a root node, intermediate nodes, and leaf nodes; each node holds a decision and helps to obtain the outcome [40,41]. The error rate and accuracy details achieved via J48 are presented in Tables 21 and 22, respectively. We applied 10-fold cross-validation to the J48 decision tree on the risk dataset, and it achieved a CCI of 288 (96.321%). The MAE is 0.018, RMSE is 0.120, RAE is 6.516%, and RRSE is 32.091%.
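A small worked example of the information-gain (entropy) measure that C4.5-style trees such as J48 use to select split attributes. The toy requirement records are invented.

```python
# Information gain of a split = parent entropy minus the size-weighted
# entropy of the child subsets produced by the split.
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Each record: (attribute value "new"/"known", risk class).
data = [("new", "risky"), ("new", "risky"), ("new", "safe"),
        ("known", "safe"), ("known", "safe"), ("known", "risky")]

parent = entropy([y for _, y in data])          # 3 risky vs 3 safe → 1.0 bit
gain = parent
for value in {"new", "known"}:
    subset = [y for a, y in data if a == value]
    gain -= len(subset) / len(data) * entropy(subset)
print(f"information gain of split: {gain:.3f}")  # → 0.082
```

The attribute with the highest gain is chosen as the split at each node; C4.5 additionally normalises gain by the split's own entropy (gain ratio), which this sketch omits.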

Random Forest
RF classifies by aggregating the votes of all the trees in the forest, each tree grown according to the same distribution and the values of a random vector. The random forest algorithm creates decision trees on samples of the data, obtains a prediction from each of them, and finally selects the best solution by means of voting. It is an ensemble method that is better than a single decision tree because it reduces over-fitting by averaging the results [42][43][44]. Table 23 presents the CCI outcomes of RF along with the error rate details, while the accuracy details are presented in Table 24. In each table presenting CCI details, the percentage of CCI achieved by a technique is taken as the accuracy of that technique.
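The two ingredients named above, bootstrap-sampled training sets and majority voting, can be sketched as follows. Each "tree" is replaced here by a trivial majority-class learner purely for illustration; the labels are invented.

```python
# Sketch of random-forest-style bagging: train each "tree" on a bootstrap
# sample (drawn with replacement), then take a majority vote over trees.
import random
from collections import Counter

random.seed(0)
labels = ["high-risk"] * 40 + ["low-risk"] * 60   # invented training labels

def train_tree(sample):
    # Stand-in for growing a decision tree: memorise the majority class.
    return Counter(sample).most_common(1)[0][0]

trees = [train_tree(random.choices(labels, k=len(labels)))  # bootstrap sample
         for _ in range(25)]
prediction = Counter(trees).most_common(1)[0][0]            # majority vote
print(prediction)
```

Because each tree sees a slightly different resampling of the data, their individual errors tend to cancel out in the vote, which is the averaging effect the text refers to.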

Analysis of Experimental Results
In this study, ten different classification techniques are employed to find the best classifier in terms of reducing the error rate and increasing the accuracy of the risk prediction model. Each technique is evaluated on different evaluation measures: MAE, RMSE, RAE%, and RRSE% are used to evaluate the error rate, while precision, recall, F-measure, MCC, ROC area, PRC area, and CCI are used to evaluate the accuracy of each individual technique. This section is divided into two subsections, analysis phase 1 and analysis phase 2. The first analyses all error rate measures, while the second presents the analysis of all accuracy measures.
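Since RAE and RRSE are less common than MAE and RMSE, a short sketch may help: both relate a model's errors to those of a trivial predictor that always outputs the mean of the actual values. The numbers below are invented.

```python
# RAE: total absolute error relative to the mean predictor's absolute error.
# RRSE: root of the squared error relative to the mean predictor's.
import math

def rae(actual, predicted):
    mean = sum(actual) / len(actual)
    num = sum(abs(p - a) for a, p in zip(actual, predicted))
    den = sum(abs(mean - a) for a in actual)
    return num / den

def rrse(actual, predicted):
    mean = sum(actual) / len(actual)
    num = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    den = sum((mean - a) ** 2 for a in actual)
    return math.sqrt(num / den)

actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.9]
print(f"RAE  = {rae(actual, predicted):.1%}")   # → 12.5%
print(f"RRSE = {rrse(actual, predicted):.1%}")  # → 11.8%
```

Values below 100% mean the model beats the mean predictor, so the small RAE and RRSE percentages reported in this study indicate large improvements over that baseline.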

Analysis Phase 1
In this phase, each technique is analysed using the error rate measures; the results are visualized in Figures 3 and 4 with standard deviation bars. Standard Deviation (SD) is a measure of the dispersion of the data from the mean: it is a property of a variable that gives an impression of the range over which its values scatter. Error bars often represent one SD of uncertainty; since the magnitudes can differ, the quantity chosen should be specified explicitly in the graph or supporting text. Typically, error bars display the SD because it is the simplest way to quantify variability: essentially, it tells us how much the values in each group tend to deviate from their mean.
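A minimal example of the standard deviation behind such error bars, computed on invented per-fold scores:

```python
# The SD of the per-fold scores quantifies how much they deviate from
# their mean; this is the length of a one-SD error bar.
import statistics

fold_scores = [0.96, 0.97, 0.95, 0.98, 0.97, 0.96, 0.99, 0.97, 0.96, 0.98]
mean = statistics.mean(fold_scores)
sd = statistics.stdev(fold_scores)   # sample standard deviation (n - 1)
print(f"mean = {mean:.3f}, SD = {sd:.3f}")  # → mean = 0.969, SD = 0.012
```

The bar drawn around each column in the figures would span from mean − SD to mean + SD.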

Analysis Phase 2
In this phase, all accuracy measures are analysed. Table 26 presents the performance of each technique evaluated via precision, recall, and F-measure. In this table, the first column represents the evaluation measures while the second column represents the employed techniques; the remaining columns present the outcome of each individual technique under each measure. The outcomes show that DT, DTNB, and CDT outperform the rest of the employed techniques and produce the same result, 0.980, on each evaluation measure. For better understanding, these results are also presented in Figure 5 with columns and SD bars. The evaluation through MCC, ROC area, and PRC area is presented in Table 27. These outcomes show that, when each technique is evaluated through MCC, DT, DTNB, and CDT outperform the other techniques and generate the same result, 0.975. However, on ROC area and PRC area, RF outperforms the other employed techniques, generating 0.994 and 0.975, respectively. The columns in Figure 6 elaborate the analysis of this phase. CCI is considered to be the accuracy of each individual employed technique. The dataset used in this research consists of a total of 299 instances. Table 28 lists the CCI achieved by each individual technique: the second column shows the techniques, the third column the CCI details, and the fourth column the accuracy as a percentage. The overall analysis of the table shows that DT, DTNB, and CDT beat all the other employed techniques in terms of accuracy, each achieving 97.993%. The CCI analyses are further shown in Figure 7 with the help of a bar chart for better understanding.
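MCC can be computed directly from confusion-matrix counts, as the following small example on invented binary labels shows:

```python
# Matthew's Correlation Coefficient from TP/TN/FP/FN counts:
# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
import math

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print(f"MCC = {mcc:.3f}")  # → MCC = 0.500
```

Unlike plain accuracy, MCC stays balanced when the classes are imbalanced, which is why it complements the CCI-based accuracy reported above.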

Conclusions and Future Research Directions
Requirement risk prediction is an active research area with increasing contributions from the research community. This research aimed to explore ML techniques for the first time for requirement risk prediction using a new dataset. It contributed as follows: ten ML techniques were explored for requirements risk prediction, and a detailed comparison of these techniques for the proposed model was performed.
Requirement risk prediction has a great impact on the success or failure of a software project. To address this, a model was needed to predict risk early in the project; therefore, a model and an optimal classifier are proposed.
Ten different ML techniques are employed, compared, and evaluated in this research with respect to reducing the error rates and increasing the accuracy of the proposed model. Among all these techniques, CDT performed best in terms of overall accuracy, higher CCI, and lower MAE and RMSE. The experiments show CDT achieving the lowest error rates: an MAE of 0.013, RMSE of 0.089, RAE of 4.498%, and RRSE of 23.741%. In terms of accuracy, DT, DTNB, and CDT perform equally well, each achieving 0.980 for precision, recall, and F-measure, 0.975 for MCC, and 97.993% accuracy. We also observed that CDT was best in terms of CCI, i.e., 293 correctly classified instances (97.993% accuracy). These results show that CDT can claim to be the most suitable and optimal classifier for the prediction of software requirement risks.
The comprehensive outcomes of this study can be used as a reference point for other researchers: any assertion concerning an enhancement in prediction through a new model, technique, or framework can be benchmarked and verified against them. Some future directions are also evident. Accuracy may be further improved by employing other classification and pre-processing techniques in the proposed model. The class imbalance issue ought to be addressed on this dataset. Furthermore, to improve performance, feature selection and ensemble learning techniques should also be explored. This research can be further validated using different assessment measures such as Mean Magnitude of Relative Error (MMRE), PRED, etc. The dataset contains 299 instances from four different sources and can be enhanced with new requirements from other software project sources, which may also bring new challenges to the prediction of risks at the requirements gathering phase.