An Ensemble Machine Learning Technique for Functional Requirement Classification

Abstract: In Requirement Engineering, software requirements are classified into two main categories: Functional Requirement (FR) and Non-Functional Requirement (NFR). FR describes user and system goals. NFR includes all constraints on services and functions. Deeper classification of those two categories facilitates the software development process. There are many techniques for classifying FR; some of them are Machine Learning (ML) techniques, and others are traditional. To date, the classification accuracy has not been satisfactory. In this paper, we introduce a new ensemble ML technique for classifying FR statements to improve their accuracy and availability. This technique combines different ML models and uses enhanced accuracy as a weight in the weighted ensemble voting approach. The five combined models are Naïve Bayes, Support Vector Machine (SVM), Decision Tree, Logistic Regression, and Support Vector Classification (SVC). The technique was implemented, trained, and tested using a collected dataset. The accuracy of classifying FR was 99.45%, and the required time was 0.7 s.


Introduction
There are different definitions of requirements in different books and manuals. Considering the definition of the Institute of Electrical and Electronics Engineers (IEEE) standards, a requirement is a capability or condition needed by a user or a system to satisfy an objective [1].
Software requirement classification affects the other activities of Software Development (SD). For example, prioritization (the filtering of relevant requirements) is facilitated by effective classification [2]. Most requirements are classified using two main categories: Functional Requirement (FR) and Non-Functional Requirement (NFR). Sommerville defined FR as the services that are expected from a system and its reactions to certain inputs. FR describes both user and system goals. NFR includes all constraints on services and functions [3]. Deeper classification of these two categories can facilitate the SD process [3]. In the reviewed papers, NFR is divided into between four and eleven classes. These classes include maintainability, operability, performance, security, usability, and reliability.
Machine Learning (ML) classifiers have gained wide importance, and not only in the software engineering field. ML classifiers have recorded the best results in different fields compared with other classification techniques. In [4], different methods and techniques for forecasting electricity prices were reviewed, grouped, and discussed in detail. Computational intelligence tools based on ML techniques and Artificial Intelligence (AI) algorithms, such as Support Vector Machine (SVM), were described as the best among the statistical methods for modeling the features of electricity pricing. Another area, Innovation Capability (IC), where ML showed a significant impact on result improvement and reduced the number of variables required to make decisions, is described in [5]. Regularized Least Squares, Deep Neural Networks, and Random Forests were applied in the experiments to data from the 28 members of the European Union (EU). The three algorithms reduced the complexity of the analysis and kept the focus on the required features to produce powerful predictions.
NFR statement classification has been a main research concern for a long time. ML techniques have been adopted for this purpose both widely and successfully. Classification has been tested using many effective algorithms, such as Naïve Bayes, SVM, Decision Tree, and Support Vector Classification (SVC). Furthermore, the datasets for this purpose are available for subsequent researchers. The results of these techniques are promising, as they achieved over 90% accuracy in some experiments according to the reviewed published works on NFR classification [2,6-17].
On the other hand, FR has not gained as much attention from researchers as NFR, and FR is featured in fewer published papers than NFR [18]. One of the best practices in FR classification is presented in [19], where FR was classified into six different classes as follows:


• Solution requirements: This type describes the actions that must be carried out by the system or the action that is carried out by the system or the user.
• Enablement requirements: This class determines the capabilities offered to the user by the system. It may or may not determine the subsystem that offers the capability.
• Action Constraint requirements: This class describes the actions that are allowed or not allowed for the system or subsystem. This class may also determine business rules that control some actions in the system.
• Attribute Constraint requirements: This class is related to constraints on attributes or entity attributes.
• Definition requirements: This class is used to define entities.
• Policy requirements: This class specifies the policies that the system must follow.
The classes described above were shown to satisfy different stakeholders in [19]. Moreover, these classes were used by subsequent researchers successfully in [20].
As manual classification consumes time and requires effort from both analysts and experts, many published papers have tried to automate the task of classification, either via traditional software solutions or using ML models [21]. This paper introduces an enhanced technique for weighted ensemble voting in ML to classify FR into multiple classes. The proposed ensemble approach takes the accuracy per class from the confusion matrices of the base ML classifiers and stores it in a numerical matrix for each classifier. The best result found is then passed as a weight to the weighted ensemble voting classifier.
The rest of the paper is organized as follows. Section 2 features a review of related works. Section 3 provides the materials and methods. Experimental details are outlined in Section 4. Then, the results are detailed in Section 5, followed by a discussion in Section 6. Finally, the work is concluded in Section 7.

Related Work
This section summarizes the previous works on classifying FR using traditional manual techniques or automated techniques and the works that utilized ML techniques to classify FR to various categories.

Traditional Techniques of Classification
In [19], the authors aimed to analyze software requirements using a tool called a Requirement Analysis Tool (RAT). This tool is intended for use by different stakeholders, such as end users and analysts. FR statements were categorized under different classes: solutions, enablement, action constraints, attribute constraints, definitions, or policy requirements. This method used the Lexical analyzer for tokenization and classification. Then, it used a syntactic analyzer. Although the classification and analysis were conducted using a traditional methodology, the results were promising. This method resulted in a decrease of 30%-50% in the required time to review the requirements.
H. Elazhary (2011) adopted the RAT for use in translation between English and Arabic software requirements and developed the Arabic Requirement Analysis Tool (ARAT). Tokenization and classification conducted using a lexical analyzer were then followed by the use of a syntactic analyzer. Many ambiguity issues were resolved by translating software requirements between English and Arabic [20].
A. Ghazarian (2012) categorized FR into different classes: data input, data output, data validation, business logic, data persistence, communication, event trigger, user interface, user interface navigation, user interface logic, external call, and external behavior. The Requirements Research Repository (RRR) was used for this study, as it had been used successfully in two previous experiments for the same purpose. The data used in the testing included 15 software projects with 1236 functional requirements. The results showed the percentage of each class. The highest percentage, 26.37%, was observed for the data output class [3].
The authors in [22] classified FR as solutions, enablement, action constraints, attribute constraints, definitions, or policy requirements. The authors used Python version 3.6 to implement an ambiguity prevention tool. The methodology used Finite State Machines (FSMs) to classify each requirement according to the syntax. The authors concluded that classification and transformation are not straightforward.
A. Martinez, M. Jenkins, and C. Quesada-Lopez (2019) sought to identify security requirements from FR through several experiments, such as user activation mapped to security templates featuring authorized access, confidentiality during storage, confidentiality during transmission, unique accounts, and logging authentication events. The experiments were applied to 33 graduate students from the University of Costa Rica divided into three groups. Some conducted the activity online, while the others engaged in the activity offline. Participants were given explanations using videos and presentations. The time taken ranged from 20 min to 194 min. The quality was measured using a scale from one to five, on which the results ranged from 2.66 to 4.57 [23].

ML Techniques of Classification
Software requirement specifications written in the Japanese language were used as a dataset for classification in [18]. FR statements were classified into four categories: requirements for the user interface, requirements for the database, requirements for system functions, and requirements for the external interface. On the other hand, NFR statements were classified according to eight main categories, and each main category was divided into subcategories according to ISO/IEC 25030:2007, a standard of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). After using a Convolutional Neural Network (CNN), the FR had the best results, with 0.89 precision, 0.94 recall, and a 0.91 F-score.
FR statements were classified into seven classes in [24] using Grounded Theory (GT): external communication, business constraints, business workflow, user interactions, user privileges, user interface, and entity modelling requirements. Supervised learning was applied via ML algorithms, including Naïve Bayes, Bayes net, K-Nearest Neighbourhood, and Random Forest. The number of FR statements in the eight documents ranged from 208 to 6187. Precision ranged from 0.39 to 0.60, recall ranged from 0.40 to 0.71, and the F-measure ranged from 0.47 to 0.60. About 450 FR statements selected from the insurance domain were used to test and classify FR into eleven categories: audit trail, batch processing, localization, communication, payments, print, report, search, third party interactions, and workflow. The Multinomial Naïve Bayes classifier recorded a recall ranging from 28% to 90% and precision ranging from 50% to 100% [25].
Conceptual clustering is used to classify the FR statements of any project based on previous similar projects. A case study on power supplies was used as an example of applying this approach. FR statements have been classified based on safety requirements, electrical specifications, general specifications, power failure detection for the Federal Communication Commission (FCC), and mechanical specifications, regulations, and ripples. This approach offers the opportunity to utilize the historical knowledge of experts on FR patterns [26].
Since the time needed for classification has been generally improved using automation, this time could be improved further by utilizing ML as in previous studies in general software requirement classification. Ensemble modeling in ML has been utilized only rarely for software requirement classification. In addition, as observed in the reviewed papers, classifying FR to further classes is not as much of a concern as classifying NFR. Thus, this research aims to enhance accuracy as a weight in the weighted ensemble voting approach to classify FR into six classes that were used in previous works [19,20,22] successfully implementing traditional automation approaches.

Materials and Methods
This section describes the dataset collection process, the dataset itself, and the proposed methodology to be applied to the collected dataset to achieve the desired objectives.

Data Collection Description
The dataset was prepared by the authors since there was no available dataset with an acceptable number of FR statements to be classified to the selected six classes. To prepare the dataset, the following activities were undertaken:


• Searching for available sets of FR statements or software specification documents that were offered for research purposes. Several folders contained a number of software specification documents available online [27][28][29][30].
• Collecting a reasonable number of FR statements in a spreadsheet with two columns: requirements and class. We used a total of 600 FR statements that include the same number of each desired class to ensure that the dataset is balanced.
• Labeling each FR according to the syntax of each chosen class as solution, enablement, action constraint, attribute constraint, definition, or policy requirements.
The final dataset is 600 sentences; each is an FR statement that belongs to one specific class. These statements are kept in a spreadsheet and represent the first column that is named "Requirement". The label of each statement is written in the second column that is named "Label".

Methodology
The architecture illustrated in Figure 1 consists of several elements:

Data Pre-Processing
Pre-Processing involves translating an input requirement into a form that can be processed and passed onto ML or Deep Learning (DL) models using algorithms. ML is related to mathematical intelligence, as the input is determined by its syntactic elements, such as verbs and nouns, or their semantics as a group or domain, which defines variance in meaning. ML, an Artificial Intelligence (AI) application, supports processing tasks using algorithms and classifiers along with training sets for data and testing sets for data [31].
Familiar text preprocessing includes tokenization, case folding, stop word removal, stemming, and transformation. Input data for this research needed to be prepared as a suitable input for the selected base ML classifiers and the ensemble [32]. The pre-processing steps are mostly the same for all base classifiers. As data pre-processing depends on the nature of the research, light preprocessing was used in this study, as shown in Figure 2.


• Tokenization is defined as separating the input data into tokens. A token is a group of letters joined with a semantic meaning with no need for further processing. Different tokenization methods can be applied to a text, so it is important to use the same technique for all texts used in an experiment [32].
• Case folding is the process of unifying the cases of the letters in the entire text, but there can be some ambiguity if uppercase letters are used to distinguish different abbreviations [32].
• Stop words are the parts of sentences with negative effects on multiclassification problems. Stop words include prepositions, pronouns, adverbs, and conjunctions [32].
• Stemming refers to extracting the morphological root of a word. Several different techniques are used for this process, including lemmatization, the use of semi-automatic lookup tables, and suffix stripping [32].
• The last step is transformation, which involves using word frequency to provide a score or identification (ID).
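The first four steps listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the stop-word set and the suffix list are hypothetical and deliberately tiny, and the suffix stripper is far cruder than the stemmers or lemmatizers a real pipeline would use.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "to", "by", "and", "or", "in", "on", "it", "its"}

# Hypothetical suffix list for naive suffix stripping.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split the text into alphanumeric tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, fold case, drop stop words, then stem."""
    tokens = tokenize(text)
    tokens = [t.lower() for t in tokens]                  # case folding
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [stem(t) for t in tokens]                      # stemming


tokens = preprocess("The system shall allow users to reset passwords")
```

The transformation step (turning the resulting tokens into numbers) is left out here on purpose, since it depends on the chosen vectorizer.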
The Term Frequency-Inverse Document Frequency (TF-IDF) score for each token is calculated from the frequency of the token in a document and its inverse document frequency across all documents. Another successful technique that works similarly to TF-IDF is the Countvectorizer, which returns a numerical vector whose length equals the number of distinct terms in the raw data. The transformation is the only difference between TF-IDF and the Countvectorizer: the Countvectorizer returns an integer count of the number of appearances of each term in the input, while TF-IDF returns real-valued scores (IDs) as a result of multiple calculations [33].
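To make the contrast between integer counts and TF-IDF scores concrete, the following sketch computes both for two toy documents. It assumes the unsmoothed textbook weighting tf × log(N / df); library implementations such as scikit-learn's apply smoothing and normalization, so their numbers will differ. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def count_vectors(docs):
    """Return the sorted vocabulary and one integer count vector per document."""
    vocab = sorted({term for doc in docs for term in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tfidf_vectors(docs):
    """Weight each count by tf * log(N / df), an unsmoothed textbook variant."""
    vocab, counts = count_vectors(docs)
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    vectors = []
    for row in counts:
        total = sum(row) or 1  # guard against empty documents
        vectors.append(
            [(c / total) * math.log(n_docs / df[j]) for j, c in enumerate(row)]
        )
    return vocab, vectors


docs = [
    ["system", "shall", "store", "data"],
    ["user", "shall", "enter", "data", "data"],
]
vocab, counts = count_vectors(docs)
_, weights = tfidf_vectors(docs)
```

Terms that occur in every document (here "data" and "shall") receive a TF-IDF weight of zero under this variant, while their raw counts stay positive.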

ML Classifiers
SVM is a model that combines the power of conventional theoretical statistical methods with analytical simplicity. Moreover, it works well even on small datasets. Linear SVM classifiers were used in this experiment. We selected linear SVM over non-linear SVM because it needs less training time owing to the low complexity of its calculations. Furthermore, linear classifiers suit high-dimensional data applications and do not require more features to be added [34].
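This algorithm is simple and can be described according to the following variables: training samples $x_i \in \mathbb{R}^p$, $i = 1, \dots, n$, and target labels $y_i \in \{-1, +1\}$.
The condition is set as follows:
$$y_i (w \cdot x_i + b) \geq 1, \quad i = 1, \dots, n,$$
where $w$ is the weight vector and $b$ is the bias. The non-linear equation is represented as
$$\min_{w, b} \ \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \geq 1.$$
In the above equation, the separating hyperplane that comes between the two parallel hyperplanes $w \cdot x + b = \pm 1$ is
$$w \cdot x + b = 0,$$
with a margin width of $2 / \|w\|$.
The decision of the classifier follows the formula
$$f(x) = \operatorname{sign}(w \cdot x + b).$$
The final SVM function is
$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{n} \alpha_i y_i (x_i \cdot x) + b \right),$$
where $\alpha_i$ are the Lagrange multipliers of the training samples.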
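The decision rule and margin described above can be illustrated with a minimal sketch, assuming the standard linear form sign(w · x + b) and a margin of 2 / ‖w‖ between the two parallel hyperplanes. The weight vector and bias below are hypothetical values chosen for easy arithmetic, not parameters learned from the paper's data.

```python
import math


def decision_value(w, b, x):
    """Signed score w . x + b of a sample x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """Return +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin_width(w):
    """Distance between the hyperplanes w . x + b = +1 and w . x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical parameters, assumed for illustration only.
w, b = [3.0, 4.0], -1.0
label_a = predict(w, b, [1.0, 1.0])
label_b = predict(w, b, [0.0, 0.0])
width = margin_width(w)
```

A shorter weight vector gives a wider margin, which is why training minimizes the norm of w while keeping every training sample on the correct side.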

Naïve Bayes Model
The Naïve Bayes model is a machine learning classification model that is known for its independence assumption. This means that, given the class, the probability of one attribute is not affected by the other attributes. It has been reported that the results of the Naïve Bayes classifier are usually accurate. However, the Naïve Bayes classifier can underperform due to issues caused by training data noise, variance, and bias [35].
According to the explanation of the algorithm in [36], the features are presented as a vector $X = (X_1, \dots, X_n)$, where each feature $X_j$ takes values from its domain $D_j$ and lowercase $x = (x_1, \dots, x_n)$ represents the value of the vector. The unobserved class $C$ takes one of $m$ values, $c \in \{0, \dots, m-1\}$, and is obtained by $g(x)$, where $g(x) = C$. The classifier selects the class with the largest discriminant function $f_c(x)$:
$$h(x) = \arg\max_{c \in \{0, \dots, m-1\}} f_c(x).$$
The Naïve Bayes discriminant function is
$$f_c^{NB}(x) = P(C = c) \prod_{j=1}^{n} P(X_j = x_j \mid C = c).$$

Logistic Regression
The predicted values for the dependent variable are found between 0 and 1 by applying the following regression formula [37]:
$$p = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}}.$$
The following formula is an example of transforming the probability $p$ of the dependent variable:
$$p' = \ln\left( \frac{p}{1 - p} \right),$$
where $p'$ is any value ranging between $-\infty$ and $+\infty$. Values transformed in this way are used in ordinary linear regression, and the final equation is the following:
$$p' = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k.$$

Decision Tree
The Decision Tree is used to support decision making problems with hierarchical structures. The first node is called the root node, and the following nodes represent the other features until the structure's final nodes are reached; these are called leaf nodes and represent the target classes/labels. Adjacent nodes are linked by branches. The result of the test at a node links it to a node on the next level. The Decision Tree has been described as a strong algorithm, as it accepts both numerical and nominal features to guarantee the inclusion of all features. Furthermore, large and small datasets can be managed. The reported weaknesses when using large trees (such as biased decisions) do not apply to this research, as our dataset is not very large [38].
Standard deviation measures the error at each node split, and the resulting Standard Deviation Reduction (SDR) is calculated using the following formula [39]:
$$SDR = \frac{m}{|T|} \times \beta(i) \times \left[ sd(T) - \sum_{j \in \{L, R\}} \frac{|T_j|}{|T|} \times sd(T_j) \right],$$
where $T$ represents the samples that reach the node, $m$ is the number of samples with no missing values for the parameter, $\beta(i)$ is the correction factor, and $T_L$ and $T_R$ are the left and right child node sets produced by splitting on the specific parameter [39].
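As a rough illustration of the SDR idea, the sketch below scores a candidate split by how much it lowers the standard deviation of the target values. It makes two simplifying assumptions that are not taken from the paper: there are no missing values (so m equals the number of samples at the node) and the correction factor is 1. The function name and the sample values are hypothetical.

```python
from statistics import pstdev


def sdr(parent, left, right):
    """Standard Deviation Reduction of splitting `parent` into `left` and `right`.

    Simplifying assumptions: no missing values, correction factor of 1.
    """
    n = len(parent)
    weighted_children = sum(
        len(child) / n * pstdev(child) for child in (left, right)
    )
    return pstdev(parent) - weighted_children


# A split that separates the two value groups perfectly ...
good_split = sdr([1, 1, 5, 5], [1, 1], [5, 5])
# ... versus a split that leaves both children as mixed as the parent.
poor_split = sdr([1, 1, 5, 5], [1, 5], [1, 5])
```

The split with the larger reduction is preferred, since its children are more homogeneous than the parent node.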

Support Vector Classification (SVC)
In [40], SVC for binary classification problems was analyzed for use in developing a novel approach called Longitudinal SVC. This new approach was successfully designed to enhance performance. Linear SVC aims to classify the input data by fitting a hyperplane to the features passed to the classifier. The hyperplane can be linear or nonlinear according to the data. The kernel parameter selects the type of hyperplane, while gamma is a value related to nonlinear kernels. A gamma increase means a closer fit to the training data; however, in some cases, it leads to overfitting. A penalty value called C controls the tradeoff between the correct classification of training points and the smoothness of the decision boundary. Likewise, an increase in C may lead to overfitting. In the case of the nonlinear 'poly' kernel, the degree value is needed. This value can be set to 1, which indicates a linear hyperplane. An increase in this value will lead to an increase in training time [41,42]. The following formula is used for the SVC algorithm [43]. The training data are represented as $x_i \in \mathbb{R}^p$, $i = 1, \dots, n$, with labels $y \in \{1, -1\}^n$, and SVC solves the soft-margin problem
$$\min_{w, b, \zeta} \ \frac{1}{2} w^T w + C \sum_{i=1}^{n} \zeta_i \quad \text{subject to} \quad y_i (w^T \phi(x_i) + b) \geq 1 - \zeta_i, \ \zeta_i \geq 0, \ i = 1, \dots, n.$$

Building a Confusion Matrix for Each Classifier
A confusion matrix is used to display a summary of each classifier's performance. Each row represents the predicted class, and each column represents the actual class. The diagonal elements are the numbers of correct predictions for each class; dividing a diagonal element by its column sum gives the recall for that class, and dividing it by its row sum gives the precision [44].
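A minimal sketch of such a matrix follows, keeping the orientation stated above (rows are predicted classes, columns are actual classes). It uses the usual definitions, which are an assumption about what the text intends: recall is the diagonal count over the column total, and precision is the diagonal count over the row total. The labels are made-up toy data.

```python
def confusion_matrix(actual, predicted, n_classes):
    """Rows index the predicted class, columns the actual class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[p][a] += 1
    return matrix


def per_class_recall(matrix):
    """Diagonal count divided by the column total (samples of that actual class)."""
    n = len(matrix)
    recalls = []
    for c in range(n):
        column_total = sum(matrix[r][c] for r in range(n))
        recalls.append(matrix[c][c] / column_total if column_total else 0.0)
    return recalls


def per_class_precision(matrix):
    """Diagonal count divided by the row total (samples predicted as that class)."""
    return [
        row[c] / sum(row) if sum(row) else 0.0 for c, row in enumerate(matrix)
    ]


# Toy labels for three classes, assumed for illustration only.
actual = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(actual, predicted, 3)
recall = per_class_recall(cm)
precision = per_class_precision(cm)
```

Note that some libraries, including scikit-learn, use the transposed layout (rows are actual classes), so the row and column roles swap when reading their output.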

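Calculating the Accuracy for Each Base Classifier
The accuracy represents the correctly classified data in proportion to the total data according to the following formula [45]:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$
where $TP$, $TN$, $FP$, and $FN$ are the numbers of true positives, true negatives, false positives, and false negatives, respectively.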

Generating the Numerical Matrix
As enhancing the proposed ensemble classifier depends on the accuracy per class for each classifier, that accuracy needs to be calculated so that the best classifier for each class can be determined. This process depends on the confusion matrices of all classifiers: the best classifier for a class is the one that makes the highest number of correct predictions for that class in its confusion matrix. To determine this classifier, a matrix was created and filled using the confusion matrices of the base classifiers. Figure 3 gives an example of how the matrix is filled from the confusion matrices. The classes are given numbers from 0 to 5, and the two classifiers are named A and B.
In the example in Figure 3, the best classifier for class 0 is classifier B, while the best for class 1 is classifier B. For classes 2 and 3, both classifiers are equal. Thus, one of them will be chosen randomly. For class 4, classifier A is best, while for class 5, classifier B is best.
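The filling procedure described above can be sketched as follows. The two confusion matrices are invented three-class examples, not the values of Figure 3, and the helper names are hypothetical. One deliberate difference from the text: ties are resolved in favor of the first classifier rather than randomly, so that the sketch is deterministic.

```python
def diagonal(conf_matrix):
    """Correct predictions per class, read off the confusion-matrix diagonal."""
    return [conf_matrix[c][c] for c in range(len(conf_matrix))]


def best_classifier_per_class(conf_matrices):
    """Build the numerical matrix and pick the best classifier for each class.

    `conf_matrices` maps a classifier name to its confusion matrix.
    Ties go to the first name in insertion order (the text picks randomly).
    """
    names = list(conf_matrices)
    numeric = {name: diagonal(conf_matrices[name]) for name in names}
    n_classes = len(numeric[names[0]])
    best = [
        max(names, key=lambda name: numeric[name][c]) for c in range(n_classes)
    ]
    return numeric, best


# Invented matrices for two classifiers and three classes.
matrices = {
    "A": [[10, 1, 0], [0, 8, 1], [0, 1, 9]],
    "B": [[12, 0, 2], [0, 8, 1], [0, 2, 7]],
}
numeric, best = best_classifier_per_class(matrices)
```

Here classifier B wins class 0, the tie on class 1 falls to A, and A wins class 2.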

Ensemble Classifier:
There are different existing ensemble approaches that combine ML classifiers to improve accuracy. The following are some of the most common models:

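Mean Ensemble Voting
In this type, the ensemble finds the average decision of all base classifiers according to the following equation adopted from [46]:
$$\hat{y} = \arg\max_{i} \frac{1}{m} \sum_{j=1}^{m} p_{ij},$$
where $m$ is the number of base classifiers and $p_{ij}$ is the probability that classifier $j$ assigns to class $i$.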

Weighted Ensemble Voting
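To predict the class label $\hat{y}$ considering the weight $w_j$ related to classifier $C_j$, the following formula is used:
$$\hat{y} = \arg\max_{i} \sum_{j=1}^{m} w_j \, \chi_A\big(C_j(x) = i\big),$$
where $\chi_A$ is the characteristic function $[C_j(x) = i \in A]$, and $A$ is the set of class labels.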
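Weighted voting as described above amounts to summing, for every label, the weights of the classifiers that voted for it and returning the label with the largest total. The sketch below shows this for hard (label) votes; the function name and the example weights are hypothetical.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight among the votes.

    predictions[j] is the label chosen by classifier j, weights[j] its weight.
    """
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Two weak classifiers vote for class 0, one strong classifier for class 1.
weighted_choice = weighted_vote([0, 0, 1], [0.2, 0.2, 0.6])
# With equal weights the same votes reduce to a plain majority vote.
majority_choice = weighted_vote([0, 0, 1], [1.0, 1.0, 1.0])
```

The example shows why the weights matter: the same three votes yield class 1 under the unequal weights and class 0 under equal ones.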

Accuracy as Weight Ensemble Voting
This approach works the same as weighted ensemble voting, except that the weight of each base classifier is replaced by its accuracy, which is calculated by
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$

Proposed Ensemble Voting
The type of ensemble adopted and modified in this study is Accuracy as Weight Ensemble Voting. In the existing approach, the weight reflects the overall accuracy of each classifier. The enhancement determines the accuracy of each classifier for each class and thereby determines the best classification of each input. This model uses confusion matrices and a numerical array to store the numbers of accurate predictions from the confusion matrices. In detail, the significance of the proposed ensemble approach results from using the accuracy of each class or label among all base ML classifiers. Because overall accuracy can be misleading as an absolute measure of performance, it is not used in the classification decisions of the proposed approach. Moreover, due to the differences among base classifiers in the mechanism by which they make decisions, it is difficult to find the best classifier for the whole labelling process. Thus, the strength of each base classifier is utilized in the proposed approach. The following is a description of the approach:
    // find the maximum of each column that will represent the algorithm weight
    W ← max_column(Conf_matrix(i))
    V_result ← Voting_algorithm(Clf(i), W(i))
    Voting_Accuracy ← V_result / Y * 100
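One possible reading of the approach above is sketched below; it is an interpretation under stated assumptions, not the authors' implementation. Each classifier's vote is weighted by that classifier's own accuracy on the class it predicts, taken from its confusion matrix (rows are predicted classes, columns are actual classes). All function names and the numeric values are hypothetical.

```python
def per_class_accuracy(conf_matrix):
    """Correct predictions for each class divided by that class's actual total."""
    n = len(conf_matrix)
    accuracies = []
    for c in range(n):
        total = sum(conf_matrix[r][c] for r in range(n))
        accuracies.append(conf_matrix[c][c] / total if total else 0.0)
    return accuracies


def class_weighted_vote(predictions, class_acc):
    """Each classifier votes with its own accuracy on the class it predicts.

    predictions[j] is the label from classifier j; class_acc[j][c] is the
    per-class accuracy of classifier j on class c.
    """
    totals = [0.0] * len(class_acc[0])
    for j, label in enumerate(predictions):
        totals[label] += class_acc[j][label]
    return totals.index(max(totals))


def voting_accuracy(all_predictions, y_true, class_acc):
    """Percentage of samples whose voted label matches the true label."""
    hits = sum(
        class_weighted_vote(preds, class_acc) == y
        for preds, y in zip(all_predictions, y_true)
    )
    return hits / len(y_true) * 100


# Assumed per-class accuracies of three classifiers on two classes.
class_acc = [[0.9, 0.5], [0.6, 0.95], [0.7, 0.6]]
first = class_weighted_vote([0, 1, 0], class_acc)
second = class_weighted_vote([1, 1, 0], class_acc)
score = voting_accuracy([[0, 1, 0], [1, 1, 0]], [0, 0], class_acc)
```

Compared with a single weight per classifier, this lets a classifier that is strong on one class and weak on another contribute accordingly to each decision.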

Dataset
In the experiment, the dataset is a list of functional software requirements used in previous real-world projects and found online for research purposes. These requirements were selected according to specific criteria: they should be free of spelling mistakes or typos, and the number of requirements that belong to each class should be equal to obtain a balanced dataset. The total number of FR statements used is 600, as suggested in previous research on software requirement classification, which means that there are 100 FR statements from each class. Moreover, the small size of the dataset is not considered a problem with ensemble techniques [47]. The overfitting problem that is usually caused by the small size of a dataset was addressed through the careful selection of the parameters of each ML classifier, as suggested in [48]. To the best of our knowledge, this is the first dataset that includes FR statements labeled with six different classes. The data were preprocessed according to the methodology explained in the proposed model section and were then split into training and testing groups, with the largest portion given to training (70%). This percentage was selected according to the number of classes being used and the size of the dataset. We sought to ensure that the dataset was balanced from the beginning (i.e., containing the same number of FR statements from each class) and remained balanced after the split, as 70% training involves 70% from each class and 30% testing involves 30% from each class. This percentage was selected to ensure that the classifiers were trained sufficiently on all classes and learned the syntax of each, while leaving enough data to test the accuracy of each classifier and ensure that it performs well. Moreover, to confirm the assumed ideal split in practice, a set of experiments on different data split percentages was conducted.
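The class-balanced 70/30 split described above can be sketched with the standard library alone. The function name, the seed, and the integer class labels are assumptions made for illustration; the idea is simply to split each class separately so that both parts keep the same class proportions.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Six classes with 100 statements each, mirroring the dataset size in the text.
labels = [c for c in range(6) for _ in range(100)]
train_idx, test_idx = stratified_split(labels)
```

With 100 statements per class, every class contributes 70 statements to training and 30 to testing, so the balance of the full dataset carries over to both parts.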

Software
Python 3.6 was used for implementing the base classifiers and the ensemble classifier. We used the PyCharm tool, which is user friendly and makes it easy to load datasets in various file formats, such as comma-separated values (CSV).
scikit-learn was also chosen, as it includes many modules that facilitate building ML classifiers and calculating the measures used in the evaluations. It also supports building ensemble models from many types of classifiers.

Hardware
The computer was an ASUS laptop with an x64 Intel(R) Core(TM) i7-9750H processor and 17.0 GB of RAM, running a 64-bit Windows operating system.

Results
This section provides the training settings and the results of the testing phase according to the percentage of the dataset split between training and testing (70%-30%).

Training
The parameters of each ML classifier must be specified, as they affect the results dramatically. For the SVM classifier, the parameters that need to be specified are C (cost), kernel, degree, and gamma. C is a regularization parameter that controls the tradeoff between misclassifying training points and obtaining a large margin. Its value can be 1, 10, 100, or 1000; a small value tolerates violated constraints in exchange for a large margin, and vice versa. The default value is 1.0. The kernel parameter is used to specify the kernel type, which can be 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed', or a callable. The default is 'rbf'. The degree is only considered in the case of a polynomial kernel, and its default value is three. Gamma is the kernel coefficient for the 'rbf', 'poly', and 'sigmoid' kernels, and this value determines the influence of each training example on the decision boundary. The default is 'scale', and it can also be set to 'auto' [49,50].
The Naïve Bayes classifier has different parameters that need to be specified, including alpha, fit_prior, and class_prior. Alpha is the smoothing parameter and can be set to zero when no smoothing is wanted, while the default is 1.0. fit_prior is a Boolean parameter that indicates whether to learn the class prior probabilities, and its default value is True. class_prior specifies the prior probabilities of the classes; when it is given, the priors are not adjusted according to the data, and the default value is None [51].
SVC has more parameters that must be declared, but the main parameters are similar to SVM and are set to their defaults if linear SVC is selected. By default, the kernel is set to 'rbf', gamma is set to 'scale', and C = 1.0 [52].
The decision tree's first parameter is max_depth, which indicates the depth of the tree, where a deeper tree has more splits and captures more information. The range of max_depth is from 1 to 32. min_samples_split is the minimum number of samples required to split a node, which can range from very few samples up to all samples. Furthermore, the smallest required number of samples in each leaf can be defined through the min_samples_leaf parameter. max_features is the maximum number of features considered during a split [53].
The Logistic Regression algorithm also has parameters that must be set. The most important parameter is the regularization parameter C, where $C = 1/\lambda$ and $\lambda$ controls the tradeoff between the complexity and simplicity of the model. A lower value of $\lambda$ means a more complex model, which risks overfitting, and vice versa. The C parameter works in the opposite direction: small values increase the simplicity of the model and can lead to underfitting, while high values increase complexity and fit the data more closely [54].

Testing
To find the best approach for the proposed enhanced ensemble ML classifier, a number of experiments were conducted. The base ML classifiers and the proposed enhanced ensemble approach were tested under different circumstances.
• First, the base ML classifier performance was tested using preprocessing with TF-IDF and Countvectorizer. The results are shown in Figures 4 and 5, which illustrate the confusion matrices for the base classifiers, and Table 1, which summarizes the accuracy and required time for each base classifier in the proposed enhanced ensemble approach.
• Second, different split percentages of training and testing data were tested (50:50, 40:60, 30:70, 20:80, 10:90) to find the ideal split for the ML classifiers. The results are shown in Table 2, which summarizes the accuracy and required time for each base classifier in the proposed enhanced ensemble approach.
• Third, the best three ML classifiers in terms of accuracy were selected to form the proposed ensemble approach, and the results are shown in Figure 6 and Table 3 (accuracy and required time).
• Fourth, the best three base ML classifiers in terms of required time were selected and used to form the proposed ensemble approach. The results are shown in Figure 7 and Table 4.
Both preprocessing techniques, TF-IDF and Countvectorizer, were tested using the five base ML classifiers. The experiments revealed the performance of each preprocessing technique. By comparing the confusion matrices, accuracy, and time, Countvectorizer was shown to outperform TF-IDF in all aspects for all classifiers and the proposed ensemble. Thus, Countvectorizer was selected for preprocessing during the rest of the experiments. The proposed approach achieved 79.8% accuracy using TF-IDF due to the low accuracy of the base ML classifiers, and the required time was 2.17 s. Countvectorizer, however, required less than one second (0.7 s) and achieved 99.45% accuracy.
Different split percentages were tested using the five base ML classifiers to reveal the performance of each classifier under each split. Because the performance attributes include both accuracy and time, the ideal split is the one that gives the best accuracy and time for the majority of the ML classifiers as well as for the proposed ensemble approach. Table 2 shows that a split of 30% testing and 70% training is the best choice, and it was used for the remaining experiments. Other splits, such as 10:90, gave better accuracy for some ML classifiers and for the proposed ensemble, but they increased the time sharply, doubling it in some cases.
We then retained only the three most accurate base ML classifiers, SVM, SVC, and Logistic Regression (with accuracies of 99.0%, 97.0%, and 99.0%, respectively), to monitor the accuracy and elapsed time of the proposed ensemble approach. The accuracy was the same as when using all five base ML classifiers (99.45% in both cases), while the time improved slightly.
Next, the three fastest base ML classifiers, Decision Tree, Logistic Regression, and Naïve Bayes (with recorded times of 0.013962 s, 0.012965 s, and 0.006983 s, respectively), were tested to monitor the effects on the accuracy and time of the proposed ensemble approach. Choosing the fastest ML classifiers regardless of accuracy decreased the accuracy of the proposed ensemble approach dramatically, to 95.05%, while the time improved to 0.037 s. Since time is of secondary concern after accuracy, this configuration failed to meet the research objective. The most accurate classifiers required an acceptable time, whereas the fastest ML classifiers did not reach acceptable accuracy compared with the rest of the base ML classifiers.
Thus, only the three most accurate classifiers allow the proposed ensemble approach to achieve the objective of the research, which is to enhance accuracy in an acceptable timeframe for classifying FR.
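The split experiments described above can be sketched as follows. This is a minimal illustration, assuming a seeded random shuffle and a placeholder list of 100 hypothetical statements; the helper split_dataset is not part of the original work, and in practice each split would be followed by training and timing every classifier.

```python
import random

def split_dataset(samples, test_fraction, seed=0):
    """Shuffle with a fixed seed and return (training, testing) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder data standing in for labelled FR statements.
samples = [f"requirement {i}" for i in range(100)]

# Testing:training percentages examined in the experiments.
for test_pct in (50, 40, 30, 20, 10):
    train, test = split_dataset(samples, test_pct / 100)
    print(f"{test_pct}:{100 - test_pct} -> {len(train)} training, {len(test)} testing")
```

Fixing the seed keeps the partition identical across runs, so that differences in accuracy and time can be attributed to the split percentage rather than to a different shuffle.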

Experimental Section
This section compares the proposed ensemble approach (using the most accurate ML classifiers after conducting experiments) with existing ensemble approaches.


The first comparison is based on time and accuracy. Table 5 lists the accuracy and time of some existing ensemble approaches alongside the proposed ensemble. The proposed approach reached an accuracy of 99.45%, the highest result. Its time was not the best, but it was no longer than the worst recorded time and remained acceptable as a classification response time for automated tools. The lowest ensemble accuracy, about 97%, was obtained by the accuracy-as-a-weight approach, which depends on the overall accuracy of the base classifiers, a parameter that is misleading when used alone. In that case, it provides sufficient accuracy to serve as a weight in voting. The mean ensemble gave the same accuracy, which is problematic because it ignores accuracy completely and relies on the average of the base classifiers' decisions. The ensemble that uses the importance of the classifiers as weights offered a better accuracy of 98.35%, while the mean ensemble provided the best time, 0.001001 s.
The ROC metric is used to compare the output quality of different classifiers. The Y axis represents the true positive rate and the X axis represents the false positive rate. A larger area under the curve (AUC) indicates better quality [55]. According to Figure 8, the proposed ensemble approach has the best ROC curve and the largest AUC (0.96) among the compared ensemble approaches.
Table 6 theoretically compares the proposed ensemble approach with existing state-of-the-art methods that classify FR into different classes. These works have been classified according to the above sections as traditional approaches or ML approaches.
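The ROC construction described above (true positive rate against false positive rate, summarized by the AUC) can be sketched for the binary case. The labels and scores below are hypothetical; for the six-class problem considered here, a one-vs-rest treatment per class would be one possible extension, which is an assumption rather than a detail stated in this section.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) pairs obtained by lowering the decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all samples tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical binary labels (1 = positive) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auc(roc_points(labels, scores)), 4))
```

An AUC of 1.0 corresponds to a classifier that ranks every positive sample above every negative one, and 0.5 corresponds to random ranking.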

Discussion
The main difficulty in building the necessary dataset was collecting a sufficient number of valid FR statements that fit the syntax of the six classes and then labelling them correctly in the spreadsheet. This process consumed a large share of the experimental time. The second difficulty related to the dataset was preprocessing, as most previous studies used TF-IDF. In this work, however, TF-IDF showed poor performance for all ML classifiers, which affected the performance of the proposed ensemble approach. Thus, Countvectorizer was adopted. This method improved the efficiency of all ML classifiers and consequently improved the performance of the proposed ensemble approach, as shown in Figure 5 and Table 1.
Overall, the ML classifiers performed well, with accuracy values above 90%. The best performance, 99%, was recorded for SVM and Logistic Regression, and the worst was 97% for the SVC classifier. The ensemble was able to avoid most of the errors made by the individual ML classifiers, and its accuracy of 99.45% is higher than that of any base ML classifier. This result was recorded both in the experiment that used the three most accurate ML classifiers and in the experiment that used all five, because the proposed ensemble approach selects the best performance among all classifiers. Thus, the ML classifiers with poorer performance did not reduce the performance of the ensemble.
In detail, the applied ML classifiers misclassified different classes, as shown in the confusion matrices in Figure 8. Because different ML classifiers use different classification methodologies, their errors fall in different areas.
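Because the base classifiers err on different classes, weighting each vote by how reliable a classifier is for the class it predicts can outvote a simple majority. The sketch below is one possible reading of a per-class accuracy-weighted vote, not the authors' implementation. It assumes that per-class accuracy is the fraction of validation samples of a class that a classifier labels correctly, and all classifier names, class labels, and predictions are hypothetical.

```python
from collections import Counter, defaultdict

def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of samples of each true class that were predicted correctly."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}

def weighted_vote(votes, class_weights):
    """Sum, per class, the weights of the classifiers that voted for it."""
    scores = defaultdict(float)
    for name, predicted in votes.items():
        scores[predicted] += class_weights[name].get(predicted, 0.0)
    return max(scores, key=scores.get)

# Hypothetical validation labels and predictions for three base classifiers.
validation_truth = ["x", "x", "y", "y", "y", "y"]
validation_predictions = {
    "clf_a": ["x", "x", "y", "y", "y", "x"],
    "clf_b": ["x", "y", "y", "y", "x", "x"],
    "clf_c": ["y", "x", "y", "x", "x", "x"],
}
weights = {
    name: per_class_accuracy(validation_truth, preds)
    for name, preds in validation_predictions.items()
}

# Votes of the three classifiers for one new statement.
votes = {"clf_a": "x", "clf_b": "y", "clf_c": "y"}
majority = Counter(votes.values()).most_common(1)[0][0]
print("majority vote:", majority)
print("per-class weighted vote:", weighted_vote(votes, weights))
```

In this toy case the majority chooses "y", while the weighted vote chooses "x", because the single classifier voting for "x" is far more reliable on that class than the other two are on "y".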
The time consumed by the proposed ensemble approach improved when the fastest ML classifiers were used. However, the accuracy decreased to 95.05%, because these classifiers performed poorly compared with the excluded ones. On the other hand, the time consumed by the proposed ensemble approach using all five ML classifiers was only slightly longer than the time when using the three most accurate classifiers (the difference was nearly 0.01 s).
A comparison with existing state-of-the-art methods showed that the proposed ensemble approach outperformed the others in accuracy. Although our method did not offer the best time, its time remained acceptable and did not exceed one second (0.7 s).
The previous experiments offer only one way of proving this concept, a process that will continue in future studies. For example, a Deep Learning ensemble approach should be explored and tested. Different approaches for classification could also be tested using this concept. Once the validity of the concept has been established, future studies can apply the proposed ensemble approach to large-scale datasets.

Conclusions
This research introduced an ensemble model that enhances the weighted voting ensemble by using the per-class accuracy of each base classifier to determine the best classification of FR into six classes, with the aim of improving accuracy and availability. The model was developed using the SVM, SVC, Naïve Bayes, Decision Tree, and Logistic Regression classifiers. Using all five classifiers gave the same accuracy (99.45%) as using only the three base classifiers with the highest accuracies (SVM, SVC, and Logistic Regression); only the time decreased with the smaller set, which indicates that using fewer classifiers is an improvement. The proposed approach was compared with other existing ensemble approaches on the created dataset, and it offered the best accuracy among them, 99.45%. Its required time was not the best but was still acceptable as a classification time for automated requirement analysis tools, as it did not reach one second (0.7 s).
Funding: This research received no external funding.

Conflicts of Interest: The authors declare no conflict of interest.