A New Approach to Calibrating Functional Complexity Weight in Software Development Effort Estimation

Abstract: Function point analysis is a widely used metric in the software industry for development effort estimation. It was proposed in the 1970s, and then standardized by the International Function Point Users Group, as accepted by many organizations worldwide. While the software industry has grown rapidly, the weight values specified for the standard function point counting have remained the same since its inception. Another problem is that software development differs across industry sectors, yet the same basic rules are applied to all of them. These raise important questions about the validity of weight values in practical applications. In this study, we propose an algorithm for calibrating the standardized functional complexity weights, aiming to estimate a more accurate software size that fits specific software applications, reflects software industry trends, and improves the effort estimation of software projects. The results show that the proposed algorithms improve effort estimation accuracy against the baseline method.


Introduction
Software estimation has long been considered a core issue that directly affects success or failure. According to the Standish Group [1], the failure rate of a part of a project or of a whole project is likely to be up to 83.9% (as of 2019). One of the reasons for this failure is inaccurate cost and effort estimates. In fact, to obtain software projects, companies participating in tenders must submit bids that include cost, manpower, and software development time. To be able to win the tender, the companies participating need to give a reasonable estimate of the cost, manpower, and time required to carry out the project. Reasonability here does not mean underestimating the price, because in so doing the company will not gain (if not lose) when completing the project. It is also not reasonable to overestimate the price, because then it is certain that the company will not win the bid. Therefore, a project estimate is considered reasonable only if it accurately reflects the project's actual value.
Throughout the software development process, no matter what software management model a company uses, project leaders often have to plan the work for software development milestones, plan the next milestone, and recalculate the work done in the previous milestone. All of these tasks require software estimation skills. Many methods have previously been proposed to solve the software estimation problem. Due to the increasing demand for more efficient and accurate estimation methods that can work with more complex software projects, such estimation methods need to be refined. Software estimation methods can be classified into three main groups: non-algorithmic, algorithmic, and machine learning approaches [2,3]. In the non-algorithmic category, there are two representative methods: expert judgement (EJ) [4] and analogy [5]; with these methods, experts play the most significant role in judgement. Of course, previous samples (historical dataset) also play another important role. In the algorithmic category, an algorithm takes

Problem Formulation
FPA has been used for over four decades, and has proven to be a dependable and consistent method [15] of sizing software for project estimation and productivity or efficiency comparisons. Although it has made significant contributions in the software industry, it still has many problems. In an earlier systematic literature review [16], we mentioned some limitations of FPA. The inadequacy of complexity weight is still a major problem. In addition, the locality of the dataset that builds the FPA approach (IBM projects) does not reflect the entire global software industry. Many previous studies have suggested a new functional complexity weight [17,18] in different ways. In [18], Xia et al. proposed a new functional complexity table based on an IFPUG FPA calibration model called Neuro-Fuzzy Function Point Calibration Model (NFFPCM), which integrates the power of neural networks and fuzzy logic. Nevertheless, the method needs to be changed in line with changes in the modern software industry.
Another issue that needs to be mentioned is the specificity of each piece of software. Differences in the purpose of the software being developed lead to different approaches. With FPA, a method that applies to all software estimation cases needs to be revisited; this was the motivation for us to propose a new, up-to-date, nonlocal functional complexity weight that reflects the software industry.
After counting the function points (FPs), we can calculate the effort and then write a report [9] based on these FPs. At this point, the FPA counting process is considered complete. However, one question is whether we can further improve the effort after the calculation of the counting process. Recent studies [19][20][21][22][23] show that combining an ensemble model and other approaches provides better results than using a single model. This study applies an ensemble model to the result after the counting and calculating effort to improve the accuracy gained from FP counting based on the proposed functional complexity weight.
Computers 2022, 11, 15 3 of 20 This study aims to propose an algorithm called the calibration of functional complexity weight (CFCW) algorithm. The proposed algorithm is based on the FPA method combined with regression methods implemented in the International Software Benchmarking Standards Group (ISBSG) dataset Release 2020 R1 [24].
Many companies around the world contribute to the ISBSG dataset, so the locality problem can be solved. In addition, the 2020 database update addresses the out-of-date issue. Many recent studies (for example, [25][26][27]) used the industry sector (IS) as a categorical variable for dataset segmenting in their research. Our study also tested a second approach: the calibration of functional complexity weight with optimisation (CFCWO), which is based on an ensemble model called the voting regressor.
Based on the above issues, we propose three research questions:
RQ1: Is the accuracy of the proposed CFCW algorithm better than that of the IFPUG FPA or NFFPCM methods?
RQ2: Does the advanced CFCWO algorithm outperform the CFCW algorithm?
RQ3: How accurate is the estimation for each sector compared to an ungrouped dataset?
To answer these research questions, we conducted an experimental study to evaluate the estimation accuracy of the proposed approaches.

Contributions
The main contributions of this research are as follows:

• In the first phase, a new CFCW algorithm for the calibration of functional complexity weight is proposed;
• In the second phase, the result from the first phase is optimised by using a voting regressor to estimate the final software effort (the CFCWO algorithm);
• The IFPUG FPA method is compared to the CFCW algorithm for ungrouped data and data grouped by IS;
• The CFCW algorithm is compared to the CFCWO algorithm for ungrouped data and data grouped by IS.

Related Work
FPA is a standardised method for determining the size of software based on its functional requirements; it is designed to be applicable regardless of programming language or implementation technology. Albrecht [8] recommends FPA to measure the size of a system that processes data from end-users. Since its introduction, much research has been carried out to improve its accuracy.
Al-Hajri et al. [17] introduced a modified weighting system for measuring FP using an ANN model (backpropagation technique). In their study, a weighting system was built based on four steps: (1) using the original weighting system as a baseline to establish new weights; (2) using the DETs/RETs of the original system's FPA to calculate the new values, training these new values with an ANN, and then predicting the values of the new weights; (3) applying the new weights and the original weights in the FPA model; and (4) calculating the size of FPs as a function of the original weights and the new weights. Wei et al. [28] proposed a different sizing approach by integrating the new calibrated FP weight proposed in [18] into a complexity assessment system for object-oriented development effort estimation. Misra et al. [29] proposed a metrics suite that helps in determining the complexity of object-oriented projects by evaluation of message complexity, attribute complexity, weighted class complexity, and code complexity.
Dewi et al. [30] produced a formula for estimating the cost of software development projects, especially in the field of public service applications; the authors modified the complexity adjustment factor to 16 instead of the 14 used in the standard FPA method and, as a result, the accuracy was improved by 7.19%.
In the study of Leal et al. [31], the authors investigated the use of nearest-neighbours linear regression methods for estimation in software engineering. These methods were compared with multilayer perceptron neural networks, radial basis function neural networks, support-vector regression, and bagging predictors. The dataset used in the study was a NASA software project. Based on the relative error and the estimation rate, the nearest-neighbours linear regression methods outperformed the others.
In a survey of applying ANNs to software estimation, Hamza et al. [32] provided an overview of the use of ANN methods to estimate development effort for software development projects; the authors offered four main ANN models, including (1) feedforward neural networks; (2) recurrent neural networks; (3) radial basis function networks; and (4) neuro-fuzzy networks. The survey also explains why those methods are used and how accurate they are.
In the endeavour of estimating the effort needed for the next phase or the remaining effort needed to finish a project, Lenarduzzi et al. [33] conducted an empirical study on the estimation of software development effort. The estimation was broken down by phase so that estimation could be used throughout the software development lifecycle. This means that the effort needed for the next phase at any given point in the software development lifecycle is estimated. Additionally, they estimated the effort required for the remaining part of a software development process. The ISBSG dataset was used in the study. The results show statistically significant correlations between effort expended in one period and effort expended in the following period, effort expended in one period and total effort expended, and accumulated effort up to the present stage and remaining effort. The results also indicate that these estimation models have different degrees of goodness of fit. Further attributes, such as the functional size, do not significantly improve estimation quality.
In [25,34], the authors presented an influence analysis of selected factors (FP count approach, business area, IS, and relative size) on the estimation of the work effort for which the FPA method is primarily used. They also studied the factors that influence productivity and the productivity estimation capability in the FPA method. Based on these selected factors and experimentally, the authors proved that the selected factors have specific effects on work effort estimation accuracy.
In [35], from software features, the authors used various machine learning algorithms to build a software effort estimation model. ANNs, support-vector machines, K-star, and linear regression machine learning algorithms were appraised on a PROMISE dataset (called Usp05-tf) with actual software efforts. The results revealed that the machine learning approach could be applied to predict software effort. In the study, the results from the support-vector machines were the best.
In [36], the authors conducted a comparison between soft computing and statistical regression techniques in terms of a software development estimation regression problem. Support-vector regression and ANNs were used as soft computing methodologies, and stepwise multiple linear regression and log-linear regression were used as statistical regression methods. Experiments were performed using the NASA93 dataset from the PROMISE software repository, with multiple dataset pre-processing steps performed. The authors relied on the holdout technique associated with 25 random repetitions with confidence interval calculation within a 95% statistical confidence level. The 30 pre-evaluation criteria were used to compare the results. The results of the study show that the support-vector regression model has a significant impact on precision.

IFPUG FPA
Albrecht [8] first introduced FPA in 1979, and presented the FP metric to measure the functionality of a project. This was proposed in response to a number of problems with other system size measures, such as lines of code. In 1986, the International Function Point Users Group (IFPUG) [37] promoted and popularised effective software development and maintenance management through FPA.
The IFPUG is currently the governing body for FPA, and is responsible for improving and developing counting rules and other related matters. Since the IFPUG was created, the original FPA method has been known as the IFPUG's FPA. In this study, the standard  [38]; this standard specifies a set of definitions, rules, and steps for application [9]. There are six phases in counting standards; in this study, we are only interested in two phases: (1) data function and transactional function measurement, and (2) functional size measurement.
The first-phase results are the unadjusted function points (UFPs) and the value adjustment factor (VAF) values. The UFP value can be determined based on estimations of the number of transactional functions (external input (EI), external output (EO), or external inquiry (EQ)) and data functions (internal logical files (ILFs) and external interface files (EIFs)). These components are called base functional components. Each of these, in turn, is judged as low (L), average (A), or high (H), and assigned a weight accordingly. Table 1 shows the available complexity weight of the components. The UFP total groups the counted components by type and complexity, multiplies each count by its complexity weight, and sums the results, as in Equation (1):

$$\mathrm{UFP} = \sum_{i=1}^{n} \sum_{j=1}^{m} S_{ij} \, W_{ij} \qquad (1)$$

where $S_{ij}$ represents the total of each functional component, $W_{ij}$ represents the complexity weights, $n$ is the number of types, and $m$ is the number of complexity groups. The VAF count is based on the rating of 14 general system characteristics (GSCs): data communications, distributed data processing, performance, heavily used configuration, transaction rate, online data entry, end user efficiency, online update, complex processing, reusability, installation ease, operational ease, multiple sites, and facilitate change.
There are six influence levels of GSC criteria, with each characteristic being rated with a value from 0 to 5 contingent on the level: 0 (no influence); 1 (incidental influence); 2 (moderate influence); 3 (average influence); 4 (significant influence); and 5 (strong influence throughout). The VAF count is adjusted as follows:

$$\mathrm{VAF} = 0.65 + 0.01 \sum_{i=1}^{14} \mathrm{GSC}_i \qquad (2)$$

The second-phase result is the adjusted function points (AFPs) value, which can be obtained using Equation (3):

$$\mathrm{AFP} = \mathrm{UFP} \times \mathrm{VAF} \qquad (3)$$

To estimate the effort after AFP counting, we should use another parameter. The productivity factor (PF) was described as the relationship between one FP and the number of hours needed for its development by one person. Productivity and PF were studied in [39,40]. The following formula can be used to calculate the effort using the PF:

$$\mathrm{Effort} = \mathrm{AFP} \times \mathrm{PF} \qquad (4)$$

ISBSG uses the productivity delivery rate (PDR) as a metric for efficiency. The PDR is measured in person-hours per FP. From the PDR, we can derive the PF in FPs per person-hour. We can see that the PDR is the inverted value of the PF (and vice versa) [9]. In our study, the IFPUG FPA method is the base method for proposing the new model; it is also used as the baseline against which the proposed model is compared. Additionally, the terms FPA and IFPUG FPA have the same meaning, and are interchangeable.
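The counting chain described in this section (UFP, then VAF, then AFP, then effort) can be illustrated with a minimal Python sketch. The weight table holds the standard IFPUG low/average/high values; the project counts, the GSC ratings, and the PF of 10 person-hours per FP are hypothetical numbers chosen only for illustration, and the function names are not taken from the study.

```python
# Standard IFPUG complexity weights as (low, average, high) per component type.
IFPUG_WEIGHTS = {
    "EI": (3, 4, 6),
    "EO": (4, 5, 7),
    "EQ": (3, 4, 6),
    "ILF": (7, 10, 15),
    "EIF": (5, 7, 10),
}


def unadjusted_fp(counts, weights=IFPUG_WEIGHTS):
    """Sum count * weight over every component type and complexity level."""
    return sum(
        count * weight
        for component, per_level in counts.items()
        for count, weight in zip(per_level, weights[component])
    )


def value_adjustment_factor(gsc_ratings):
    """VAF = 0.65 + 0.01 * (sum of the 14 GSC ratings, each rated 0..5)."""
    if len(gsc_ratings) != 14 or any(not 0 <= r <= 5 for r in gsc_ratings):
        raise ValueError("expected 14 GSC ratings in the range 0..5")
    return 0.65 + 0.01 * sum(gsc_ratings)


def estimate_effort(counts, gsc_ratings, pf):
    """Effort = UFP * VAF * PF, with PF taken here as person-hours per FP."""
    afp = unadjusted_fp(counts) * value_adjustment_factor(gsc_ratings)
    return afp * pf


# Hypothetical project: (low, average, high) counts for each component type.
counts = {
    "EI": (5, 3, 1),
    "EO": (2, 4, 0),
    "EQ": (3, 0, 0),
    "ILF": (1, 2, 0),
    "EIF": (0, 1, 0),
}
gsc = [3] * 14   # every characteristic rated "average influence" (assumed)
pf = 10.0        # assumed productivity factor, person-hours per FP

effort = estimate_effort(counts, gsc, pf)
print(f"UFP={unadjusted_fp(counts)}, VAF={value_adjustment_factor(gsc):.2f}, "
      f"effort={effort:.1f} person-hours")
```

Passing a different `weights` table to `unadjusted_fp` is the only change needed to evaluate calibrated weights instead of the standard ones.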

Bayesian Ridge Regression Model
The full Bayesian regression inference uses the Markov chain Monte Carlo algorithm to construct models [41]. The Bayesian modelling framework is known for its ability to deal with hierarchical data structures. In Bayesian regression techniques, regularisation parameters can be included in the estimation procedure. A regularisation parameter is not set in a hard sense, but is tuned to the data at hand. This can be done by introducing noninformative priors over the hyperparameters of the model. The $\ell_2$ regularisation used in ridge regression and classification is equivalent to finding a maximum a posteriori estimate under a Gaussian prior over the coefficients $\omega$ with precision $\lambda$. Instead of setting $\lambda$ manually, it can be treated as a random variable to be estimated from the available data [42]. To acquire a fully probabilistic model, the output $y$ is assumed to be Gaussian distributed around $X\omega$:

$$p(y \mid X, \omega, \alpha) = \mathcal{N}(y \mid X\omega, \alpha^{-1})$$

where $\alpha$, the precision of the noise, is treated as a random variable that is to be estimated from the data. Bayesian ridge regression (BRR) is a probabilistic method that builds a regression model using Bayesian inference; it combines prior information about parameters (the coefficients of software features) with the observed training data in order to acquire the parameters' posterior distribution [42]. The prior for the coefficient $\omega$ is specified by a spherical Gaussian:

$$p(\omega \mid \lambda) = \mathcal{N}(\omega \mid 0, \lambda^{-1} I_p)$$

The priors over $\alpha$ and $\lambda$ are picked to be gamma distributions [43], the conjugate prior for the precision of the Gaussian distribution.
In our study, the BRR plays a significant role in the calibration phase (see Figure 1).
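As a short worked step, the link between the Gaussian prior and ridge regularisation can be written out explicitly. The sketch below assumes that α denotes the precision of the Gaussian noise and λ the precision of the spherical Gaussian prior on ω, and that both are held fixed; under those assumptions the negative log-posterior is a penalised least-squares objective with penalty λ/α.

```latex
\begin{aligned}
-\log p(\omega \mid X, y, \alpha, \lambda)
  &= \frac{\alpha}{2}\,\lVert y - X\omega \rVert_2^2
   + \frac{\lambda}{2}\,\lVert \omega \rVert_2^2 + \text{const}, \\
\hat{\omega}_{\mathrm{MAP}}
  &= \arg\min_{\omega}\; \lVert y - X\omega \rVert_2^2
   + \frac{\lambda}{\alpha}\,\lVert \omega \rVert_2^2
   = \Bigl(X^{\top}X + \tfrac{\lambda}{\alpha} I\Bigr)^{-1} X^{\top} y .
\end{aligned}
```

In BRR the ratio λ/α is not fixed in advance; both precisions are estimated from the data, which is what removes the need to choose the ridge penalty by hand.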

Voting Regressor Model
The ensemble is a learning method that uses a specific aggregation mechanism to create a collection of prediction models, and then uses a weighted vote of their initial results to obtain the final solution [44][45][46][47][48]. The principal premise is that if reliable techniques work together as a committee, they may improve on one another and generate more significant results [44]. As a result, this method is well suited to predicting software effort, since each model has its own assumptions and setup parameters, allowing the ensemble to perform exceptionally well with some desirable statistical qualities [45]. Idri et al. [22,23] conducted a systematic literature review and mapping study, and discovered that (1) ensemble effort estimation techniques are more accurate than solo methods, (2) homogeneous ensembles are the most investigated, (3) machine learning techniques are the most used solo techniques to construct ensembles, and (4) there are two types of combiner rules used in ensemble effort estimation: linear and nonlinear.
A voting regressor [46] combines several machine learning models and returns the average of their predicted values. It fits each of the base regressors to the entire dataset. Such a regressor can help a group of estimators with similar performance levels to balance out their individual weaknesses. Ensemble approaches perform best when the predictors are as independent as possible. In general, each regressor is trained using a distinct technique, in order to make each prediction more independent of the others. This increases the likelihood that they will make different kinds of errors, which improves the ensemble's performance. Voting ensembles can be applied to classification or regression. For classification, the predictions for each label are counted, and the label with the most votes is chosen. For regression, the ensemble computes the mean of the predictions from the models. According to Witten et al. [47], a voting ensemble is appropriate when all applicable models should perform well on a predictive modelling task; in other words, the models used in the ensemble must mostly agree.
In our study, the voting regressor was used in the CFCWO algorithm in the optimization phase (Figure 1).
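The averaging idea behind a voting regressor can be sketched in a few lines of plain Python. This is an illustration only: the three base models below are deliberately simple, hypothetical stand-ins (they are not the four estimators used in the study), the class names are invented for this example, and the data are made-up effort values. Library implementations such as scikit-learn's `VotingRegressor` follow the same principle of averaging the predictions of separately fitted base estimators.

```python
from statistics import mean


class MeanBaseline:
    """Predicts the training mean of the target for every input."""

    def fit(self, x, y):
        self.value_ = mean(y)
        return self

    def predict(self, x):
        return [self.value_ for _ in x]


class ThroughOriginLine:
    """Least-squares fit of y = k * x (no intercept)."""

    def fit(self, x, y):
        self.k_ = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
        return self

    def predict(self, x):
        return [self.k_ * a for a in x]


class SimpleLine:
    """Least-squares fit of y = a + b * x."""

    def fit(self, x, y):
        mx, my = mean(x), mean(y)
        sxx = sum((a - mx) ** 2 for a in x)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        self.slope_ = sxy / sxx if sxx else 0.0
        self.intercept_ = my - self.slope_ * mx
        return self

    def predict(self, x):
        return [self.intercept_ + self.slope_ * a for a in x]


class AveragingEnsemble:
    """Fits every base model on the full data and averages their predictions."""

    def __init__(self, estimators):
        self.estimators = estimators

    def fit(self, x, y):
        for estimator in self.estimators:
            estimator.fit(x, y)
        return self

    def predict(self, x):
        per_model = [estimator.predict(x) for estimator in self.estimators]
        return [mean(column) for column in zip(*per_model)]


# Hypothetical data: a first-stage effort estimate and the recorded effort.
first_stage_effort = [100.0, 200.0, 300.0, 400.0]
recorded_effort = [120.0, 210.0, 330.0, 380.0]

base_models = [MeanBaseline(), ThroughOriginLine(), SimpleLine()]
ensemble = AveragingEnsemble(base_models).fit(first_stage_effort, recorded_effort)

new_projects = [150.0, 350.0]
ensemble_pred = ensemble.predict(new_projects)
print(ensemble_pred)
```

Because the combination is an unweighted mean, one poorly performing base model shifts the result by at most its share of the average, which is why the base models are expected to mostly agree.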

Research Methodology
In this section, we present the research methodology. This includes describing the data to be used, along with the data processing and the experimental setup. In addition, evaluation criteria are introduced here.

Experimental Setup
In this section, we describe the experimental process, which is graphically illustrated in Figure 1.
In the data pre-processing phase (Figure 1), data filtering and cleaning were performed to create the working dataset (see following section). This dataset was used for two branches of experiments: experiments on ungrouped data (all sectors), and experiments on grouped data, where the IS categorical variable was used for grouping. Fivefold cross-validation was used to create the training/testing folds. The dataset we used in our experiments was the ISBSG repository August 2020 R1 [24]. In our study, the criteria for data filtering were as follows:
1. ... IFPUG 4+);
2. Only the records where the data quality rating is A or B were selected;
3. The development type was new development;
4. The rows with an empty value of base functional components were eliminated;
5. Rows with empty values in the industry sector column were removed;
6. The rows with empty values in normalised productivity delivery rate and summary work effort (SWE) were also removed;
7. We filled the VAF blank cells with the values obtained from Equation (3).
According to Lichtenberg and Şimşek [49], a dataset must contain enough records for a given training set to attain the most satisfactory results. M. Hammad [35] also showed that some algorithms learn perfectly as the size of the training set increases. In our case, after many tests and evaluations, the results from ISs with over 30 records gave the best results. For the ISs that did not satisfy this condition (the number of records is less than 30), we gathered them into a group named "Others". Figure 2 shows a histogram of the dataset after being processed.
We notice that there is a possibility that the data may be noisy because some SWE values lie far from the group mean. In this study, we used the interquartile range (IQR) method [50,51] on these features to determine and remove outliers, with the lower boundary being 0.15 and the upper boundary being 0.85. Figures 3 and 4 show the description of the dataset before and after the removal of outliers, respectively.
The calibration phase (Figure 1) represents the CFCW algorithm. The CFCW works as follows:
1. Bayesian ridge regression (Section 4.2) is employed;
2. CFCW elicits the complexity weights for the EI, EO, EQ, EIF, and ILF variables using Bayesian ridge regression;
3. The UFP is calculated by using a newly estimated complexity weight for each of the variables (EI, EO, EQ, EIF, and ILF);
4. Estimated effort is obtained by multiplying UFP by the VAF and, finally, by multiplying by PF.
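The calibration steps listed above can be sketched as follows. This is a simplified illustration under several assumptions that are not stated in the text: one weight is fitted per component type, the regression target is a reference functional size, there is no intercept, and ordinary ridge regression with a small fixed penalty is used as a plainly named stand-in for Bayesian ridge regression (which would estimate the penalty from the data). The function names and the synthetic project counts are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ridge_weights(counts, sizes, penalty=1e-6):
    """Ridge fit without intercept: solve (X^T X + penalty * I) w = X^T y."""
    k = len(counts[0])
    xtx = [
        [sum(row[i] * row[j] for row in counts) + (penalty if i == j else 0.0)
         for j in range(k)]
        for i in range(k)
    ]
    xty = [sum(row[i] * y for row, y in zip(counts, sizes)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def cfcw_effort(count_row, weights, vaf, pf):
    """Effort = UFP (with calibrated weights) * VAF * PF."""
    ufp = sum(c * w for c, w in zip(count_row, weights))
    return ufp * vaf * pf


# Synthetic projects: counts of (EI, EO, EQ, ILF, EIF) per project.
project_counts = [
    [10, 5, 3, 2, 1],
    [4, 8, 2, 3, 0],
    [7, 2, 6, 1, 2],
    [3, 3, 3, 5, 1],
    [12, 6, 1, 4, 3],
    [5, 9, 4, 2, 2],
    [8, 1, 7, 6, 0],
]
# Sizes are generated from known weights so the recovered values can be checked.
true_weights = [4.0, 5.0, 4.0, 10.0, 7.0]
sizes = [sum(c * w for c, w in zip(row, true_weights)) for row in project_counts]

calibrated = fit_ridge_weights(project_counts, sizes)
print([round(w, 3) for w in calibrated])
print(cfcw_effort(project_counts[0], calibrated, vaf=1.0, pf=10.0))
```

On real data the fitted weights would differ by sector and would not reproduce any preset values; the synthetic target here only verifies that the fitting step behaves as expected.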
The optimisation phase (Figure 1) represents the CFCWO algorithm, which works as follows:
3. The voting regressor is an ensemble model consisting of four estimators (Table 2);
4. CFCWO optimises the effort estimated by CFCW by minimising the error with respect to the SWE (the known effort from the dataset).

Tested Models
The CFCW model and CFCWO model were tested on datasets with and without IS filtering. The variant called "all sectors" used the whole dataset without IS grouping. The IS was used as a grouping variable, allowing us to test both models on each IS independently (as seen in the Results section). The following is a brief description of the compared models:
• CFCW: effort is computed using the IFPUG approach; complexity weights are estimated by Bayesian ridge regression; PF (PDR) is the mean from all ISs or based on each IS;
• CFCWO: effort is estimated using a trained voting regressor, where the predictor is the effort value from CFCW, and the dependent variable is the SWE value (from the dataset); again, variants for all sectors and per sector were tested.
In the effort estimation optimisation, we used the voting ensemble model with four base estimators. These estimators and their parameters are described in Table 2.
CFCW and CFCWO were compared to the following models:
• IFPUG FPA [37]: effort is computed using the IFPUG approach; IFPUG-based complexity weights and PF (PDR) from the dataset (mean from all sectors or based on each sector);
• NFFPCM [18]: effort is computed using the IFPUG approach; complexity weights from the study of Xia et al. and PF (PDR) from the dataset (mean from all sectors or based on each sector).

Evaluation Criteria
Regarding measurement accuracy, according to Foss et al. [52], there have been many investigations and evaluations of the suitability of error functions proposed and used thus far. However, there is no universal solution to the problem of choosing good predictive models from among several alternatives; this means that each accuracy indicator has certain advantages over the others. Kitchenham et al. [53] pointed out that the crucial factor for meaningful comparisons between predictive models is identifying what each error function uses for actual measurements. This study uses the most common and widely used measures to evaluate the predictability of comparative models and the accuracy of the proposed model.

MAE (mean absolute error) [54]:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$$

MSE (mean squared error) [55]:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2$$

RMSE (root-mean-square error) [56]:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}$$

MAPE (mean absolute percentage error) [57]:

$$\mathrm{MAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $N$ is the number of projects.
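The four evaluation criteria can be computed directly from paired actual and predicted effort values. The following minimal sketch uses hypothetical numbers; the helper `improvement_pct` is an assumption about how a percentage improvement between two error values might be computed (relative reduction of the baseline error), not a formula taken from the study.

```python
from math import sqrt


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root-mean-square error."""
    return sqrt(mse(actual, predicted))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def improvement_pct(baseline_error, new_error):
    """Assumed definition: relative reduction of the baseline error, in percent."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical effort values in person-hours.
actual_effort = [100.0, 200.0, 400.0]
predicted_effort = [110.0, 180.0, 400.0]

print(mae(actual_effort, predicted_effort))
print(mse(actual_effort, predicted_effort))
print(rmse(actual_effort, predicted_effort))
print(mape(actual_effort, predicted_effort))
```

MAE and RMSE are expressed in the unit of effort, so they are dominated by large projects, whereas MAPE is scale-free; this is one reason the criteria can rank the same models differently.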

Results and Discussion
In this section, we evaluate the accuracy of the proposed CFCW and CFCWO from the experimental results. We compare the IFPUG FPA and NFFPCM models to the CFCW and CFCWO algorithms. Experiments were carried out for all sectors and for data grouped by IS.
The calibrated functional complexity weight values obtained from the experiment are listed in Table 3. According to the values of the individual parameters EI, EO, EQ, EIF, and ILF, the calibrated values of the weights differ from the original values. The minimum percentage deviation is approximately 2% on an ungrouped dataset, while the maximum deviation is nearly 242% against standard weights. The individual ISs show an even greater variance in deviations.
Figure 5 shows a comparison of efforts estimated by the IFPUG FPA, NFFPCM, CFCW, and CFCWO methods versus the real SWE. The effort estimated by the CFCWO approach was closest to the SWE (in all cases). This means that the proposed CFCWO approach also outperforms the IFPUG FPA, NFFPCM, and CFCW methods.
Table 4 shows the MAE value when using the IFPUG FPA, NFFPCM, CFCW, and CFCWO algorithms, along with the percentage improvement value of the CFCW method compared to the IFPUG FPA method, and the percentage improvement value of the CFCWO algorithm compared to the CFCW algorithm. Accordingly, the lowest improvement value of the CFCW method compared to the IFPUG FPA method was in the government sector, with 6.48%, while the highest was in the communication sector, with 51.55%. Overall, the estimated effort by the CFCW algorithm improved by 5.46%, while that by the CFCWO algorithm was enhanced by 11.89%.
According to this MAE evaluation criterion, the NFFPCM outperforms IFPUG FPA in most sectors, except for communication. The CFCW results are always better than those of NFFPCM, and the same holds for the CFCWO. Figure 6 shows a comparison of the results from a visual perspective. The MAE of the CFCWO algorithm is always the smallest, indicating the best estimation accuracy.
Table 5 shows the percentage difference based on the MAE evaluation criteria for each IS compared to the same algorithm applied to all sectors.
Table 6 shows the MAPE evaluation values of the proposed approach and the improvement of the CFCW and CFCWO methods compared to the IFPUG FPA method. Each value in the CFCW column is always smaller than that in the FPA column, and each value in the CFCWO column is smaller than that in the CFCW column. This means that the CFCW method is better than the FPA method, and the CFCWO algorithm is better than the CFCW algorithm. The superiority of the CFCW and CFCWO methods in comparison to the IFPUG FPA method is shown in the last two columns. Accordingly, the progress of the CFCW algorithm vs. the FPA algorithm for all sectors is 4.01% (for individual sectors, the minimum value is 1.08% in the government sector, while the maximum value is 37.51% in the communication sector). Finally, the superiority of the CFCWO method compared to the CFCW method for all sectors is 6.56% (for individual sectors, the minimum value of 4.10% is also in the government sector, while the maximum value of 14.67% is in the banking sector). Figure 7 shows a visual representation of the results.
For the CFCWO method, the values are always smaller than the others, indicating the most accurate estimation. Table 7 shows the percentage difference based on the MAPE evaluation criterion for each IS compared to the same algorithm applied to all sectors.
The results for the RMSE evaluation criterion were compared in the same way (see Tables 8 and 9, as well as Figure 8). The only difference was in the values: the lowest improvement of the CFCW method over the IFPUG FPA method was in the government sector, at 8.77%, while the highest was in the communication sector, at 54.15%. Overall, the CFCW algorithm improved the estimated results by 10.39%, and the CFCWO algorithm improved them by 24.62%. The NFFPCM model's results, in this case, were better than those of the IFPUG FPA method in some sectors (banking, communication, government, and insurance).
Based on these statements and the analysis of the previous results, we can proceed to answer the research questions.
RQ1: Is the accuracy of the proposed CFCW algorithm better than that of the standard IFPUG FPA or NFFPCM methods?
For each evaluation criterion shown in Tables 4, 6 and 8, the value in the CFCW column is always smaller than the value in the corresponding IFPUG FPA or NFFPCM column. Figures 6-8 show the same comparison visually.
To answer RQ2, which concerns the accuracy of the CFCWO algorithm relative to the CFCW algorithm, we also evaluated Tables 4, 6 and 8. The percentage improvement in accuracy of the CFCWO method over the CFCW method was MAE = 11.89%, MAPE = 6.56%, and RMSE = 24.62% for all sectors. As in the previous comparison, there was a greater improvement in the accuracy of the CFCWO estimate for the individual sectors. The largest percentage differences were MAE = 22.18% (financial sector), MAPE = 18.53% (communication sector), and RMSE = 26.77% (banking sector). The mean percentage differences of the individual sectors compared to the ungrouped dataset were MAE = 11.08%, MAPE = 9.59%, and RMSE = 12.28%.
RQ3: How accurate is the estimation for each sector compared to an ungrouped dataset?
The answer to this question is shown in Tables 5, 7 and 9. As we can see, the accuracy of the estimate in all individual sectors for all evaluation criteria is higher than for an ungrouped dataset.
When CFCW is compared to IFPUG FPA, improvement in the accuracy of the CFCW estimate can be seen for the individual sectors, where the largest percentage differences are MAE = 51.55% (communication sector), MAPE = 30.76% (banking sector) and RMSE = 54.15% (communication sector); the mean percentage differences of the individual sectors compared to the ungrouped dataset are MAE = 22.52%, MAPE = 16.26%, and RMSE = 27.75%.
When CFCW is compared to NFFPCM, improvement in the accuracy of the CFCW estimate can be seen for the individual sectors, where the largest percentage differences are MAE = 72.00% (service industry), MAPE = 67.69% (financial), and RMSE = 71.79% (service industry); the mean percentage differences of the individual sectors compared to the ungrouped dataset are MAE = 49.98%, MAPE = 49.03%, and RMSE = 38.25%.
Paired-samples t-tests [58,59] were used to check whether the CFCWO method differs significantly from the other methods, in order to confirm the evaluation conclusions (see Table 10). The notations in Table 10 reflect the statistical superiority, inferiority, and similarity (≈) of the CFCWO approach compared to each of the other methods (FPA and NFFPCM). We consider the difference in estimation accuracy between the CFCWO and an alternative approach to be significant when the p-value is less than 0.05. The results of all evaluation criteria used in this study served as the test samples for each method.
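The paired-samples t-test described above can be sketched as follows. This is a dependency-free illustration, not the authors' implementation: the two-sided p-value is obtained by numerically integrating the Student's t density with Simpson's rule, whereas in practice a statistics library would normally be used. The function names and the sample error values are hypothetical.

```python
import math
import statistics


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value, integrating the density over [0, |t|] with Simpson's rule."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return max(0.0, 1.0 - 2.0 * area)


def paired_t_test(sample_a, sample_b):
    """Return (t statistic, two-sided p-value) for paired samples.

    Assumes the paired differences are not all identical
    (otherwise the standard deviation is zero).
    """
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    n = len(diffs)
    std_error = statistics.stdev(diffs) / math.sqrt(n)
    t_stat = statistics.mean(diffs) / std_error
    return t_stat, two_sided_p(t_stat, n - 1)


# Hypothetical paired error values for two methods on the same five cases.
errors_other = [10.0, 12.0, 9.0, 11.0, 13.0]
errors_cfcwo = [8.0, 9.0, 7.0, 10.0, 10.0]

t_stat, p_value = paired_t_test(errors_other, errors_cfcwo)
print(f"t = {t_stat:.4f}, p = {p_value:.4f}, significant = {p_value < 0.05}")
```

The test is paired because both methods are evaluated on the same cases, so only the per-case differences enter the statistic.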

Threats to Validity
A threat to internal validity, which can affect the conclusions drawn from experimental research, is the use of an incorrect or inaccurate evaluation method to assess the proposed method; specifically, this refers to the technique of statistical sample validation. This threat was controlled using the k-fold cross-validation method, which helps ensure that the proposed method is accurately assessed. Another internal threat that may affect the validity of the obtained results is the choice of parameters in the machine learning technique. In this study, we use the default parameter settings of the Bayesian ridge regressor technique for the proposed algorithm.
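To make the validation procedure concrete, the following is a minimal sketch of k-fold cross-validation. It is written under stated assumptions: the value of k, the random seed, and the data are hypothetical, and a one-variable least-squares line is used as a stand-in model in place of the Bayesian ridge regressor so that the example needs no external libraries.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (stand-in model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cross_validated_mae(xs, ys, k=5, seed=0):
    """Average the test-fold MAE over k train/test splits."""
    fold_maes = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        intercept, slope = fit_line([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        errors = [abs(ys[i] - (intercept + slope * xs[i])) for i in test_idx]
        fold_maes.append(sum(errors) / len(errors))
    return sum(fold_maes) / k


# Hypothetical data: software size versus effort, with a little noise.
sizes = [float(x) for x in range(1, 21)]
efforts = [2.0 * x + 1.0 + (0.5 if x % 2 else -0.5) for x in sizes]
print(f"5-fold cross-validated MAE: {cross_validated_mae(sizes, efforts):.3f}")
```

Each project appears in exactly one test fold, so every error is measured on data the model did not see during fitting.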
External validity concerns the range of validity of the results obtained and whether they could be applied in a different context. The ISBSG repository August 2020 R1 dataset was used to assess the predictive ability of the proposed method. This dataset contains many software projects collected from different organisations worldwide that differ in terms of features, fields, size, and number of features.
The performance accuracy of the proposed method was evaluated using the MAE, MAPE, and RMSE, which are unbiased evaluation criteria according to previous research [60,61]. Therefore, we can conclude that the experimental results of this study are highly generalizable.

Conclusions and Future Work
This study presents an algorithm for calibrating the standard IFPUG FPA method, using the Bayesian ridge regressor model for calibration (CFCW) and the voting regressor model for optimising effort estimation (CFCWO), with and without dataset grouping. The paper set out to answer three research questions. In answer to RQ1, the proposed CFCW algorithm improves accuracy over the IFPUG FPA method, by an amount that depends on the evaluation criterion and on whether a grouped or ungrouped dataset is used. For the ungrouped dataset, the percentage accuracy improvement was MAE = 5.46%, MAPE = 4.10%, and RMSE = 10.39%. The mean percentage difference of the individual sectors compared to the ungrouped dataset was MAE = 22.52%, MAPE = 16.26%, and RMSE = 27.75%, showing an even greater improvement in the accuracy of the estimates. This demonstrates that the IFPUG FPA method needs calibration, and can be calibrated. When CFCW is compared to NFFPCM, the improvement is MAE = 15.39%, MAPE = 28.41%, and RMSE = 19.9% for all sectors.
The second proposed algorithm, CFCWO, brings further improvement, and outperforms the CFCW algorithm, answering RQ2. The percentage improvement varies according to the evaluation criteria and dataset. For the ungrouped dataset, the percentage accuracy improvement is MAE = 11.89%, MAPE = 6.56%, and RMSE = 24.62%. The mean percentage difference of the individual sectors compared to the ungrouped dataset is MAE = 11.08%, MAPE = 9.59%, and RMSE = 12.28%. The results also show that it makes sense to work with data belonging to a specific group. In our case, we grouped the data according to the IS. The answer to RQ3 is that the estimate's accuracy in all individual sectors for all evaluation criteria is higher than for an ungrouped dataset.
The calibrated functional complexity weight values reflect the modern software industry trend of improving work performance thanks to the development of computer technology, programming languages, and CASE tools. This manifests itself in functional complexity weight values that are smaller than the original values. In addition, the demand for sophistication and complexity of software functions also increases over time in certain areas, manifesting in calibrated functional complexity weight values that are larger than the original values.
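The effect of smaller or larger weights on the counted size can be illustrated with a minimal sketch of the unadjusted function point sum. The standard weight table below is the one defined by IFPUG; the calibrated values and the project counts are hypothetical and are not the weights obtained in this study.

```python
# Standard IFPUG functional complexity weights per function type and complexity level.
IFPUG_WEIGHTS = {
    "EI": {"low": 3, "average": 4, "high": 6},     # external inputs
    "EO": {"low": 4, "average": 5, "high": 7},     # external outputs
    "EQ": {"low": 3, "average": 4, "high": 6},     # external inquiries
    "ILF": {"low": 7, "average": 10, "high": 15},  # internal logical files
    "EIF": {"low": 5, "average": 7, "high": 10},   # external interface files
}


def unadjusted_function_points(counts, weights):
    """Sum count * weight over all (function type, complexity level) pairs."""
    return sum(n * weights[ftype][level] for (ftype, level), n in counts.items())


# Hypothetical calibrated table: one weight lowered, one raised (illustrative only).
calibrated_weights = {ftype: dict(levels) for ftype, levels in IFPUG_WEIGHTS.items()}
calibrated_weights["EI"]["low"] = 2.5
calibrated_weights["ILF"]["high"] = 16.0

# Hypothetical project: number of functions per type and complexity level.
project_counts = {("EI", "low"): 5, ("EO", "average"): 3, ("ILF", "high"): 2}

standard_size = unadjusted_function_points(project_counts, IFPUG_WEIGHTS)
calibrated_size = unadjusted_function_points(project_counts, calibrated_weights)
print(f"standard = {standard_size}, calibrated = {calibrated_size}")
```

A weight below the original value shrinks the contribution of that function type to the counted size, while a weight above it enlarges the contribution, so the net change depends on the mix of functions in the project.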
IFPUG FPA is a calculation method that estimates the size, cost, and effort in the field of software development; it plays a significant role in today's software industry. However, software engineering is a rapidly evolving field; today's actual values may not accurately reflect tomorrow's software values. Therefore, the weights proposed in this paper need to be updated according to the new trend. The ISBSG dataset is an up-to-date database of companies around the globe; it reflects the modern software industry that is constantly updated. Therefore, in the future, when project data are updated, the IFPUG FPA weighting values should be recalibrated to reflect the latest software industry trends.

Data Availability Statement:
The ISBSG data used to support the findings of this study may be released upon application to the ISBSG, which can be contacted at admin@isbsg.org or http://isbsg.org/academic-subsidy (accessed on 20 September 2021).