A Novel Data Analytics Method for Predicting the Delivery Speed of Software Enhancement Projects

Abstract: A fundamental issue in software engineering economics is productivity, and one measure of software productivity is delivery speed. Software productivity prediction is useful for determining corrective activities, as well as for identifying improvement alternatives. Enhancement is a type of software maintenance. In this paper, we propose a data-analytics-based software engineering algorithm called the search method based on feature construction (SMFC) for predicting the delivery speed of software enhancement projects. The SMFC belongs to the minimalist machine learning paradigm, and as such it always generates a two-dimensional model. Unlike the usual data analytics methods, SMFC includes an original algorithmic training procedure in which both the independent and dependent variables are considered for transformation. SMFC prediction performance is compared to that of statistical regression, neural networks, support vector regression, and fuzzy regression. To do this, seven datasets of software enhancement projects obtained from the International Software Benchmarking Standards Group (ISBSG) Release 2017 were used. The validation method is leave-one-out cross-validation, and absolute residuals were chosen as the performance measure. The results indicate that SMFC is statistically better than statistical regression, whereas the remaining methods are not statistically better than SMFC; this represents a clear advantage in favor of SMFC.


Introduction
Economics is the study of value, costs, resources, and their relationships in a given context or situation, whereas software engineering economics involves decision making related to software engineering in a business context. Although software engineering economics is concerned with aligning software technical decisions with the business goals of the organization, in many companies the relationship between the software business and software development and engineering remains vague.
The fundamentals of software engineering economics are finance, accounting, controlling, cash flow, the decision-making process, valuation, inflation, depreciation, taxation, the time value of money, efficiency, effectiveness, and productivity. From an economic perspective, productivity has been defined as the ratio of output over input (i.e., maximizing productivity is about generating the highest value with the lowest resource consumption). Output is the value delivered, whereas input covers all resources spent to generate the output [1].

The remainder of this paper is organized as follows: Section 2 presents the related work, and Section 3 presents the basic elements of the minimalist machine learning paradigm, while Section 4 describes in detail the central proposal of this article (SMFC), carefully exemplifying each of the four steps of the new model; some considerations about the complexity of the algorithm are also included. Section 5 describes the criteria observed to select the datasets of software enhancement projects. Section 6 presents the experimental results, whereas Section 7 presents the discussion, conclusions, and limitations of our proposal, as well as future work.

Related Work
This section briefly reviews the few articles dealing with software delivery speed and the range of topics covered in relation to it. In addition, brief descriptions are given of the three models mentioned in the introduction, against which our proposal is compared.

Delivery Speed
Delivery speed is a subject that is rarely covered in the scientific literature of software engineering. This topic has recently been studied from several approaches, such as its influence on globally distributed projects [21], its relationship with quality improvement [22], as well as the value aspect in agile software development organizations [23].
Moreover, the influence of some factors on delivery speed has been analyzed, such as reuse [24], the application of automated toolchains, and agile practices such as Kanban, Scrum, and Extreme Programming (XP) [25].
It is pertinent to emphasize that, in the extensive documentary research carried out during the development of this paper, the authors did not find any studies related to delivery speed prediction. This absence justifies the originality of the approach taken in the research reported herein. The good results presented in this paper open up a new vein of scientific research: delivery speed prediction using data analytics methods.
According to a systematic review of machine-learning-based effort prediction studies published in 2012, neural networks and support vector regression (SVR) reported the best prediction performance [26], whereas according to another systematic review published in 2018, their application remains relevant [27]. Considering this, in addition to MLR, a multilayer feedforward perceptron (MLP) neural network and two types of SVR have also been applied in this article for comparison with SMFC performance.

MLR and FR
Statistical regression is a very popular technique used in many diverse areas of science and engineering. Specifically, in software engineering, statistical regression is the usual technique when regression of functions is required. In fact, one of the conditions for taking any new proposal in software engineering into account is that its performance should at least outperform that of statistical regression [20].
If, when solving a regression problem, a dependent variable y depends on two or more independent variables x1, x2, . . . , xk, statistical regression is called multiple linear regression (MLR). It is assumed that y is a linear function of the independent variables x1, x2, . . . , xk. This linear relationship between the independent variables and the dependent variable is modeled by Expression (1) [28]:

y = b0 + b1·x1 + b2·x2 + . . . + bk·xk (1)

where b0, b1, . . . , bk are constants whose values must be adjusted according to the data of the problem under study.
If we restrict ourselves to the case of two independent variables and apply a logarithmic transformation, Expression (1) becomes Expression (2), where the new constants are a, b, c:

ln(y) = a + b·ln(x1) + c·ln(x2) (2)

In Expression (1), which represents the MLR model, b0, b1, . . . , bk are constants whose values must be adjusted. If these parameters are substituted with fuzzy intervals, the model becomes fuzzy regression (FR), the fuzzy version of MLR. FR determines a fuzzy linear relationship between the independent variables and the dependent variable [28].
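As a minimal sketch (with synthetic data, not the ISBSG projects), Expression (2) can be fitted by ordinary least squares in log space; the coefficient values below are arbitrary placeholders chosen so that the fit can recover them exactly.

```python
import numpy as np

# Synthetic data generated from known coefficients (a=0.5, b=1.2, c=0.3),
# so ordinary least squares in log space should recover them.
x1 = np.array([4.3, 7.1, 12.5, 20.0, 33.4])
x2 = np.array([1.1, 2.0, 2.5, 4.0, 5.5])
y = np.exp(0.5 + 1.2 * np.log(x1) + 0.3 * np.log(x2))

# Expression (2): ln(y) = a + b*ln(x1) + c*ln(x2).
# Build the design matrix in log space and solve by least squares.
X = np.column_stack([np.ones_like(x1), np.log(x1), np.log(x2)])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
a, b, c = coef
```

Predictions in the original scale are then obtained by back-transforming, i.e., y = exp(a + b·ln(x1) + c·ln(x2)).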

MLP
A neural network (NN) deals with real-world problems that are nonlinear. It is defined by its learning paradigm, learning algorithm, and topology. A learning paradigm can be supervised, unsupervised, or a reinforcement process: the supervised paradigm is commonly used for classification and prediction applications, the unsupervised one for data clustering and segmentation, and reinforcement is usually applied to optimization over time and adaptive control. In the supervised paradigm, the learning algorithm calculates the difference between the correct output and the actual prediction generated by the neural network; this difference is then used to adjust the weights of the NN so that, next time, the prediction is closer to the desired output [29].
The way in which the neural processing units, called neurons, are interconnected influences the NN performance. An NN has a set of neurons that receive inputs from the outside world; these neurons are known as input units. Moreover, an NN has one or more hidden layers, also consisting of neurons, that receive inputs from other neurons. Each layer receives a data vector or the outputs of a previous layer of neurons and processes them in parallel. The neuron representing the final result of the NN is called the output unit [26].
Regarding topology, feedforward, limited recurrent, and fully recurrent networks are three types of connection topologies that define how the data flow between the input, hidden, and output neurons. A connection topology does not refer to any specific type of activation function or training paradigm. In our study, a multilayer feedforward perceptron (MLP) with the back-propagation learning algorithm is applied, since it has been the most commonly used algorithm and has been successfully applied to effort prediction [27]. An MLP uses a feedforward topology, supervised learning, and the back-propagation learning algorithm.
An MLP can have a layer of input neurons, one or more layers of hidden neurons, and finally a layer of output neurons. In this study, an MLP with a single hidden layer is used, since such a network can model any continuous function to any desired degree of accuracy.
In an MLP, the data flow in one direction, and the response is based on the current set of inputs. For the current study, the size of the projects and the number of developers per project enter the MLP through the input neurons. The input values of the software projects are assigned to the input neurons as the unit activation values, and each output value is modulated by means of connection weights. Each neuron combines all of its input signals together with a threshold value; this combined signal is passed through an activation function to determine the actual output of the neuron.
The activation function suggested for the hidden-layer neurons is nonlinear because of its capacity for learning nonlinear relationships among variables. The function most frequently used in the literature for such problems is the sigmoid, which converts an input value into an output ranging from 0.0 to 1.0 [29]. Since this study is related to prediction, the activation function for the output layer is linear. The sigmoid and linear functions used here are described in the following equation:

Φ(v) = 1/(1 + e^(−v)) (sigmoid), Φ(v) = v (linear) (3)

where v is the internal state of the neuron, calculated as the inner product of the input vector and the weight vector plus a bias value, whereas y = Φ(v) corresponds to the output of the neuron.
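As a minimal sketch (not the exact network used in the experiments), the forward pass of such an MLP, with sigmoid hidden units and a linear output unit, can be written as follows; the weights in the usage example are arbitrary placeholders.

```python
import numpy as np

def sigmoid(v):
    """Sigmoid activation: squashes any real v into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

def mlp_forward(x, W_hidden, b_hidden, w_out, b_out):
    """Single-hidden-layer MLP forward pass: sigmoid hidden units,
    one linear output unit (suitable for regression)."""
    v_hidden = W_hidden @ x + b_hidden  # internal states of hidden neurons
    h = sigmoid(v_hidden)               # nonlinear hidden activations
    return float(w_out @ h + b_out)     # linear output neuron

# Placeholder weights: with all-zero hidden weights every hidden unit
# outputs sigmoid(0) = 0.5, so the result is w_out.sum() * 0.5 + b_out.
x = np.array([0.5, -1.0])
y = mlp_forward(x, np.zeros((3, 2)), np.zeros(3), np.ones(3), 0.25)
```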

SVR
The main concept of an SVM is to distinguish items into two groups. This model seeks the optimal hyperplanes, determined by support vectors, for linearly separable classes [30]. Equation (4) describes an SVM in the plane, where w is the normal vector to the separating line, x is a point satisfying the equation, and b is the bias (a measure of the offset of the separating line from the origin):

w · x + b = 0 (4)
Equation (5) is used to find the optimal line from the Lagrange multipliers αi, where the training observation labels yi are either 1 or −1:

max_α Σi αi − (1/2) Σi Σj αi αj yi yj (xi · xj) (5)
The result is w = Σi αi yi xi, a vector corresponding to a linear combination of the support vectors xi [31].
When the classes are not linearly separable, that is, when no hyperplane separates the two classes, a "soft margin" is used: the number of errors in the two splits is minimized while the margin between the splits is maximized. Slack variables measure the degree of misclassification per data point during the training phase, a penalty function is used, and the Lagrange multipliers are restricted by a parameter C (0 ≤ αi ≤ C). An SVM can also use kernels, which map the data into a higher-dimensional space where a linear separation is assumed. The kernels preferred in the state of the art are the linear, polynomial, radial basis function, and sigmoid kernels.
According to [32], support vector regression (SVR) is a type of SVM applicable to regression tasks. There are two types of SVR: ε-support vector regression (ε-SVR) and υ-support vector regression (υ-SVR). The former is trained with a symmetrical loss function (i.e., ε-insensitive) that penalizes high and low misestimates equally. An SVR looks for a function f(x) deviating at most ε from the target yi for each data point xi. As for υ-SVR, it also minimizes the ε-insensitive loss function but introduces a new parameter υ (between 0 and 1) instead of ε. This υ parameter controls the number of support vectors, allowing data compression and generalizing the prediction error bounds.
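As an illustration of the symmetrical loss just described, the ε-insensitive loss can be computed as in the following minimal numpy sketch; the default ε value is an arbitrary example, not a setting used in the study.

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Symmetric ε-insensitive loss: residuals inside the ±ε tube cost
    nothing; outside it, the penalty grows linearly with the residual."""
    return np.maximum(np.abs(y_true - y_pred) - eps, 0.0)
```

For example, a residual of 0.05 falls inside the tube and costs nothing, whereas a residual of 0.3 is penalized by 0.2.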

Basic Elements of the Minimalist Machine Learning Paradigm
The minimalist machine learning paradigm was recently presented as a response to a problem that afflicts the areas of machine learning and artificial intelligence [19]. Many of the most effective models used in these areas, especially the intelligent pattern classification models, are not explainable. That is, in order to achieve good performance in the classification of patterns, the models are less transparent and less explainable, because complicated algorithmic steps are included. Specialists say that these models and algorithms behave like "black boxes" [33].
The SVM model is a clear example, because its good performance is based on the kernel trick to achieve separability of the classes [30]. However, the use of the kernel brings problems: the patterns are transformed and represented in a space of greater dimension than the original one, which decreases explainability and increases complexity in exchange for better results.
With the new minimalist machine learning paradigm, it has been shown that models of this paradigm are capable of minimizing classification errors without becoming "black boxes".
The new paradigm is based on the strong assumption that it is possible to reduce any pattern classification problem to a graphical problem on the Cartesian plane. This holds regardless of how large the pattern dimension is.
The algorithms of this new paradigm are effective, transparent, and explainable. Additionally, their high efficiency and effectiveness are due to only a few simple operations being used in both phases, learning and classification. The interpretation of the results is immediate, because the user unambiguously and immediately sees how the classification is carried out.
The idea is to convert, through a simple operation, all the features of a given pattern into a single real value, which will be located on the horizontal axis of the Cartesian plane. Then, through another simple operation, convert all the features of that same pattern into another real value, which will be located on the vertical axis of the Cartesian plane. Both values form an ordered pair whose graph is a point in the plane.
After performing the steps described above on all the patterns of the classification problem to be solved, it is expected that two groups of points, separable by a horizontal line, will be formed.
The ideal case is one in which all the points of one class (C1) lie above that horizontal line, while all the points of the other class (C2) lie below it. Figure 1 illustrates the case in which the two simple operations involved are the standard deviation and the mean. The reader can find in [19] an example developed from beginning to end. The example applies these two simple operations to the patterns of a real cancer-related dataset. The results can be replicated with the support of a pocket calculator or modest computer equipment.
The reader will witness the power of the minimalist machine learning paradigm when verifying the results. While two of the best classifiers in the state of the art show excellent performances (SVM: 92.85%, and MLP: 96.42%), the minimalist machine learning paradigm model obtained 100% accuracy.
One might wonder whether, besides the standard deviation and the mean, there are other operations useful for this paradigm. The answer is that any operation that converts an array of numbers into a real number could be useful. In fact, some combinations of operations on subsets of features may also be useful. The minimalist machine learning paradigm has just been born and has opened up a host of novel veins of scientific research.
One might also wonder whether, for any dataset, there is a horizontal line that separates the classes. The answer is a resounding no: this would contradict the No Free Lunch Theorem [31]. Although for many datasets this separation line does not exist, in these cases the goal is no longer to achieve zero errors, but rather to minimize the number of errors, which opens up other scientific research topics.
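The two-operation idea can be sketched in a few lines of code. The following is a minimal illustration with a hypothetical dataset and threshold, not the cancer dataset of [19]: each pattern is mapped to the point (standard deviation, mean) on the Cartesian plane and classified by a horizontal line.

```python
import numpy as np

def to_plane(patterns):
    """Map each n-dimensional pattern to a single point on the Cartesian
    plane: x = standard deviation, y = mean of its features."""
    return np.column_stack([patterns.std(axis=1), patterns.mean(axis=1)])

def classify(patterns, threshold):
    """Assign class C1 to points above the horizontal line y = threshold,
    class C2 to points below it."""
    points = to_plane(patterns)
    return np.where(points[:, 1] > threshold, "C1", "C2")

# Hypothetical patterns: the first two rows have large feature values
# (high mean), the last two have small ones; the threshold is chosen
# by inspecting the resulting plot.
data = np.array([[9.0, 8.5, 9.5],
                 [8.0, 9.0, 8.2],
                 [1.0, 1.5, 0.5],
                 [2.0, 1.0, 1.8]])
labels = classify(data, threshold=5.0)
```

Note that the entire "training" consists of choosing the threshold on a plot, which is precisely why the result is transparent and explainable.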

Our Proposal: Search Method Based on Feature Construction (SMFC)
The proposal introduced in this paper is a data-analytics-based method for predicting the delivery speed of software enhancement projects; it has three parts. The first part is the most important, because it gives the SMFC the character of a model belonging to the minimalist machine learning paradigm. This first part consists of a set of variable transformations that allow the generation of a two-dimensional model; the importance of this model is that it describes the problem to be solved. The second part is a simple linear regression (SLR), while the third part consists of applying a metaheuristic search to optimize the parameters of the SLR model. Section 4.1 describes five illustrative cases to exemplify the process of transforming variables and includes all the variable transformations used in the proposal. Then, Sections 4.2 and 4.3 describe the SLR model and the metaheuristic, respectively. Section 4.4 explains the SMFC model as a four-step integral whole and includes some considerations on the complexity of the algorithm.

Variable Transformation and Overview
The first novel characteristic of SMFC is that it always generates a two-dimensional model, in contrast to machine learning techniques such as SVM, in which the input vectors are mapped into a high-dimensional space [34]. In this sense, the basic assumption, and an evident advantage of this proposal, is that regardless of the number of predictors (independent variables), it is always possible to find a transformation that generates a two-dimensional model whose representation space is the Cartesian plane.
An additional original characteristic of SMFC is that, for its training, both independent and dependent variables are considered for transformation. This is remarkable due to the novelty of including the dependent variable in the transformed independent variable. To the authors' knowledge, no previous study has taken advantage of this idea.
In the acronym SMFC, FC means "Feature Construction". In this model, those features are built by means of elemental transformations from the two types of variables: independent and dependent.
Consider a problem in the software engineering field whose set of involved variables, V, includes (without loss of generality) a dependent variable vd and a set Vi of independent variables (i.e., predictors) whose cardinality can be higher than one:

V = Vi ∪ {vd}, with |Vi| ≥ 1

The transformations are described next by means of illustrative cases.

First illustrative case: The problem regarding the delivery speed (DS) of software enhancement projects in our study involves a dependent variable vd = DS, as well as a set of two independent variables Vi = {UFP, MTS}; thus, the cardinality of the set of independent variables is two.
Let us now consider a finite set of n elemental transformations T = {τ1, τ2, . . . , τn} with n ∈ Z+, where each τi ∈ T can be either an arithmetic operation (involving the problem variables, and possibly other real parameters), a linear function, a nonlinear function (such as a trigonometric, logarithmic, or exponential function), an elemental statistical operation, or another option.
The main objective of the FC is to select a collection of elemental transformations τi and apply each of them to specific values of elements from the power set 2^|V|, such that a collection of points on the plane is obtained. This collection corresponds to pairs of specific values involving the dependent and independent variables of the problem.

Second illustrative case: The following power set is obtained from the first illustrative case:

2^|V| = {∅, {DS}, {UFP}, {MTS}, {DS, UFP}, {DS, MTS}, {UFP, MTS}, {DS, UFP, MTS}}

Third illustrative case: There exists an infinite number of possibilities for selecting transformation combinations τi with elements from 2^|V|, which can be combined with either real parameters or the results of other transformations τi applied to other elements from 2^|V|, such that a pair of specific values is obtained in each case. For instance, if τ1 were the sum of real numbers and τ1 were applied to the two independent variables, we would have τ1(UFP, MTS) = UFP + MTS; if, in addition, τ2 were the power function of real numbers, we could apply τ2 to the following two arguments: the dependent variable (i.e., DS) and the result obtained from τ1(UFP, MTS) = UFP + MTS, which would result in:

τ2(DS, τ1(UFP, MTS)) = DS^(UFP + MTS)

Fourth illustrative case: We must now select, from the infinite number of transformation combinations τi with elements from 2^|V| described in the third illustrative case, those combinations that best fit the specific problem we want to solve. SMFC includes the application of the selected transformations to the specific values of the selected variables in a convenient order, such that a set of pairs of values is obtained; an SLR model is then applied to this data set of pairs. Since the training data set of the problem to be solved consists of N software enhancement projects, each set of specific values corresponds to one of the projects. For instance, consider the first software project, having the following specific values: DS1 = 2.04, UFP1 = 4.34, and MTS1 = 1.10.
In accordance with the concepts described in the previous illustrative cases, the SMFC general algorithm consists of applying all the transformations τi ∈ T to the specific values of the selected elements of 2^|V|, as well as to the values obtained from previous transformations, in the proper order; that is, the transformation τk ∈ T is applied either to an element Vk ∈ 2^|V|, or to any value argk obtained from the application of one or more of the transformations {τ1, τ2, . . . , τk−1}. This procedure is performed iteratively for all values of k. At this point, the values for the whole training data set in the transformation space have been obtained. The fifth illustrative case describes how these values are converted into a problem to which an SLR model can be applied.

Fifth illustrative case: Once all the transformations defined in the third illustrative case have been applied to the N software enhancement projects included in the training data set of the fourth illustrative case, N real values are obtained; that is, a specific result rµ is obtained for each project µ, where µ ∈ {1, . . . , N}.
In particular, a result r1 was obtained for the first software project. The transformations τ1 and τ2 described in the third, fourth, and fifth illustrative cases were presented with an explanatory objective. The original model introduced in our study corresponds to a particular case of the SMFC general algorithm: SMFC is applied to the solution of the described problem, that is, to delivery speed prediction of software enhancement projects with UFP and MTS as the independent variables.
SMFC includes the following five transformations (two of them of the "product of a real parameter by a variable" type):

τ1: product of the real parameter a by a variable;
τ2: product of the real parameter b by a variable;
τ3: arithmetic addition;
τ4: natural logarithm function (ln);
τ5: arithmetic product.

These five transformations are applied to each µ ∈ {1, . . . , N} in the following order: τ1 is applied to MTSµ, obtaining a · MTSµ; τ2 is applied to DSµ, obtaining b · DSµ; τ3 is applied to a · MTSµ and b · DSµ, obtaining a · MTSµ + b · DSµ; τ4 is applied to that result, obtaining ln(a · MTSµ + b · DSµ); and τ5 is applied to both UFPµ and ln(a · MTSµ + b · DSµ), obtaining UFPµ · ln(a · MTSµ + b · DSµ). Finally, the following result rµ is obtained for each µ ∈ {1, . . . , N}:

rµ = UFPµ · ln(a · MTSµ + b · DSµ)
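The five-step composition above can be sketched directly in code. The following is a minimal illustration; the values a = b = 1 in the usage example are arbitrary placeholders, since in SMFC these parameters are optimized later.

```python
import numpy as np

def transform(ufp, mts, ds, a, b):
    """Apply the five elemental transformations in order:
    τ1: a*MTS, τ2: b*DS, τ3: their sum, τ4: natural log, τ5: product with UFP."""
    t1 = a * mts                 # τ1
    t2 = b * ds                  # τ2
    t3 = t1 + t2                 # τ3
    t4 = np.log(t3)              # τ4 (requires a*MTS + b*DS > 0)
    return ufp * t4              # τ5: r = UFP * ln(a*MTS + b*DS)

# First-project values from the text (DS1 = 2.04, UFP1 = 4.34, MTS1 = 1.10);
# a = b = 1 are placeholders, not optimized values.
r1 = transform(ufp=4.34, mts=1.10, ds=2.04, a=1.0, b=1.0)
```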

Simple Linear Regression
One of the advantages of our proposal is that the application of the transformations leads to a two-dimensional model. This advantage is clearly reflected in the following fact: it is possible to apply an SLR model in order to fit the data to a line in the plane [35]. This contrasts with studies that implement multiple linear regression due to the existence of multiple predictor variables.
The following step is highly relevant for the SMFC: an independent variable is selected and its values are represented on the X-axis of the transformation space, whereas the corresponding N values rµ are represented on the Y-axis. In our study, the N values UFPµ are those represented on the X-axis.
There is no general rule for selecting which independent variable should be represented on the X-axis of the transformation space; for this reason, this selection is one of the decisions to be taken into account when tuning the model parameters for the specific problem to be solved. A similar issue arises when tuning the parameters of SVM, neural networks, or other models [32].
An SLR model is applied to the following expression, and the real parameters a and b are optimized by means of a metaheuristic technique:

rµ = UFPµ · ln(a · MTSµ + b · DSµ)

We emphasize the following as a relevant feature of SMFC over other recent models used for prediction in the software engineering field: the X-axis represents the specific values of one of the independent variables; however, on the Y-axis of the SMFC transformation space, the dependent variable values do not appear explicitly, but are implicitly contained in the represented values. Thus, the application of elemental algebraic operations allows the DS value to be explicitly recovered from r. In this final step, a predicted DS value is generated for a software enhancement project contained in the testing data set. This procedure is described in Section 4.4 of this article.
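As a minimal sketch of this part, the following code fits an SLR of r on UFP and then recovers a DS prediction by algebraically inverting the transformed expression. The recovery formula DS = (exp(r/UFP) − a·MTS)/b is our elementary rearrangement of r = UFP·ln(a·MTS + b·DS), and all numeric values are illustrative.

```python
import numpy as np

def fit_slr(x, y):
    """Ordinary least-squares fit of the line y = beta0 + beta1 * x."""
    beta1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    beta0 = y.mean() - beta1 * x.mean()
    return float(beta0), float(beta1)

def predict_ds(ufp, mts, a, b, beta0, beta1):
    """Estimate r from the fitted line, then invert
    r = UFP * ln(a*MTS + b*DS) algebraically to isolate DS."""
    r_hat = beta0 + beta1 * ufp
    return (np.exp(r_hat / ufp) - a * mts) / b
```

In SMFC the parameters a and b inside the logarithm are not fitted by the SLR itself; they are tuned by the metaheuristic search described in the next subsection.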

Metaheuristic Search
SMFC uses a metaheuristic optimization technique for finding the best parameters for the SLR model. The prediction problem is now seen as an optimization problem regarding the transformation parameters with the objective of obtaining the best results for the SLR model.
To tackle optimization problems, almost any search heuristic could have been used. However, metaheuristic approaches are able to approximate a solution close to the global optimum in a relatively brief time because of their ability to stave off stagnation at local minima by accepting worse solutions with a non-zero probability. The metaheuristic technique chosen for our proposal is simulated annealing, whose basic algorithm for a minimization problem is the following [36]:

1. Create a starting candidate solution randomly. Then, evaluate its worth as a measure of its energy E. The system is then heated up to a starting temperature T, which is usually a high value.

2. Move the search along by slightly modifying the starting candidate solution and evaluate its energy E_new. The process for generating the new solution is domain-specific, and the search greatly depends on a proper neighbor generation method.

3. Calculate the probability that the modified solution is accepted: a solution that descends the gradient, and is therefore better, is always accepted. Otherwise, the probability of accepting a solution worse than the current one depends on the temperature of the system; as the system cools, this probability becomes smaller.

4. Cool down the temperature T according to the previously specified cooling schedule. This schedule determines how soon the algorithm will stop accepting worse solutions and, in turn, how soon it will start performing local rather than global exploration. Different cooling schedules have been proposed, such as exponential and linear descents of temperature.

5. Check whether the algorithm must stop, depending on the pre-defined stopping conditions. If this is not the case, a new iteration is necessary, and the algorithm returns to Step 2. Often-used conditions include reaching a target fitness value or exceeding an established limit of iterations.
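The five steps above can be sketched, as a minimal illustration, in the following Python; the objective function, neighbor generator, and cooling parameters are all hypothetical stand-ins, not the settings used in SMFC.

```python
import math
import random

def simulated_annealing(energy, start, neighbor,
                        t_start=100.0, alpha=0.95, iters=2000):
    """Minimize `energy` following Steps 1-5: evaluate a starting
    solution, repeatedly move to a neighbor, accept worse moves with a
    temperature-dependent probability, and cool T exponentially until
    the iteration limit is reached."""
    random.seed(42)                        # reproducible toy run
    current, e_cur = start, energy(start)  # Step 1
    best, e_best = current, e_cur
    t = t_start
    for _ in range(iters):                 # Step 5: iteration limit
        cand = neighbor(current)           # Step 2: slight modification
        e_new = energy(cand)
        # Step 3: always accept improvements; otherwise accept with
        # probability exp(-(E_new - E) / T), which shrinks as T cools.
        if e_new < e_cur or random.random() < math.exp((e_cur - e_new) / t):
            current, e_cur = cand, e_new
            if e_cur < e_best:
                best, e_best = current, e_cur
        t *= alpha                         # Step 4: exponential cooling
    return best, e_best

# Toy one-dimensional objective (hypothetical): minimize (x - 3)^2.
x_best, e_best = simulated_annealing(
    energy=lambda x: (x - 3.0) ** 2,
    start=-8.0,
    neighbor=lambda x: x + random.uniform(-0.5, 0.5))
```

In SMFC, the candidate solution would instead be the pair of transformation parameters (a, b), and the energy would be the fitting error of the resulting SLR model.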

The SMFC Model
Let us start with a training data set of N software enhancement projects. Each training software enhancement project µ, with µ ∈ {1, . . . , N}, contains two independent variables (UFP_µ, MTS_µ) and a dependent variable DS_µ (i.e., UFP_µ and MTS_µ act as predictors for DS_µ).
Note that the graph of this problem lies in three-dimensional space, because three variables are involved. To generate a graph, we first plot the ordered pair (UFP_µ, MTS_µ) on the X-Y plane and then locate the value of DS_µ on the Z axis.
If we consider the values with which the fourth illustrative case was exemplified, the pair (UFP_1, MTS_1) = (4.34, 1.10) would be plotted on the X-Y plane, and for that point the value DS_1 = 2.04 would be located on the Z axis.
One of the great advantages of models that belong to minimalist machine learning was emphasized in Section 3: with the new paradigm it is possible to reduce any pattern-classification problem to a graphical problem on the Cartesian plane.
Since the SMFC model belongs to the new paradigm, being an adaptation of it to the regression task, the graph of the variable values of each project can be expressed in the Cartesian plane, with all the advantages that this brings. The application of Expression (20) allows us to work in the Cartesian plane rather than in three-dimensional space, as would be mandatory with the original, untransformed data.
The four algorithmic steps, including the three SMFC parts described in the three previous subsections, are described and exemplified next.
Step 1: The five transformations described in Section 4.1 are applied to each software enhancement project µ, with µ ∈ {1, . . . , N}, taken from the training set, such that the following variable transformation is obtained: where a and b are the transformation parameters, and r_µ is the resulting transformed variable. Note that r_µ is a real value, and that all original variables, both independent and dependent, intervene in the creation of the transformed variable.
In Expression (20), the value of r_µ for each project is obtained through a small number of elementary operations. Consequently, processing the N projects of the training set has a running-time complexity of O(N).
This means that Step 1 of the learning phase, the part of our proposal related to minimalist machine learning, has linear complexity.
Step 2: An SLR is applied to the variables, with the UFP_µ values represented graphically on the X-axis and the r_µ values represented on the Y-axis, in order to fit the values to a linear function.
After completing this step, it is now possible to graph the problem on the Cartesian plane.
Step 3: For each pair of values of the a and b parameters, and for each software enhancement project µ, with µ ∈ {1, . . . , N}, taken from the training set, an r_µ value is obtained that implicitly contains the predicted value DS_µ^pred. The term DS_µ^pred can be expressed algebraically in an explicit manner from the r_µ expression. Since the software enhancement projects belong to the training set, the correct value of DS_µ is known in advance for each project µ, with µ ∈ {1, . . . , N}; therefore, it is possible to calculate the absolute residual (AR) generated by the SLR for that µ: In this article, absolute residuals are used as the prediction criterion to evaluate performance. Now, simulated annealing is applied to find the a and b parameter values that minimize the mean of the absolute residuals (MAR): After applying the previous three steps to the problem data, the SLR model has been generated with the optimized parameters a_opt and b_opt.
Step 4 is of great importance because it consists of applying the SLR model obtained in the three previous steps to the test patterns. With the application of Step 4, a value of delivery speed can be estimated for each testing pattern.
Step 4: Let t be the index of a software enhancement project belonging to the testing data set. The value UFP_t is located on the X-axis of the SLR optimized by the a_opt and b_opt parameters, and its corresponding r_t is obtained.
The predicted value DS_t^pred can be expressed implicitly from the r_t expression as follows: Note that in Expression (23) all the values are known except DS_t^pred. By means of elementary algebraic operations, this value is obtained: In Expression (24), the value of DS_t^pred for each testing project is obtained through a small number of elementary operations. Consequently, processing each testing project has a running-time complexity of O(1).
This means that Step 4 of the operation phase, the part of our proposal related to minimalist machine learning, has constant complexity.
From the value obtained in (24) and the value DS_t, which is known from the formulation of the problem, the absolute error for project t is calculated: Finally, with all E_t values, the performance of the SMFC model is estimated by calculating the mean of the absolute errors.
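As a concrete illustration of Steps 1 through 4, the following sketch wires a transformation, an SLR fit, the MAR objective, and the O(1) prediction together. Because Expression (20) is not reproduced here, the function `transform_r` is a hypothetical stand-in with the same shape (it combines UFP, MTS, and DS through parameters a and b); only the overall structure, not the formula itself, reflects the SMFC.

```python
import math

def transform_r(ufp, mts, ds, a, b):
    # Hypothetical stand-in for Expression (20): the actual SMFC
    # transformation combines UFP, MTS, and DS via parameters a and b.
    return ds + a * math.log(ufp) + b * math.log(mts)

def fit_slr(xs, ys):
    """Ordinary least squares for y ~ c0 + c1*x (Step 2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    c1 = sxy / sxx
    return my - c1 * mx, c1

def training_mar(a, b, ufp, mts, ds):
    """Objective minimized by simulated annealing in Step 3: the MAR."""
    r = [transform_r(u, m, d, a, b) for u, m, d in zip(ufp, mts, ds)]  # Step 1, O(N)
    c0, c1 = fit_slr(ufp, r)                                           # Step 2
    ars = []
    for u, m, d in zip(ufp, mts, ds):
        r_fit = c0 + c1 * u
        # Invert the (hypothetical) transformation to make DS_pred explicit
        ds_pred = r_fit - a * math.log(u) - b * math.log(m)
        ars.append(abs(d - ds_pred))                                   # AR per project
    return sum(ars) / len(ars)                                         # MAR

def predict_ds(ufp_t, mts_t, a_opt, b_opt, c0, c1):
    """Step 4: O(1) prediction for a testing project t."""
    r_t = c0 + c1 * ufp_t
    return r_t - a_opt * math.log(ufp_t) - b_opt * math.log(mts_t)
```

Simulated annealing would then call `training_mar` as its energy function, with candidate (a, b) pairs as the solutions being perturbed.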

Data Sets Used for Training and Testing
The ISBSG Release May 2017 data set includes data from 8012 projects implemented between 1989 and 2016. ISBSG includes four types of development: new, enhancement, migration, and re-development. In our study, enhancement projects were selected.
The data sets of enhancement projects for our study were selected following the "Guidelines for use of the ISBSG data", that is, taking into account data quality, sizing method, development platform, and programming language generation [37].
The ISBSG reports delivery speed as functional size units per elapsed month (i.e., UFP/month).
The UFP value is a composite value calculated from five independent variables (inputs, outputs, inquiries, internal files, and external files), whereas the number of participants is termed the max team size (MTS), defined by the ISBSG as "The maximum number of people during each component of the work breakdown who are simultaneously assigned to work full-time on the project at least one elapsed month" [3].
The counting of UFP involves two data functions (i.e., internal logical files and external interface files) and three transactional functions (i.e., external inputs, external outputs, and external inquiries) [9]. Table 1 shows the number of projects obtained by applying the first two criteria mentioned for this study. The ISBSG rates data quality and function point quality from "A" to "D".
Our study considered only "A" and "B" software projects, since they are suitable for statistical analysis. As for functional sizing methods, the ISBSG reports several ways to measure functional size, such as COSMIC, Dreger, Feature Points, FiSMA, Fuzzy Logic, Gartner FFP, IFPUG, Lines of code, Mark II, and NESMA [3].
Since pre-IFPUG V4 projects should not be mixed with V4 and post-V4 projects, we selected only projects whose count approach was IFPUG V4+. NESMA was also considered, since it can be mixed with IFPUG V4+. In total, 3521 of the 3986 projects of Table 1 were excluded for having empty values in any of the following fields: development platform (1719 projects), max team size (1255), speed of delivery (278), and language type (269).
With the goal of proposing a model for larger projects, of the remaining 465 projects only those having a value greater than or equal to three for both speed of delivery and max team size were selected for our study (a total of 65 projects were excluded).
In accordance with the ISBSG, projects are classified by (1) development platform, based on the operating system used: personal computer (PC), mid-range (MR), mainframe (MF), or multi-platform (Multi); (2) programming language generation: second (2GL), third (3GL), or fourth (4GL) generation, or application generator (ApG); and (3) relative size measured in UFP: XS: ≥10 and <30, S: ≥30 and <100, M1: ≥100 and <300, M2: ≥300 and <1000, L: ≥1000 and <3000. Table 2 shows the 400 enhancement projects classified by development platform. Since our objective is to propose a model with better generalization, those data sets having fewer than 30 enhancement projects in Table 2 were excluded. Thus, seven data sets were used in our study for training and testing the models.
A regression analysis of the seven data sets was performed. Scatter plots were generated by correlating DS with UFP, and DS with MTS. The fourteen resulting scatter plots showed skewness (fewer large projects than small projects), heteroscedasticity (the variability of DS increased with either UFP or MTS), and outliers (extremely large data values). Given these three features, each data set was normalized by applying the natural logarithm (ln) [20].
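The selection pipeline described above can be expressed as a short filtering script. This is a sketch only: the column names below (`DevelopmentType`, `DataQuality`, `CountApproach`, `MaxTeamSize`, `SpeedOfDelivery`, and so on) are our assumptions about an ISBSG-style extract, not the repository's actual field names.

```python
import numpy as np
import pandas as pd

def select_enhancement_projects(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the study's selection criteria to an ISBSG-style extract."""
    out = df[df["DevelopmentType"] == "Enhancement"]
    out = out[out["DataQuality"].isin(["A", "B"])]              # quality "A"/"B" only
    out = out[out["CountApproach"].isin(["IFPUG 4+", "NESMA"])]  # IFPUG V4+ / NESMA
    # Exclude projects with empty values in the fields used by the study
    needed = ["DevelopmentPlatform", "MaxTeamSize", "SpeedOfDelivery", "LanguageType"]
    out = out.dropna(subset=needed)
    # Keep larger projects: speed of delivery and max team size >= 3
    out = out[(out["SpeedOfDelivery"] >= 3) & (out["MaxTeamSize"] >= 3)].copy()
    # ln-normalize the skewed, heteroscedastic variables
    for col in ["SpeedOfDelivery", "UFP", "MaxTeamSize"]:
        out[f"ln_{col}"] = np.log(out[col].astype(float))
    return out
```

Each filter corresponds to one of the exclusion steps reported for Table 1 and the 465-project subset.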
The ANOVA p-values for the seven MLR equations were equal to 0.000; that is, there was a statistically significant relationship between the variables at the 99% confidence level for the seven equations shown in Table 3. Table 3 also shows the coefficient of determination (r²) by MLR. This coefficient indicates the proportion of the variance in DS that is explained by the independent variables (i.e., UFP and MTS). In accordance with the r² values of Table 3, the two independent variables explained more than 59% of the variance in DS in the seven data sets.
Prior to displaying the results, some of the notation is summarized in Table 4.

Experimental Results
A leave-one-out cross-validation (LOOCV) method was applied to train and test the MLR, MLP, the two types of SVR, FR, and the SMFC model, because LOOCV avoids a nondeterministic selection of the training and testing sets. The prediction performance of the models was calculated from absolute residuals (ARs), since ARs are an unbiased measure [9].
For each project i, with i ∈ {1, . . . , N}, the AR_i is obtained as follows:

AR_i = |DS_i − DS_i^pred|

where DS_i^pred is the predicted delivery speed and DS_i is the actual delivery speed for project i. The mean of the ARs (MAR) of the N software enhancement projects was obtained as follows:

MAR = (1/N) · Σ_{i=1}^{N} AR_i

The median of all the AR_i is represented as MdAR. The performance of a prediction model is inversely proportional to its MAR and MdAR.
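The LOOCV procedure with AR, MAR, and MdAR can be sketched generically; here `fit` and `predict` are placeholders standing in for any of the compared models.

```python
import statistics

def loocv(fit, predict, X, y):
    """Leave-one-out cross-validation: each project serves once as the test set.

    Returns (MAR, MdAR) computed from the absolute residuals
    AR_i = |DS_i - DS_i^pred|."""
    ars = []
    for i in range(len(y)):
        train_X = [x for j, x in enumerate(X) if j != i]   # N-1 training projects
        train_y = [v for j, v in enumerate(y) if j != i]
        model = fit(train_X, train_y)                      # train without project i
        ds_pred = predict(model, X[i])                     # predict the held-out project
        ars.append(abs(y[i] - ds_pred))                    # absolute residual AR_i
    return statistics.mean(ars), statistics.median(ars)    # MAR and MdAR
```

For instance, a trivial "model" that always predicts the training mean can be evaluated by passing `fit=lambda xs, ys: statistics.mean(ys)` and `predict=lambda model, x: model`.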
In addition, the standardized accuracy (SA) and effect size (∆) performance measures were used. SA examines whether the prediction model generates predictions better than random guessing, whereas ∆ examines whether the predictions are produced by chance. The value of ∆ is recommended to be larger than or equal to 0.5.
SA and ∆ are calculated as follows [38]:

SA = (1 − MAR / MAR_P0) · 100

where MAR_P0 is the mean MAR value of a large number, typically 1000, of runs of random guessing.

∆ = (MAR − MAR_P0) / S_P0

where S_P0 is the sample standard deviation of the random guessing strategy.
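The two measures can be computed as follows. The random-guessing baseline P0 predicts each project with the actual value of another, randomly drawn, project, and 1000 runs are used as indicated above; the helper name and seed handling are ours.

```python
import random
import statistics

def sa_and_effect_size(actual, predicted, runs=1000, seed=0):
    """Standardized accuracy SA and effect size Delta versus random guessing."""
    rng = random.Random(seed)
    n = len(actual)
    mar = statistics.mean(abs(a - p) for a, p in zip(actual, predicted))
    # Random-guessing baseline P0: predict project i with the actual
    # value of another, randomly chosen, project j != i.
    mars_p0 = []
    for _ in range(runs):
        guesses = [actual[rng.choice([k for k in range(n) if k != i])]
                   for i in range(n)]
        mars_p0.append(statistics.mean(abs(a - g) for a, g in zip(actual, guesses)))
    mar_p0 = statistics.mean(mars_p0)       # MAR_P0: mean over the runs
    s_p0 = statistics.stdev(mars_p0)        # S_P0: sample standard deviation
    sa = (1.0 - mar / mar_p0) * 100.0       # SA: 100 = perfect, <= 0 = no better than chance
    delta = (mar - mar_p0) / s_p0           # Delta: magnitude >= 0.5 recommended
    return sa, delta
```

For a perfect predictor, MAR is zero, so SA is exactly 100 and ∆ is strongly negative (its magnitude far exceeding 0.5).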
As for the MLR model, a LOOCV was performed for each data set of Table 3, and an MAR was calculated by data set. Thus, totals of 125, 35, 36, 78, 29, 55, and 30 MLR equations of the type ln(DS) = a + b · ln(UFP) + c · ln(MTS) (31) were generated for the respective data sets. The number of neurons in the hidden layer of the MLP, as well as the kernel of each type of SVR, was changed until the best MAR was obtained. Table 5 contains the final values for the MLP and SVR having the best prediction performance.
The mathematical expression for each SVR kernel of Table 5 is the following, where x, y are data patterns:
• Radial basis function: K(x, y) = e^(−γ|x−y|²), where the γ parameter controls the spread of the radial basis function;
• Linear: K(x, y) = x · y;
• Polynomial: K(x, y) = (γ(x · y) + c₀)^d, where γ is a slope parameter, c₀ is a trade-off between the major and minor terms of the generated polynomials, and d is the polynomial degree.
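For scalar patterns, the three kernels can be written directly (for vector patterns, the products become dot products and |x − y| a Euclidean norm):

```python
import math

def rbf_kernel(x, y, gamma):
    # Radial basis function: spread controlled by gamma
    return math.exp(-gamma * abs(x - y) ** 2)

def linear_kernel(x, y):
    # Linear kernel: plain product of the patterns
    return x * y

def poly_kernel(x, y, gamma, c0, d):
    # gamma: slope parameter, c0: trade-off term, d: polynomial degree
    return (gamma * (x * y) + c0) ** d
```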
Regarding the training and testing of SMFC, the LOOCV process consisted of simulated annealing finding, for each iteration, the transformation coefficients a and b of Expression (20) that minimize the training error. Those same coefficients are then used to predict the DS of the test project, and the absolute difference between the predicted and actual DS of the project currently serving as test pattern is saved. This is repeated until all projects have acted as test patterns, and the mean of the errors is reported. Table 6 shows the MAR and MdAR obtained when SMFC and the other four models were applied to the seven data sets included in Table 3. Since results should be reported on the basis of statistical significance [20], the selection of a statistical test to compare the prediction performance of SMFC with that of each of the other four models (MLR, MLP, SVR, and FR) was based on the number of data sets to be compared, data dependence, and data distribution.
Since the five models were applied to the same enhancement projects, the data are dependent; therefore, for each pair of data sets an additional data set of differences was compiled (ARs from the SMFC and ARs from each other model).
If this additional data set was normally distributed after applying the Chi-squared (χ²), Shapiro-Wilk, skewness, and kurtosis statistical tests, a parametric paired t-test (based on means) was applied.
Otherwise, a non-parametric Wilcoxon test (based on medians) was applied to statistically compare the performance of SMFC with that of each other model. Both the Wilcoxon and the paired t-test are statistical tests used when two data sets are compared. Table 7 contains the p-value by data set after applying the corresponding normality statistical test to the data set of differences of absolute residuals (ARs) between the search method based on feature construction (SMFC) and each model (ID: insufficient data; the χ² test was not performed on some data sets, since it requires at least thirty data points). Table 8 contains the p-value by data set after applying the Wilcoxon or paired t-test between SMFC and the MLR, MLP, SVR, and FR models. One of the most important results of this article, showing the superiority of the proposed model, is that, in accordance with the prediction performance values of Table 6 and the p-values of Table 8, SMFC was statistically better than MLR in four data sets at the 95% confidence level (the data sets containing 36, 78, 29, and 30 projects).
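The decision procedure above (normality check on the paired differences, then paired t-test or Wilcoxon) can be sketched with SciPy. As a simplification, only the Shapiro-Wilk test is used as the normality gate here, whereas the study applies four normality tests.

```python
import numpy as np
from scipy import stats

def compare_models(ar_smfc, ar_other, alpha=0.05):
    """Paired comparison of absolute residuals between SMFC and another model."""
    diff = np.asarray(ar_smfc, float) - np.asarray(ar_other, float)
    # Normality gate on the paired differences (Shapiro-Wilk only, as a sketch)
    normal = stats.shapiro(diff).pvalue > alpha
    if normal:
        _, p = stats.ttest_rel(ar_smfc, ar_other)   # parametric, based on means
    else:
        _, p = stats.wilcoxon(ar_smfc, ar_other)    # non-parametric, based on medians
    return p, normal
```

A p-value below 0.05 then indicates a statistically significant performance difference at the 95% confidence level.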

Furthermore, it is important to emphasize that in all other cases, the methods with which our proposal is compared are not statistically better than SMFC. In those cases of Table 8 where there was a statistical difference between SMFC and FR, it was always in favor of SMFC.

Discussion, Conclusion, and Future Work
Although software development has evolved into a high-paced business tackling the challenges of demanding contexts, such as the need for short development cycles and fast time-to-market [25], several organizations still show an unclear relationship between software business and software development [1].
Our article proposed a model named SMFC to predict a type of productivity in software organizations (i.e., software enhancement delivery speed). The SMFC prediction performance was compared to those of MLR, MLP, two types of SVR (i.e., ε-SVR and ν-SVR), and FR. They were trained and tested using seven data sets obtained from the ISBSG, a repository widely used in the software prediction field [7,39]. These data sets were selected based upon their data quality, sizing method, development platform, and programming language generation, as suggested in the guidelines of the ISBSG Release May 2017.
Many authors have taken on the task of estimating the computational complexity of the algorithms against which our proposal was compared in the experiments of Section 6. In this short discussion of complexity, we assume that the training set consists of N projects and that each project consists of p features, where for the particular case of the project sets included in this paper the value of p is fixed at 2. In addition, for all the algorithms we have taken the state-of-the-art estimates of the worst-case run-time complexity.
Under these assumptions, the complexity of the SVR algorithm in its learning phase is O(N²p + N³), while a prediction is made in O(p·n_sv), where n_sv is the number of support vectors [40,41]. For this paper, the expressions for the complexity of the SVR algorithm reduce to O(2N² + N³) and O(2n_sv), respectively.
The case of the MLP is remarkable, because the complexity of the algorithm, both for learning and for prediction, depends on the topology of the neural network in addition to the implementation. Some authors have estimated the complexity of the learning phase as O(N·p·H·e·E), where H is the number of hidden neurons, E is the number of output values, and e is the number of epochs [42]. Depending on the problem and the number of layers in the network topology, the values of E and e can be very large. In many MLP applications, the magnitude of these values makes the execution time of the MLP learning phase very long; sometimes this process takes weeks and even months to run, even on high-performance computer equipment [43]. Additionally, it must be emphasized that the described situation corresponds to the best case, in which the network topology allows convergence; otherwise, the learning never ends because the network does not converge to a valid result.
Due to the fixed value of p, in this paper the expression for the learning complexity of the MLP algorithm reduces to O(2N·H·e·E). Assuming that convergence of the MLP is achieved, a prediction is made with a complexity estimated as O(s₁·s₂ + s₂·s₃ + . . . + s_{k−1}·s_k), where k is the number of layers and s_j is the size of the j-th layer.
From the perspective of computational complexity, the MLR algorithm is more suitable than the SVR and MLP algorithms when the number of features is small [28]. Given that the complexity of the learning phase of the MLR algorithm is O(p²N + p³), for the experimental data of this paper this complexity reduces to O(4N + 8), thus making the learning complexity of the MLR algorithm linear with respect to the number of projects N.
Let us now analyze the computational complexity of our proposal, the SMFC, considering that in the learning phase the algorithm consists of three parts (MML transformations, SLR, and simulated annealing), and that the prediction complexity for a test pattern is constant.
Since there is only one feature in the SLR (an advantage achieved as a consequence of the MML transformations), the complexity of the SLR as part of the SMFC is less than that of the MLR. Taking into account that the complexity of the MML transformations is linear, as established when analyzing Expression (20) in Section 4.4, the combined complexity of these two parts is less than that of the MLR and the FR. Clearly, the complexity of the first two parts of the SMFC is also far less than the complexities of the two remaining models, SVR and MLP.
Regarding the third part of the SMFC, which consists of applying a metaheuristic search to optimize the parameters of the SLR model, estimating the complexity is not so straightforward. In this paper, we selected simulated annealing to optimize the parameters of the SLR model [44]. However, authors of the state of the art in metaheuristics agree that the complexity of this algorithm depends entirely on the problem to be solved; therefore, it is not possible to give a generic expression for its complexity.
The spectrum of computational complexities exhibited by algorithms in which simulated annealing is used to solve problems is very broad; the complexity depends entirely on the area of application and the problem being tackled. To illustrate this diversity: when an efficient version of the simulated annealing method is applied to a variant of the bin-packing problem, the computational complexity of the method is linear in the input size [45]. When simulated annealing is applied to the problem of finding a maximum-cardinality matching in a graph, it has been shown for arbitrary graphs that a degenerate form of the basic annealing algorithm produces matchings of nearly maximum cardinality in polynomial average time [46]. Additionally, for computing the volume of a convex body in R^n, a variant of simulated annealing has complexity O(n⁴), where n is the dimension of the hyperspace in which the convex body is immersed [47].
As regards the problem addressed in this paper, a remarkable fact emerges when analyzing the execution times. When predicting the delivery speed of software enhancement projects, the complexity estimate is largely irrelevant, because project sets in software engineering are typically small. In the implementation designed to carry out all the experiments of Section 6, the MLP algorithm took less than two seconds, and the SVR algorithm less than a second, to complete both phases. Our proposal, SMFC, and the regression algorithms (both MLR and FR) took less than half a second. In other words, the times are extremely small and the differences minimal, so in this case it is not productive to weigh the complexities of the algorithms.
After a statistical comparison of the prediction performance of SMFC, MLR, MLP, SVR, and FR on the seven data sets (Tables 6 and 8), we can accept, for four of the seven data sets, the following hypothesis derived from that formulated in the introduction: H1: The delivery speed prediction performance of software enhancement projects with SMFC is statistically better than the performance obtained with MLR when the UFP and the number of practitioners are used as the independent variables.
We conclude that SMFC can be used for predicting the DS of the following types of software enhancement projects: • Mid-range, and coded in 4GL; • Multi-platform, and coded in 3GL;