1. Introduction
One of the most complex and critical tasks faced by a project manager during software development is estimating the total effort and duration needed to meet the initial requirements. It is considered one of the major challenges in software engineering [1], as a more accurate estimation increases the chances of success in software project development, completion, and delivery within the specified budget and schedule.
The diversity of software projects has led to the use of many techniques for effort and duration estimation. To support project managers in their tasks, various algorithms (including those based on artificial intelligence [2]) have been used to increase the accuracy of software development effort and duration estimation. Using a dataset to build predictive models is essential for accurately estimating the effort and duration required in software engineering projects [3]. Currently, there is a wide range of datasets available, such as the Albrecht, COCOMO81, China, Desharnais, ISBSG, Kemerer, Kitchenham, Maxwell, Miyazaki, NASA, and Tukutuku datasets. Deep learning techniques [4] typically perform well on relatively large-scale datasets and have demonstrated strong capabilities in estimating target variables in classification and predictive modeling tasks. Since the size of the datasets listed above is relatively small, this study investigates the efficiency of different types of traditional artificial neural networks [5], as well as hybrid artificial neural network architectures [6] obtained by combining them with Random Forests [7], when applied to such datasets.
To support full comprehension of this research, the article is organized as follows:
Section 1 clarifies the motivation that led to the choice of the research topic.
Section 2 summarizes the current status and evolution of software development effort and duration estimation.
Section 3 includes the rationale for selecting the four datasets used, as well as a description of their structure.
Section 4 describes, in detail, the approach adopted for estimating the effort and duration associated with software development, adapted to the three categories of existing datasets: small-sized, medium-sized, and large-sized.
Section 5 analyzes the results obtained by the artificial neural networks used (traditional and hybrid, in combination with Random Forests) after the parameter tuning process. At the same time, a new hybrid architecture, referred to as FractalNN_RF, is introduced, obtained by combining a Fractal Neural Network with the Random Forests algorithm. By integrating these two paradigms, the resulting model is expected to improve the accuracy of the estimates, increase the stability of the predictions, and provide superior generalization in contexts characterized by structural complexity, especially for medium-sized datasets.
Section 6 compares the implemented architectures, based on five selected metrics, and identifies which architecture is optimal for each type of dataset.
Section 7 summarizes the relevant conclusions, highlighting the implications of applying the proposed intelligent methods to datasets of different sizes.
2. Literature Survey
At present, numerous studies are dedicated to the estimation of software development effort and duration, each with its own strengths and limitations. Over time, various methods have been applied in these studies, including those from statistics [8], graph theory [9], heuristic approaches [10], fuzzy logic [11,12], evolutionary computation [13], machine learning [14], and artificial neural networks [15]. These studies rely either on public datasets or on private datasets belonging to specific organizations. The choice of dataset type significantly influences both the accuracy and the applicability of the resulting estimations. Given the considerable number of studies dedicated to estimating the effort, duration, and cost of software development, the specialized literature includes several articles offering comparative analyses of the results obtained so far.
The research presented in [16] provides a comprehensive analysis of contemporary trends in the field of software effort estimation, with the objective of grounding future research directions. The paper presents a detailed comparison of relevant contributions, organized in reverse chronological order, highlighting the techniques used, the metrics applied, the reported methodological limitations, as well as the main conclusions drawn by various authors. Overall, the analyzed literature reveals the continuous evolution and significant diversification of approaches in software effort estimation.
Study [17] investigates the application of machine learning techniques in estimating the effort required for software development, with a particular focus on the benefits brought by ensemble methods. The research initially identified 558 relevant papers in the field, from which, after a rigorous selection process based on quality criteria, 40 articles were retained for in-depth analysis. The study's conclusions highlight that the integration of ensemble techniques, in both supervised and unsupervised learning, significantly contributes to improving the accuracy of software effort estimations.
The systematic review conducted in study [18] explores the use of ensemble learning techniques and other artificial-intelligence-based strategies in estimating the effort required for software projects. The review focuses on modern methods involving machine learning, neural networks, and large language models, with the primary goal of improving estimation accuracy. Through extensive research conducted in major scientific databases (ACM Digital Library, IEEE Xplore, ScienceDirect, and Scopus), 826 empirical and theoretical studies were identified, 66 of which were selected for detailed analysis. The findings highlight that machine-learning-based methods have become dominant, with most of the analyzed studies confirming their substantial contribution to increasing estimation accuracy and optimizing software project management. In contrast, the use of non-machine-learning artificial intelligence techniques, such as Bayesian networks, remains limited, and the adoption of large language models is still in its early stages of development and application.
The emergence of modern machine learning (ML) techniques and, more recently, of automated machine learning (AutoML) has brought significant transformations to the field of software development effort estimation, contributing to increased accessibility, efficiency, and accuracy in the estimation process. Study [19] presents a systematic literature review on the application of ML and AutoML in software effort estimation, highlighting the relevance of the topic, the methods used, the identified advantages, and the volume of existing research. The adopted methodology involved selecting and analyzing 43 articles published in the last decade, based on the techniques implemented, either conventional machine learning or AutoML. The review findings indicate that in most of the analyzed studies, researchers employed ML techniques for software effort estimation, while the application of AutoML remained limited, thus revealing considerable potential for future research in this area.
The aim of the study presented in [20] is to identify the most effective method for estimating the effort required in software development, using the Long Short-Term Memory (LSTM) and Stacked Long Short-Term Memory (Stacked_LSTM) machine learning algorithms. The study employs six datasets: China, Kitchenham, Kemerer, COCOMO81, Albrecht, and Desharnais. Additionally, it evaluates performance using three metrics: root mean squared error (RMSE), mean absolute error (MAE), and R-squared. The results indicate that the Stacked_LSTM algorithm provides the best performance across all metrics for the China, Kemerer, and Albrecht datasets. In contrast, the LSTM algorithm yielded better results for the Desharnais and Kitchenham datasets. For the China dataset, the performance of the Stacked_LSTM algorithm is demonstrated by the following evaluation metric values: 0.012 for MAE, 0.016 for RMSE, and 0.981 for R-squared. For the Desharnais dataset, the performance of the LSTM algorithm is demonstrated by the following evaluation metric values: 0.076 for MAE, 0.102 for RMSE, and 0.638 for R-squared. For the Kemerer dataset, the performance of the Stacked_LSTM algorithm is demonstrated by the following evaluation metric values: 0.170 for MAE, 0.301 for RMSE, and 0.336 for R-squared. These results demonstrate a high level of model accuracy and a strong ability to explain the variance in the data.
Study [21] presents an analysis of the use of machine learning techniques to improve software effort estimation, based on empirical datasets. Five public datasets were employed: ISBSG, NASA93, COCOMO, Maxwell, and Desharnais. The data were preprocessed by handling missing values and transforming categorical features. Four machine learning regression methods were evaluated: Linear Regression, Gradient Boosting, Random Forests, and Decision Tree. Additionally, correlation-based feature selection was applied to identify relevant feature subsets and reduce dimensionality. The comparative analysis focused on two key metrics, R-squared and root mean squared error, to evaluate prediction accuracy. The results show that the Linear Regression and Random Forests models significantly outperformed the other approaches for the effort estimation task when correlation-based feature selection is applied. The conclusions suggest that correlation-based feature selection can enhance machine learning models for software effort estimation.
Study [22] employed a deep learning model to estimate the effort required for software development. The data preprocessing stage involved cleaning, normalization, and the handling and imputation of missing values. For prediction modeling, an innovative network (Multilayer Perceptron-assisted Honey Bidirectional Gated Recurrent Feed Forward Network) was developed, supported by an adaptive optimization algorithm (A-HBa), which adjusted the model parameters to achieve superior performance. The datasets used include the Albrecht, China, Desharnais, Kemerer, Kitchenham, and COCOMO81 datasets. The evaluation, based on mean absolute error, reported values such as 0.0763 for the China dataset, 0.0737 for the Desharnais dataset, and 0.0754 for the Kemerer dataset.
Article [23] introduces the NIVIM model, a method for imputing missing values based on variational autoencoders (VAE) and synthetic data. By combining contextual and similarity-based information, the model generates an extended dataset (SDEE) and applies contextual imputation to improve data quality. NIVIM stands out for its broad applicability as a preprocessing technique and for its superior performance compared to the VAE, GAIN, kNN, and MICE methods. The proposed model brings statistically significant improvements across six benchmark datasets (ISBSG, Albrecht, COCOMO81, Desharnais, NASA, and UCP), achieving an average reduction in RMSE between 11.05% and 17.72%, and in MAE between 9.62% and 21.96%. For the Desharnais dataset, the performance of the NIVIM model is highlighted by the following evaluation metric values: MAE = 0.0699, RMSE = 0.1134, and CD = 0.6432.
Accurate effort and duration estimation in software development therefore remains one of the most challenging and widely debated issues in the field: it is essential for effective project management, yet its complexity makes it a particularly difficult subject of research.
3. Used Datasets
To accurately assess and estimate the effort and duration required for software product development, researchers in the field of software engineering rely on various datasets collected from real-world projects. Among the most well-known and frequently used datasets in software engineering are Albrecht, COCOMO81, China, Desharnais, ISBSG, Kemerer, Kitchenham, Maxwell, Miyazaki, NASA93, and Tukutuku; a detailed analysis of these datasets is presented in article [24].
Article [25] proposes a classification of datasets into three categories, based on the optimal spacing theorem formulated by Eubank [26]. According to this theorem, the quantile function of the density is divided into four intervals: Q1 (first quartile), Q2 (second quartile), Q3 (third quartile), and Q4 (fourth quartile). The first category corresponds to Q1, the second to Q2 and Q3, and the third to Q4. Based on this classification, an SEE (Software Engineering Estimation) dataset is considered small-sized if it includes at most 43 project instances, medium-sized if it contains between 44 and 146 instances, and large-sized if it contains 147 or more instances. In the present study, to accurately approximate software development effort and duration, four datasets were selected, one for each quartile, as follows: China (Q4), Desharnais (Q3), Kemerer (Q1), and Maxwell (Q2). The selection of these datasets was based on their relevance in the field of software engineering, the public availability of the data, the size of the datasets, and the diversity of the information included (including actual values for software development effort and duration), thus ensuring a solid foundation for the comparative analysis and validation of the proposed methods.
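For clarity, this classification rule can be expressed as a small helper function; the sketch below is purely illustrative and simply encodes the thresholds stated above.
```python
# Illustrative helper encoding the size classification described above
# (thresholds from the text: <= 43 small, 44-146 medium, >= 147 large).
def classify_see_dataset(num_projects: int) -> str:
    if num_projects <= 43:
        return "small-sized"
    if num_projects <= 146:
        return "medium-sized"
    return "large-sized"

# The four datasets used in this study
for name, size in [("China", 499), ("Desharnais", 81), ("Kemerer", 15), ("Maxwell", 62)]:
    print(name, classify_see_dataset(size))
```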
Table 1 presents both the number of projects analyzed in each dataset and the number of attributes used in this study. The last two columns of Table 1 include the units of measurement corresponding to the two output attributes. The unit of measurement used for the attribute representing the development duration of a software project is the calendar month.
For the China and Desharnais datasets, the effort required for software development is measured in person-hours, while for the Kemerer and Maxwell datasets, the unit of measurement for effort is person-months.
Each of the 499 projects included in the China dataset [27] contains a series of essential characteristics for the analysis and estimation of software development projects, represented by numerical values. In this study, fifteen attributes were used (thirteen as input data and two as output data). The meaning of these fifteen attributes, along with their numerical characteristics (minimum value, maximum value, mean, and standard deviation), is presented in Table 2.
The Desharnais dataset [28] contains information extracted from 81 completed software projects, including variables that describe the characteristics of the projects and the teams that developed them. The meaning and numerical characteristics of the ten attributes used in this study (eight input variables and two output variables) are presented in Table 3. Among the eight input variables, the last one, labeled Language, indicates the type of programming language used and is encoded as follows: 1 for first-generation programming languages (e.g., Assembly), 2 for third-generation programming languages (e.g., C++, Java), and 3 for fourth-generation programming languages (e.g., SQL, Oracle Forms).
The Kemerer dataset [29] is a classic dataset used in the estimation of software development effort and duration, built from the acquisition of seven characteristics collected from 15 real software projects. The meaning and numerical characteristics of the seven attributes used in this study (five input variables and two output variables) are presented in Table 4. This table provides details on the distribution of these attributes and their impact on the estimation of software development effort and duration.
Each project in the Maxwell dataset [30] includes a set of essential characteristics for the analysis and estimation of software development projects, represented by numerical values. The Maxwell dataset comprises a total of 62 distinct projects, each containing 26 attributes, of which 22 are independent and 4 are dependent. Out of the four dependent attributes, the following two were used in this study: the effort required to complete the project, measured in person-hours per month, and the total development duration, measured in months. Information about the 24 Maxwell dataset attributes used in this study is presented in Table 5.
With 499 projects, the China collection is considered a large-sized SEE dataset according to the classification provided in [25]. With 15 projects, the Kemerer collection is classified as small-sized according to the previously mentioned classification. Both the Desharnais dataset, which includes data from 81 projects, and the Maxwell dataset, containing data from 62 projects, are classified as medium-sized datasets.
5. Analysis of Implemented Artificial Neural Networks
In most artificial neural networks, parameters are essential variables used to learn the characteristics of the dataset and to adjust the learning process with the goal of achieving optimal performance. The parameter tuning procedure [48], aimed at identifying the ideal configuration for each neural network so that the predicted outcomes are as accurate and efficient as possible, was applied in this research to all the models under analysis.
5.1. Multilayer Perceptron
The MLP implementation was carried out using the MLPRegressor class from the sklearn.neural_network library, with multiple configurations tested based on different hyperparameter values. The MLP architecture used in this study is characterized by the following components:
The input layer contains a number of neurons automatically determined by the number of input attributes in the dataset: 13 neurons for the China dataset, 8 for the Desharnais dataset, 5 for the Kemerer dataset, and 22 for the Maxwell dataset.
The hidden layer consists of a variable number of neurons, ranging from 20 to 200, incremented in steps of 20.
The output layer includes two neurons, each corresponding to one of the two following estimated values: software development effort and duration.
The ReLU activation function is used for the hidden layer.
The model is trained using the Adam optimizer.
The number of epochs varies between 100 and 1000, in increments of 100.
In the parameter tuning process, 10 values were used for parameter e (the number of epochs) and 10 values for parameter n (the number of neurons in the hidden layer), so 100 configurations of the MLP neural network were trained. The performance of each configuration was evaluated using the five selected metrics. In Table A1, the third column presents the optimal values obtained for the five metrics applied to the 100 configurations of the MLP network. The fourth and fifth columns indicate the values of the hyperparameters corresponding to the configurations for which these optimal performances were achieved for each metric. Columns six, seven, and eight in Table A1 present information related to the estimated effort, while the last three columns provide details about the estimated duration, according to the MLP model for which the optimal values of the evaluation metrics were obtained. For the China dataset, three distinct MLP configurations were identified, each corresponding to the optimal values obtained for the five used metrics. It is noteworthy that the optimal values for RMSE, CD, and MSLE were produced by the same hyperparameter configuration. For the Desharnais dataset, two optimal MLP configurations were identified; one configuration yielded the lowest values for MAE and MdAE, while another led to the best results for RMSE, CD, and MSLE. For the Kemerer and Maxwell datasets, a single hyperparameter configuration simultaneously yielded optimal results across all five metrics, suggesting a higher degree of model stability and robustness in these particular contexts.
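For illustration, the grid search over the 100 MLP configurations described above can be sketched as follows; the variable names, the use of a single train/test split, and the selection by MAE alone are simplifying assumptions rather than the exact implementation used in this study.
```python
# Illustrative sketch of the MLP tuning loop described above.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error

def tune_mlp(X_train, y_train, X_test, y_test):
    best = None
    for n in range(20, 201, 20):          # neurons in the hidden layer
        for e in range(100, 1001, 100):   # training epochs (max_iter for the Adam solver)
            model = MLPRegressor(hidden_layer_sizes=(n,),
                                 activation="relu",
                                 solver="adam",
                                 max_iter=e,
                                 random_state=42)
            model.fit(X_train, y_train)   # y has two columns: effort and duration
            mae = mean_absolute_error(y_test, model.predict(X_test))
            if best is None or mae < best[0]:
                best = (mae, n, e)
    return best                            # (MAE, n, e) of the best configuration
```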
5.2. Deep Fully Connected Neural Network
DFCNN, a feedforward and fully connected network, was designed with an input layer, ten fully connected hidden layers, and an output layer to solve a multivariate regression problem with two continuous output variables. It was trained with the objective of identifying the best-performing combination of hyperparameters, specifically the number of epochs and the number of nodes, based on the values obtained for the evaluation metrics. The DFCNN architecture used in this study is characterized by the following components:
The input and output layers were designed with the same structure as in the MLP network.
Ten hidden layers were implemented using the Dense class from tensorflow.keras.layers, each using the ReLU activation function. The number of neurons in each hidden layer is consistent within a given configuration and varies across experimental runs, ranging from 20 to 200 neurons, in increments of 20.
The model is trained using the Adam optimizer algorithm, with MSE employed as the loss function.
The number of epochs varies between 100 and 1000, in increments of 100.
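A minimal sketch of this DFCNN architecture, assuming the tensorflow.keras functional API, is given below; apart from the optimizer, loss, and layer structure stated above, the details are illustrative.
```python
# Illustrative DFCNN builder: ten ReLU hidden layers, two linear outputs.
from tensorflow.keras import layers, models

def build_dfcnn(n_inputs: int, n_neurons: int):
    inp = layers.Input(shape=(n_inputs,))
    x = inp
    for _ in range(10):                            # ten fully connected hidden layers
        x = layers.Dense(n_neurons, activation="relu")(x)
    out = layers.Dense(2, activation="linear")(x)  # outputs: effort and duration
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")    # Adam optimizer, MSE loss (as above)
    return model

# Example (Desharnais: 8 input attributes, 100 neurons per hidden layer, 500 epochs):
# model = build_dfcnn(8, 100)
# model.fit(X_train, y_train, epochs=500, verbose=0)
```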
This deep architecture allows for a flexible representation of complex nonlinear relationships within the data, while the systematic hyperparameter tuning aims to identify robust configurations that generalize well across different software engineering datasets. Based on the hyperparameter tuning process, where 10 values were tested for the parameter e (number of training epochs) and 10 values for the parameter n (number of neurons in the hidden layers), a total of 100 distinct DFCNN configurations were trained. In Table A2, the third column reports the optimal values obtained for the five used metrics applied across these 100 configurations.
In the experiments conducted on the China, Desharnais, and Kemerer datasets, three distinct DFCNN configurations were identified, each leading to optimal values for the five analyzed performance metrics. It was observed that the same set of hyperparameters simultaneously yielded the best results for RMSE, CD, and MSLE, indicating the increased robustness of that particular configuration. For the Maxwell dataset, two optimal DFCNN configurations were identified. The first configuration resulted in the lowest values for MAE and MdAE, while the second configuration achieved superior results for RMSE, CD, and MSLE. These findings highlight the variability in model behavior depending on dataset characteristics, as well as the importance of selecting appropriate hyperparameter configurations to ensure optimal performance.
5.3. Fractal Neural Network
The application of Fractal Neural Networks in regression problems remains insufficiently explored, although in the past eight years several studies have used this type of neural network in classification tasks. A recent study [49], published this year, proposes a hybrid variant for time series forecasting; however, in the field of software engineering, these architectures have not yet been utilized.
The innovative FractalNN architecture proposed in this study combines a recursive fractal structure with 1D convolutional blocks, specific to convolutional neural networks [50], for addressing regression problems.
The proposed architecture is composed of the following elements:
An input layer whose dimensionality corresponds to the number of input attributes in the dataset (13 neurons for the China dataset, 8 for Desharnais, 5 for Kemerer, and 22 for Maxwell).
A fractal convolutional block, which is defined as a recursive structure controlled by a depth parameter (equal to four), generating two parallel branches at each level. The short branch applies a single Conv1D layer, while the long branch recursively applies two fractal blocks to the same input, illustrating self-similarity. The outputs of the two branches are merged via averaging, facilitating the integration of information across multiple scales. The convolutional layers were implemented using the Conv1D class from the tensorflow.keras.layers library [45].
Dense layers for regression, which are applied after the fractal convolutional block; the output is first flattened into a vector, then passed through a Dense layer with 64 neurons and ReLU activation, and finalized with a Dense layer with 2 neurons and linear activation, corresponding to a regression task with two continuous outputs.
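An illustrative reconstruction of the fractal block and regression head described above is sketched below. The kernel size, the reshaping of tabular inputs into a (features, 1) sequence, and the sequential application of the two sub-blocks on the long branch (following the standard FractalNet formulation) are interpretive assumptions, not the exact original implementation.
```python
# Illustrative FractalNN builder with a recursive fractal Conv1D block (depth 4).
from tensorflow.keras import layers, models

def fractal_block(x, filters, depth):
    short = layers.Conv1D(filters, kernel_size=3, padding="same", activation="relu")(x)
    if depth == 1:
        return short
    # Long branch: two fractal sub-blocks applied recursively (self-similarity).
    long = fractal_block(x, filters, depth - 1)
    long = fractal_block(long, filters, depth - 1)
    return layers.Average()([short, long])          # merge the two branches by averaging

def build_fractalnn(n_inputs, filters):
    inp = layers.Input(shape=(n_inputs,))
    x = layers.Reshape((n_inputs, 1))(inp)          # treat attributes as a 1D sequence
    x = fractal_block(x, filters, depth=4)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)
    out = layers.Dense(2, activation="linear")(x)   # outputs: effort and duration
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")
    return model
```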
As part of the hyperparameter tuning process, 10 values were evaluated for the number of training epochs (denoted e) and 6 values for the number of filters in Conv1D (denoted f), resulting in a total of 60 unique FractalNN configurations. The number of epochs varies between 100 and 1000, in increments of 100, while for the number of filters in Conv1D, six discrete values were tested: 8, 16, 32, 64, 128, and 256.
Table A3 presents, in its third column, the optimal values obtained for the five metrics applied across these configurations. In the experiments performed on the China and Desharnais datasets, three distinct FractalNN configurations were found to yield optimal results across the five performance metrics considered. It was observed that one particular set of hyperparameters simultaneously produced the best values for RMSE, CD, and MSLE, indicating a higher degree of robustness for that configuration.
In contrast, for the Kemerer and Maxwell datasets, two optimal FractalNN configurations were identified. The second configuration achieved the lowest MdAE value, while the first configuration delivered superior performance for MAE, RMSE, CD, and MSLE.
5.4. Kernel Extreme Learning Machine
The proposed KELM algorithm represents an extension of the traditional Extreme Learning Machine approach, in which the feature space is implicitly generated through the use of a kernel function. This strategy eliminates the need for explicitly optimizing the hidden layer weights and activations, thereby simplifying the training process. In the implemented architecture, the RBF (Radial Basis Function) kernel is employed to perform a nonlinear mapping of the input data, which enhances the separability of the data in the induced feature space. The main components of the KELM model are as follows:
RBF kernel Gram matrix captures the pairwise similarities between training samples in the transformed feature space, enabling nonlinear modeling through kernel-based methods.
Hyperparameter γ represents the coefficient of the radial basis function kernel and controls the spread or influence of the kernel function. Its value remains consistent within a given configuration and varies across experimental cycles. In this study, γ was tested over a range of values, as follows: 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, and 50.
Hyperparameter λ denotes the regularization coefficient, employed to stabilize the inversion of the Gram matrix in the presence of multicollinearity or noise. In this study, λ was evaluated across the following values: 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, and 500.
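The closed-form training of such a model can be sketched as follows; the exact regularization convention (K + λI) and the class interface are assumptions made for illustration.
```python
# Minimal KELM sketch: RBF Gram matrix plus regularized closed-form solution.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

class KELM:
    def __init__(self, gamma=0.1, lam=1.0):
        self.gamma, self.lam = gamma, lam

    def fit(self, X, Y):
        self.X_train = X
        K = rbf_kernel(X, X, gamma=self.gamma)                        # Gram matrix
        self.beta = np.linalg.solve(K + self.lam * np.eye(len(X)), Y)  # regularized inversion
        return self

    def predict(self, X):
        K_test = rbf_kernel(X, self.X_train, gamma=self.gamma)
        return K_test @ self.beta                                     # two outputs: effort, duration
```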
Within the hyperparameter optimization process of the KELM architecture (Table A4), a grid search was conducted over ten distinct values for the radial coefficient (denoted γ) and ten values for the regularization coefficient (denoted λ). This process yielded a total of 100 unique KELM configurations. The performance of each configuration was assessed based on five metrics, and the optimal values identified across these configurations are summarized in the third column of Table A4. Experimental evaluations conducted on the China and Desharnais datasets led to the identification of two optimal configurations of KELM. The first configuration yielded superior performance across four metrics (MAE, RMSE, CD, and MSLE), while the second configuration minimized MdAE. In the case of the Kemerer dataset, a single configuration of hyperparameters simultaneously produced optimal values for all five metrics, indicating increased model stability and robustness within this specific data context. Regarding the Maxwell dataset, three distinct KELM configurations were found to yield optimal results across the five metrics assessed. Notably, one of these configurations achieved the best values for RMSE, CD, and MSLE concurrently, suggesting a higher degree of reliability and generalization capacity for that particular hyperparameter setting.
5.5. Hybrid Artificial Neural Networks
To improve the performance metrics obtained for the four previously implemented artificial neural network architectures (MLP, DFCNN, FractalNN, and KELM), four corresponding hybrid neural network models were developed. These hybrid models combine each type of neural network with the Random Forests machine learning algorithm, aiming to capitalize on the complementary strengths of both approaches and enhance predictive accuracy. The hybrid neural networks were designed using a cascade architecture, in which an artificial neural network (such as MLP, DFCNN, FractalNN, or ELM) is employed for automatic feature extraction from the input data. Subsequently, these extracted features are fed into the RF regressor, which leverages its ensemble learning capabilities to produce the final prediction. This combined framework is intended to improve robustness, generalization, and overall model performance beyond what either method could achieve independently.
5.5.1. Multilayer Perceptron Combined with Random Forests
The proposed hybrid artificial neural network architecture follows a cascade structure, integrating an MLP with a decision-tree-based regressor (RF). This architectural combination (denoted as MLP_RF) aims to leverage the feature extraction capabilities of neural networks together with the robustness and generalization power of ensemble learning methods such as RF, particularly in the context of multivariate regression tasks.
The first component of the proposed architecture is an MLP network, consisting of an input layer adapted to the dimensionality of each dataset, followed by one fully connected hidden layer using the ReLU activation function. The final dense layer of the MLP produces a low-dimensional latent representation, which serves as a compressed and abstract feature vector derived from the input data. This latent vector is then used as input to the RF model (a code sketch of this cascade is provided after the hyperparameter list below).
The second component is a multi-output RF regressor, trained to simultaneously predict two output variables. The RF model operates on the latent features extracted by the MLP and is responsible for generating the final predictions. During the hyperparameter tuning process for the MLP_RF architecture, the following five key hyperparameters were optimized:
Number of hidden layer nodes in MLP network (denoted as n), which varied within the interval [100, 1000] with an increment step of 100.
Number of training epochs for MLP (denoted as e), explored within the range [50, 500] with a step size of 50.
Number of estimators (denoted as s), controlling the total number of trees generated by RF model, ranging from 80 to 800 with an increment of 80.
Maximum depth of the trees (denoted as d), influencing the complexity of each individual decision tree.
Random seed (denoted as r), used to control the randomness of the RF training process.
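A minimal sketch of the cascade described above is given below; the latent dimension, default hyperparameter values, and variable names are illustrative assumptions rather than the exact implementation used in this study.
```python
# Illustrative MLP_RF cascade: MLP feature extractor followed by a multi-output RF.
from tensorflow.keras import layers, models
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

def fit_mlp_rf(X_train, y_train, n=200, e=100, s=160, d=16, r=42):
    # MLP with one ReLU hidden layer (n neurons) ending in a low-dimensional latent layer.
    inp = layers.Input(shape=(X_train.shape[1],))
    h = layers.Dense(n, activation="relu")(inp)
    latent = layers.Dense(16, activation="relu", name="latent")(h)
    out = layers.Dense(2, activation="linear")(latent)
    mlp = models.Model(inp, out)
    mlp.compile(optimizer="adam", loss="mse")
    mlp.fit(X_train, y_train, epochs=e, verbose=0)

    # Extract latent features and train the multi-output RF regressor on them.
    extractor = models.Model(inp, latent)
    Z_train = extractor.predict(X_train, verbose=0)
    rf = MultiOutputRegressor(
        RandomForestRegressor(n_estimators=s, max_depth=d, random_state=r))
    rf.fit(Z_train, y_train)
    return extractor, rf   # predict with: rf.predict(extractor.predict(X_new))
```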
During the hyperparameter tuning process, multiple combinations of predefined values for the five important parameters of the MLP_RF architecture were explored. The number of neurons in the MLP hidden layer was varied using 10 values ranging from 100 to 1000, while the number of training epochs was adjusted across 10 values between 50 and 500. Additionally, 10 values between 80 and 800 were selected for the number of estimators in the RF model. For the maximum depth of the trees, the following five discrete values were tested: 4, 8, 16, 32, and 64. The random_state parameter was evaluated using four values (21, 42, 84, and 168) to ensure the reproducibility of the results. This configuration led to the generation of a large number of unique MLP_RF models, each evaluated based on five performance metrics applied to multivariate regression tasks.
Table A5 summarizes these results, highlighting, in the third column, the optimal values obtained for each metric, thus reflecting the superior performance of the corresponding configuration. In addition, Table A5 reports the estimated software effort and duration (minimum, maximum, and mean) for each experimental configuration.
The experimental evaluations conducted on the China and Maxwell datasets led to the identification of three distinct configurations of the MLP_RF hybrid architecture, each yielding optimal results with respect to the five analyzed performance metrics. Among these, one configuration stood out by simultaneously achieving the best values for RMSE, CD, and MSLE, suggesting increased reliability and a high generalization capability associated with the specific parameter values used in that configuration. The best performance was observed on the China dataset, indicating a strong correlation between predicted and actual values. For the Desharnais and Kemerer datasets, two optimal MLP_RF configurations were identified. The first configuration demonstrated superior performance in four out of the five metrics (MAE, RMSE, CD, and MSLE), while the second configuration proved effective in minimizing the MdAE value.
5.5.2. Deep Fully Connected Neural Network Combined with Random Forests
Another proposed hybrid artificial neural network architecture involves the integration of a DFCNN with an RF regressor, within a modular and flexible design. In this configuration (denoted as DFCNN_RF), the DFCNN serves as a latent feature extractor, generating abstract and informative representations of the input data. These features are subsequently used by the RF regressor to perform the prediction, leveraging the DFCNN's ability to learn complex representations and the robustness of RF in regression tasks. To enable the simultaneous prediction of two target variables, the model uses the Scikit-learn MultiOutputRegressor wrapper, adapting the RF regressor to a multi-output regression setting. The entire ensemble is implemented as a custom Scikit-learn estimator, inheriting functionalities from BaseEstimator and RegressorMixin, thereby ensuring compatibility with hyperparameter optimization procedures. During the hyperparameter tuning process for the hybrid DFCNN_RF architecture, the same five parameters previously used in the MLP_RF hybrid architecture were applied. The use of these common hyperparameters facilitates a fair comparison between the two architectures, allowing for an objective evaluation of performance under similar experimental conditions. The hyperparameter tuning process led to the generation of a significant number of unique DFCNN_RF models, each variant being evaluated based on five performance metrics corresponding to multivariate regression tasks. The results are summarized in Table A6, where the third column presents the optimal values associated with each used metric.
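For illustration, the custom estimator mentioned above can be sketched as follows, assuming tensorflow.keras for the network and scikit-learn for the wrapper; the choice of the last hidden layer as the latent representation and the default hyperparameter values are assumptions.
```python
# Sketch of a custom scikit-learn estimator wrapping the DFCNN_RF cascade.
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
from tensorflow.keras import layers, models

class DFCNNRF(BaseEstimator, RegressorMixin):
    """Cascade: DFCNN latent-feature extractor followed by a multi-output RF regressor."""
    def __init__(self, n=100, e=100, s=160, d=16, r=42):
        self.n, self.e, self.s, self.d, self.r = n, e, s, d, r

    def fit(self, X, y):
        inp = layers.Input(shape=(X.shape[1],))
        h = inp
        for _ in range(10):                                   # ten fully connected hidden layers
            h = layers.Dense(self.n, activation="relu")(h)
        out = layers.Dense(2, activation="linear")(h)
        net = models.Model(inp, out)
        net.compile(optimizer="adam", loss="mse")
        net.fit(X, y, epochs=self.e, verbose=0)
        self.extractor_ = models.Model(inp, h)                # last hidden layer as latent features
        self.rf_ = MultiOutputRegressor(RandomForestRegressor(
            n_estimators=self.s, max_depth=self.d, random_state=self.r))
        self.rf_.fit(self.extractor_.predict(X, verbose=0), y)
        return self

    def predict(self, X):
        return self.rf_.predict(self.extractor_.predict(X, verbose=0))
```
Because the wrapper follows the scikit-learn estimator interface, it can be passed directly to standard hyperparameter search utilities such as GridSearchCV.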
The experimental evaluations conducted on the China, Desharnais, and Maxwell datasets led to the identification of three distinct configurations of the hybrid DFCNN_RF architecture, each exhibiting optimal performance according to the five metrics used for multivariate regression tasks. One of these configurations stood out by simultaneously achieving the best values for RMSE, CD, and MSLE, indicating a high level of reliability and superior generalization capacity associated with the specific parameterization implemented.
Regarding the Kemerer dataset, two DFCNN_RF configurations with notable performance were identified. The first achieved superior results for four out of the five analyzed metrics (MAE, RMSE, CD, and MSLE), while the second was distinguished by its efficiency in minimizing MdAE.
5.5.3. Fractal Neural Network Combined with Random Forests
As part of addressing the multivariate regression problem, a new hybrid model named FractalNN_RF was developed, combining the feature extraction capabilities of the previously described FractalNN architecture with the robustness and generalization power of the RF regressor. This integration aims to leverage the strengths of both components to enhance prediction accuracy and stability. The proposed architecture is based on two main components. The FractalNN model acts as an extractor of latent features from the input data. The latent features extracted by the network are then used as input for the second component, an RF regressor trained in multi-output mode. The novelty of this hybrid model lies in the synergistic integration of a fractal architecture, capable of capturing multiscale patterns through recursive self-similarity, with the robustness and generalization strength of Random Forests. This combination was specifically motivated by the need to address the challenges of limited-size datasets, where complex nonlinear feature extraction must be balanced with stability against noise and variability. Together, these strengths provide a balanced model that enhances both accuracy and generalization in software effort and duration prediction.
During the hyperparameter optimization phase for the hybrid FractalNN_RF model, four out of the five hyperparameters previously used in the MLP_RF and DFCNN_RF hybrid architectures were retained, with the same values applied. Among these, one pertains to the neural network component, the number of training epochs, while the remaining three are associated with the RF regressor: the number of estimators, the maximum tree depth, and the random seed values. The fifth hyperparameter, specific to the FractalNN model, is the number of filters in the Conv1D layer (denoted as f), for which six values from the discrete set {8, 16, 32, 64, 128, 256} were evaluated. The tuning process led to the development of a substantial number of unique FractalNN_RF configurations, each variant being assessed according to five relevant metrics for multivariate regression tasks. The evaluation results are summarized in Table A7, where the third column highlights the optimal values associated with each metric used.
Training on the China, Desharnais, and Kemerer datasets led to the identification of three distinct configurations of the FractalNN_RF hybrid architecture, each demonstrating optimal performance across the five metrics used to evaluate multivariate regression tasks. For the China dataset, one configuration achieved the lowest values for MAE and MSLE, another excelled in terms of RMSE and CD, while a third stood out by minimizing MdAE. For the Desharnais and Kemerer datasets, one configuration clearly distinguished itself by simultaneously delivering top performance in RMSE, CD, and MSLE. Regarding the Maxwell dataset, two high-performing FractalNN_RF configurations were identified; the first showed superior results across four of the five metrics (MAE, RMSE, CD, and MSLE), while the second was notable for its efficiency in reducing MdAE.
5.5.4. Extreme Learning Machine Combined with Random Forests
The last proposed model employs a hybrid architecture (ELM_RF) that combines ELM for nonlinear feature extraction with an RF regressor responsible for predicting the target variables. The objective of this approach is to capture complex relationships within the data by projecting them into a latent feature space, followed by a robust and interpretable regression stage. The process begins with training the ELM model, where the number of neurons in the hidden layer is varied as a key parameter. This stage produces a latent representation of the data through a nonlinear transformation. The extracted features are then used as input for RF, which is trained to predict the target vector. The RF regressor is chosen for its robustness to noise and is fine-tuned by varying three hyperparameters: the number of estimators, the maximum tree depth, and the random seed value.
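The idea can be illustrated with the following minimal sketch, in which the ELM hidden layer is represented by a fixed random nonlinear projection; the activation function, the use of MultiOutputRegressor, and the parameter names are assumptions made for illustration.
```python
# Illustrative ELM_RF cascade: random nonlinear projection (ELM hidden layer) + multi-output RF.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

def fit_elm_rf(X_train, y_train, n_hidden=300, s=160, d=16, r=42):
    rng = np.random.default_rng(r)
    W = rng.normal(size=(X_train.shape[1], n_hidden))   # random input weights (not trained)
    b = rng.normal(size=n_hidden)                        # random biases

    def elm_features(X):
        return np.tanh(X @ W + b)                        # nonlinear latent feature space

    rf = MultiOutputRegressor(RandomForestRegressor(
        n_estimators=s, max_depth=d, random_state=r))
    rf.fit(elm_features(X_train), y_train)               # RF predicts effort and duration
    return elm_features, rf   # predict with: rf.predict(elm_features(X_new))
```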
During the hyperparameter tuning process, multiple combinations of predefined values were investigated for the four previously specified parameters of the hybrid ELM_RF architecture. The number of neurons in the hidden layer of the ELM model was varied using ten values ranging from 100 to 1000, in increments of 100 (Table A8). Likewise, for the RF model, 10 values were selected for the number of estimators, ranging from 80 to 800, with a step size of 80. The maximum tree depth was tested using the following five discrete levels: 4, 8, 16, 32, and 64. The random_state parameter, used to control randomness and ensure reproducibility, was evaluated with the following four values: 21, 42, 84, and 168. This strategy for exploring the hyperparameter space led to the generation of a significant number of distinct ELM_RF configurations, each of which was assessed based on five multivariate regression performance metrics. The results of these experiments are summarized in Table A8, where the third column presents the optimal values obtained for each metric, thus highlighting the superior performance of the corresponding configuration.
Training the ELM_RF model on the China and Kemerer datasets led to the identification of two distinct configurations of the hybrid ELM_RF architecture, each achieving optimal performance across the five metrics used to evaluate multivariate regression tasks. For the China dataset, two high-performing configurations were identified; the first demonstrated superior results for four out of the five metrics (MAE, RMSE, CD, and MSLE), while the second stood out for its effectiveness in reducing the MdAE value. In the case of the Kemerer dataset, one configuration achieved the lowest values for MAE and MdAE, whereas another configuration excelled in terms of RMSE, CD, and MSLE. For the Desharnais dataset, a single configuration clearly stood out by simultaneously delivering top-level performance in RMSE, CD, and MSLE. Regarding the Maxwell dataset, a single high-performing ELM_RF configuration was identified, showing superior results across all five evaluated metrics.
6. Comparative Analysis of Implemented Artificial Neural Networks
To determine the most suitable estimation model based on the dataset size (small-sized, medium-sized, and large-sized), four types of artificial neural networks and four types of hybrid neural networks were compared, using the values of five evaluation metrics: MAE, MdAE, RMSE, CD, and MSLE.
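For reference, the five metrics can be computed with standard scikit-learn functions, as sketched below (CD denotes the coefficient of determination, i.e., R-squared); variable names are illustrative.
```python
# Illustrative computation of the five evaluation metrics used in this comparison.
import numpy as np
from sklearn.metrics import (mean_absolute_error, median_absolute_error,
                             mean_squared_error, r2_score, mean_squared_log_error)

def evaluate(y_true, y_pred):
    return {
        "MAE":  mean_absolute_error(y_true, y_pred),
        "MdAE": median_absolute_error(y_true, y_pred),
        "RMSE": np.sqrt(mean_squared_error(y_true, y_pred)),
        "CD":   r2_score(y_true, y_pred),
        "MSLE": mean_squared_log_error(y_true, y_pred),  # requires non-negative targets/predictions
    }
```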
Table 8 highlights the optimal values of the metrics used for the eight prediction methods applied to the four datasets: China (large-sized), Desharnais (medium-sized), Kemerer (small-sized), and Maxwell (medium-sized). The results obtained are compared both across the proposed models and with the values reported in previous studies.
For the China dataset, the best performance, reflected by the minimum values of the MAE, MdAE, RMSE, and MSLE metrics, as well as the maximum value of the CD, was achieved by the ELM_RF model, demonstrating its superior predictive capability. The MAE (0.0046) and RMSE (0.0137) values indicate a high level of accuracy, with very low prediction errors. Although the RMSE is slightly higher than the MAE, suggesting a few instances of more pronounced errors, these remain limited overall. The minimum MdAE value (0.0013) confirms that model performance is not significantly affected by abrupt variations. Additionally, the very low MSLE value (0.0001) reflects an extremely small logarithmic error, which is a strong indicator of the model's robustness. The CD, with a value of 0.9834, indicates an excellent fit between the predicted and actual values.
The ELM_RF architecture, which combines the ELM and RF models, leverages both the generalization capability of ELM and the robustness of RF to data variability. Therefore, this method may be considered a suitable choice for large-sized datasets, such as the China dataset.
For the Desharnais and Maxwell datasets, the innovative FractalNN_RF architecture achieves the best performance across all evaluated metrics. For the Desharnais dataset, the MdAE value (0.0236), being lower than the MAE (0.0573), indicates that most prediction errors are small, although the average is influenced by a few larger deviations. The RMSE (0.0777), while still low, is slightly higher than the MAE, confirming the presence of a few isolated cases with more significant errors. The CD value of 0.7135 reflects a good, though not perfect, fit between the predicted and actual values, suggesting room for improvement in capturing data relationships, an aspect that is typically challenging to optimize given the medium size of the Desharnais dataset. The MSLE value (0.0036) confirms that the proportional errors in the predictions are very small. For the Maxwell dataset, the MAE value (0.0957) is relatively moderate, while the lower MdAE value (0.0328) indicates that the majority of predictions are accurate, with a few outliers increasing the overall mean error. The RMSE (0.1629), being higher than the MAE, further confirms the presence of certain predictions with notable deviations. The CD, with a value of 0.5320, suggests that the model does not fully capture the underlying relationships within the data, a limitation that is often challenging to address in the context of medium-sized datasets. The MSLE value (0.0118) indicates low errors in the logarithmic space, reflecting good proportional prediction performance by the model.
The hybrid FractalNN_RF architecture, which combines a Fractal Neural Network with the Random Forests algorithm, represents an advanced approach that merges deep learning capabilities with the robustness of ensemble techniques. It is ideal for medium-sized datasets, such as the Desharnais and Maxwell datasets, where complex nonlinear relationships and data variations can significantly impact model performance.
For the Kemerer dataset, the MLP architecture achieves the best performance across all five metrics, while the hybrid networks prove to be ineffective on this small-sized dataset. The MAE value (0.0173) highlights a high level of overall predictive accuracy, making it suitable for applications with strict precision requirements. The MdAE (0.0202), being slightly higher than the MAE, suggests a relatively uniform distribution of errors, without significant outliers. The RMSE (0.0203), close to the MAE and nearly identical to the MdAE, indicates that there are no large errors distorting the average, thus confirming the consistency and balance of the model predictions. The CD (0.9274) represents an excellent score, suggesting that the model appropriately captures the relevant relationships between variables, even in the context of a small-sized dataset. The MSLE (0.0003), with an extremely low value, reflects high proportional accuracy. Moreover, the comparative analysis of the CD values obtained for the eight proposed models reveals that the hybrid networks generally exhibit inferior performance compared to the standard models. The analysis of the metric values highlights that the proposed MLP model demonstrates balanced performance and a high degree of reliability in its predictions, an aspect that is both rare and highly valuable. These characteristics make this model a suitable candidate for estimating software development effort and duration, particularly when working with small-sized datasets, such as the Kemerer dataset.
The superior performance of the simpler MLP model on the small-sized Kemerer dataset is attributable to its ability to avoid overfitting, a common issue in complex architectures with numerous parameters. In contexts with reduced datasets, an MLP model, characterized by a lower complexity, strikes an optimal balance between expressiveness and generalization, effectively capturing underlying patterns without amplifying noise. This observation is consistent with the bias–variance trade-off theory; for small datasets, models with lower complexity typically generalize better, whereas for medium or large datasets, more complex architectures are able to exploit the richer information available and achieve superior accuracy.
To reinforce the previous observations, the Wilcoxon signed-rank test was applied to the small-sized Kemerer dataset in order to assess the statistical significance of the performance differences between the MLP architecture and the proposed hybrid architectures (MLP_RF, DFCNN_RF, FractalNN_RF, and ELM_RF). Each of the five models was trained over ten independent runs, and the Wilcoxon test was performed to compare their performance across five evaluation metrics: MAE, MdAE, RMSE, CD, and MSLE. The resulting p-values, presented in Table 9, range from 0.0005 to 0.0024, indicating that all observed performance differences are statistically significant at the 1% significance level (p < 0.01). These findings provide strong empirical evidence that the simpler MLP architecture consistently outperforms the more complex hybrid models (MLP_RF, DFCNN_RF, FractalNN_RF, and ELM_RF) in the context of small-sized datasets, such as the Kemerer dataset.
This result supports the hypothesis that architectural simplicity enhances generalization capability under conditions of limited data availability, whereas hybrid models, due to their higher structural complexity, tend to be more prone to overfitting.
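The statistical comparison described above can be reproduced in principle with scipy.stats.wilcoxon, as sketched below; the listed per-run values are placeholders for illustration only and are not results from this study.
```python
# Illustrative Wilcoxon signed-rank comparison of two models over ten runs.
from scipy.stats import wilcoxon

# Placeholder per-run MAE values (illustration only, not results from the study).
mlp_mae    = [0.018, 0.017, 0.019, 0.016, 0.018, 0.017, 0.018, 0.019, 0.017, 0.018]
hybrid_mae = [0.031, 0.029, 0.034, 0.030, 0.028, 0.033, 0.032, 0.030, 0.031, 0.029]

stat, p_value = wilcoxon(mlp_mae, hybrid_mae)
print(f"Wilcoxon statistic = {stat}, p-value = {p_value:.4f}")
# A p-value below 0.01 would indicate a statistically significant difference at the 1% level.
```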
The proposed architectures were also compared with those presented in previous studies, as reviewed in Section 2. The optimal architectures developed for each dataset significantly outperform the values reported in the existing literature. Table 8 highlights the superiority of the proposed hybrid architectures (ELM_RF for large-sized datasets and FractalNN_RF for medium-sized datasets) compared to traditional approaches and prior research results. In the case of small-sized datasets, the traditional MLP architecture still yields better performance.
7. Conclusions
This study focused on developing predictive models for estimating the effort and duration required to complete software projects, tailored to dataset size. An analysis of the software engineering domain revealed that open-source datasets within this field can be categorized into three main groups: small-sized, medium-sized, and large-sized datasets. In this study, four datasets were utilized: the large-sized China dataset, the medium-sized Desharnais and Maxwell datasets, and the small-sized Kemerer dataset.
For this purpose, eight artificial neural network architectures were proposed: four traditional models (MLP, DFCNN, FractalNN, and KELM) and four hybrid models, which were obtained by combining these with the RF algorithm, denoted as MLP_RF, DFCNN_RF, FractalNN_RF, and ELM_RF. The proposed architectures were analyzed and compared based on the following five evaluation metrics: MAE, MdAE, RMSE, CD, and MSLE.
Following the comparative analysis of the results obtained from the eight proposed architectures, it becomes evident that the hybrid neural network model ELM_RF demonstrates the highest effectiveness when applied to large-sized datasets. This conclusion is supported by its superior performance on the China dataset, where it achieved the best overall results across multiple evaluation metrics. The integration of the Extreme Learning Machine with the Random Forests algorithm appears to enhance the model's capacity for nonlinear pattern recognition and reduce the tendency toward overfitting, resulting in improved generalization and predictive stability.
In contrast, for small-sized datasets, the traditional MLP architecture was the most suitable approach, as it yielded the best results on the Kemerer dataset. Furthermore, statistical validation using the Wilcoxon signed-rank test confirmed the robustness and significance of these results, reinforcing the conclusion that MLP provides a reliable and efficient solution for estimating effort and duration in small-sized software project datasets.
In this research paper, a new hybrid neural network called FractalNN_RF was proposed, which combines a Fractal Neural Network with the RF algorithm for regression tasks. When comparing the performance of this optimized hybrid architecture with that of the seven other prediction architectures, FractalNN_RF demonstrated superior results on medium-sized datasets and yielded the highest accuracy on the Desharnais and Maxwell datasets.