A New Machine Learning Forecasting Algorithm Based on Bivariate Copula Functions

: A novel forecasting method based on copula functions is proposed. It consists of an iterative algorithm in which a dependent variable is decomposed as a sum of error terms, where each one of them is estimated identifying the input variable which best “copulate” with it. The method has been tested over popular reference datasets, achieving competitive results in comparison with other well-known machine learning techniques.


Introduction
One way to make predictions of a variable, Y, from the values of another, X, is to compute the conditional expectation of Y|X. In the continuous case for example, if the g(y|x) function defines its conditional density, this expected value is calculated according to Equation (1), with D Y being the support of the Y variable.
In this equation, the conditional density function g(y|x) can be computed from the joint density function of (X, Y), h XY (x, y), and the marginal one of X, f X (x), using Equation (2).
Estimating marginal functions can be reasonably easy using estimators such as histograms, frequency polygons, Na-daraya-Watson kernels, or empirical distribution functions.
However, it is more complicated to propose a regular expression for joint distributions. One of the reasons that justifies this difficulty is the need to represent the true implicit dependence relationship between the variables.
In fact, if F X and G Y are the marginal cumulative distribution functions associated with X and Y, respectively, then it can be demonstrated that there are infinity joint cumulative distribution functions H X,Y with these marginals [1]. Furthermore, pairs of variables can be found, (X 1 , Y 1 ) and (X 2 , Y 2 ), where both X j and Y j variables have the same distribution with the same linear ρ X j ,Y j correlation coefficient too, but with different dependence structuresthat is, there may be two different joint distribution functions, H 1 and H 2 , associated with (X, Y), which would explain the dependency relationship between X and Y in a different way.
For that reason, it is important to identify the joint cumulative distribution function H XY that truly reflects the relationship between X and Y. Later, the conditional distribution could be built from it to make predictions using Equation (1). The methodology proposed in this paper is based on the estimation of these distributions through copula functions.
There are few works in which copulas are linked to the concept of machine learning. Certainly, when searching for "TS = (Copula* NEAR ('data mining' OR 'machine learning'))" in Web of Science (apps.webofknowledge.com, accessed on 25 May 2021), only 15 references were found. In the survey "Copulas in machine learning" [2] (2013), the author talks about the need for synergy between machine learning and copula frameworks and implores the researchers in the copula community to develop algorithms able to cope with highdimensional challenges. To cover this shortcoming, a novel forecasting methodology that uses bivariate copulas is proposed. This methodology is an extension of a thesis published in 2007 [3], in which copula functions were used to forecast, from incremental temperature variables, the percentage error term obtained after adjusting an ARIMA model to a gas demand series. The methodology consisted of an iterative process in which copula functions were used to relate the residual processes with the input variables, as an alternative to the traditional way in which these input variables are included in the ARIMA equation (ARIMAX model).
The work presented in this paper is a generalization of this methodology, incorporating some of the popular ideas of machine learning methods. For example, there is a selection process in which error terms are linearly (in an additive sense) and sequentially predicted through copula functions, identifying the best pair (input variable, copula) for this end. It remembers the stepwise selection process, sometimes used in adjusting regression model, but with the difference that each variable can participate several times in a similar way that they can do in a random forest or a gradient boosting model. During this iterative process, training and validation datasets are distinguished, using an early stop criterion frequently used in well-known methods such as neural networks or the mentioned trees ensemble methods. The methodology has been tested over several reference datasets unlike the most competitive machine learning methods, achieving better results in some of them.
The paper is structured as follows: Section 1.1 consists of a revision of the state of the art about the use of copulas for making predictions; in Section 2, the proposed forecasting methodology based on this kind of functions is introduced; in Section 3, this methodology is applied to several reference datasets, comparing the corresponding results with the achieved ones by the most competitive machine learning techniques; finally, in Section 5, the conclusions and the future lines of work are presented.

State of the Art
Nowadays, there are many fields in which copula functions are used to model multivariate relationships. They are usually associated mainly with the Economy,but their use has spread to various sectors, finding multiple applications in financial [4], insurance [5], energy [6], meteorological [7] and, more recently, forestry and environmental sciences [8].
From a mathematical point of view, copulas are frequently associated with simulation [9]. In this context, copulas are frequently used in risk management [10], where they allow simulating future scenarios taking into account the financial structure of the market. So, S. Ortobelli et al. [11] compared the performance of several reward risk strategies based either on simulated data through copula functions or on historical ones, demonstrating the better performance of strategies valued on the first ones. Copulas have been used too when available data are scarce and they are insufficient to quantify possible risks associated with especially adversarial events, allowing the simulation of samples of pairs of copulated outliers. For example, in the work of R. De Matteis [12], extreme-value copulas [13] were used to simulate the simultaneous occurrence of highly expensive building and contents accidents in an insurance company, and in the work of C. P.Khedun et al. [14], they are used to simulate precipitation anomalies. In the context of outliers, there are some references associated with the use of copulas for the detection of this type of data. For example, T. Bellini [15] uses elliptical copulas to detect multivariate atypical observations, and K. Domino [16] uses the t-Student copulas to artificially sample outliers.
Moving between the fields of simulation and prediction, copula functions are frequently used in time series analysis, finding an interesting review of copula-based models for economic and financial time series data in the work of A. Patton [17]. These series are usually dominated by a random walk component [18], as the aim is frequently focused on forecasting not its average value but its variance. In this context, there are some interesting contributions from copula modelling. O. Sokolinskiy et al. [19] forecast volatility one-day ahead for a financial index, finding that the copula-based realized volatility model (C-RV) outperforms conventional forecasting in terms of accuracy and efficiency. A. Kresta [20] analyzes the applicability of the copula-GARCH model in portfolio optimization, simulating the evolution of financial time series and demonstrating that they provide better forecasts than a benchmark based on bootstrapping techniques. There are also studies that make interesting comparisons between the accuracy of the copula-GARCH and Dynamic Conditional Correlation (DCC) models for forecasting the Value-at-Risk (VaR) and expected shortfall of bivariate portfolios [21,22]. Regarding the VaR metric, S. Guharay [23] proposed a more robust estimation of it based on copula functions. Other authors make comparisons between the well known Capital Asset Pricing Model (CAPM) and copula functions to analyze the co-movement and dependence structure between indexes. So, R. Luo and M. Bhatti [24] use Gaussian, symmetrized Joe-Clayton, and Rotated Gumbel Copulas to conduct this kind of analysis on Islamic investment fund data, while F. Mansor et al. [25] proposed CAPM as an alternative to complex copulas and DCC models. Finally, within this field of capturing co-movements between time series associated with indexes, C. Nguyen [26] proposed a new class of mixed copulas from Clayton, Joe, Gumbel, and Joe copulas, achieving interesting results for investors to configure its portfolios. Di Clemente [27] proposes an interesting methodology for measuring and optimizing the credit risk of a portfolio following a copula-based approach.
Finally, within the scope of forecasting, but outside the one of time series, it is well known that the use of vine copulas [28,29], based on the theory introduced by H. Joe [30] and T. Dedford and R. Cooke [31]. These copulas allow the construction of multivariate distributions from simple building blocks called pair-copulae, decomposing a multivariate non-Gaussian distribution into a product of conditional and unconditional distribution functions, not requiring natural conditional independence assumptions. Certainly, some authors [29] solve prediction problems using copulas, showing that algorithms based on them achieve better results than linear regression models. However, to the extent of the knowledge of the authors of this paper, these algorithms are not usually compared with the provided ones by other popular machine learning methods such as, for example, neural networks, random forest, or gradient boosting.
As some of the cited works make a comparison between copula and DCC, CAPM or regression models, in the same way, in the present work, a comparison between the aforementioned machine learning techniques and the Additive Decomposition Algorithm Based On Copulas (ADABOC) is carried out. This comparison demonstrates that our proposal achieves competitive results.

Materials and Methods
This chapter details the methodology proposed for predicting an interval variable using copula functions. In Section 2.1, a brief introduction to these functions will be presented to facilitate the understanding of the mentioned methodology, which is detailed in Section 2.2.

Copula Functions: Preliminary Concepts and Results
The word copula was originally used in a statistical context in 1959 by the mathematician Abe Sklar in a theorem which bears his name [32]. With this term, the author referred to functions that join (or copulate) multivariate distribution functions to their unidimensional marginals [33].
Essentially, a copula is a multivariate cumulative distribution whose marginal functions are distributed according to standard uniforms. The exposition will be focused on the bivariate case because, apart from being the most studied and referenced in the literature, it is the needed one to understand the proposed algorithm. 1] with the following properties:

Definition 1 (Bicopula function). A bicopula is a function C
C is non-decreasing: for each hyperrectangle B = [u 1 , Sklar's theorem is the main result of the copula theory, since it establishes the relationship between the joint distribution and univariate marginals through this type of function.
Theorem 1 (Sklar Theorem). Let X and Y be random variables with marginal distribution functions F X and G Y , respectively, and a joint distribution function H XY . Then, there exists a copula C such that: If F X and G Y are continuous, then C is unique. If not, C is uniquely determined in Range(F X ) × Range(G Y ). Reciprocally, if C is a copula and F X and G Y are distribution functions, then the function H XY (x, y) = C(F X (x), G Y (y)) is a joint distribution function with marginals F X and G Y .
It is important to remark on the observation referring to the non-continuous case, as most applied papers in the literature that use copula models that deal with continuous data. Certainly, [34,35] talks about whether it should be advised to make inferences in copula models with discrete data. The author of the first one argues that modeling a discrete distribution with a parametric copula and its marginal functions can be very difficult. This is the reason that the continuous case will be considered.
On the other hand, note that, if X is a random variable with a distribution function F X (x) (F X (x) = P[X ≤ x]), then F X (X) can be seen as a random variable too. As a result, if F X is invertible, then: Observe that this is the definition of a standard uniform distribution. For this reason, F(X) (and G(Y)), can be considered as an uniform variable U (V, respectively). So, Equation (3) can be rewritten as: In other words, Sklar's theorem allows us to estimate the joint distribution function H XY , finding the copula function C, which better fits to the values (u, v) of the uniform variables (F X , G Y ) = (U, V). The relevance of this result lies in the fact that a lot of copula functions which reflect a different types of relationships have been studied, making the identification of a good estimation for H XY easier.
Sklar's theorem can be used to identify the independence case between variables too. In fact, when the copula function that best fits the uniform pairs (u, v) is the product copula (6), X and Y can be considered as independent variables.
This is easy to demonstrate: It has already been emphasized that we need to know the conditioned distribution function G Y|X to make predictions of the Y variable from the one X. As G Y|X distribution is difficult to be directly estimated, it seems reasonable to use copulas to this end, as it is important to understand the role these functions play in the characterization of G Y|X . For this, conditioned copulas associated with a copula, C, will be introduced.
Definition 2 (Conditioned copulas associated with a copula). Let C be a copula function.
Fixed U = u, copula conditioned to u is a function of V variable: It can be demonstrated (see Theorem 2.2.7 in [33]) that these partial derivatives exist almost certainly for all u and v, except Lebesgue null measurement sets, and that they almost certainly do not decrease in the unit interval.
So, as a consequence of the adaptation of Sklar's theorem to continuous conditioned distributions, the next result is presented [36].
Proposition 1 (Conditioning with copulas). Let C be a copula function and C 1 (u, v) be the derivative of C(u, v) with respect to U. If the joint distribution of X and Y is given by H XY (x, y) = C(F X (x), G Y (y)), then the conditional distribution of Y|X = x is given by: Observe again that in the independent case (C which is independent of the value of U.

Prediction Algorithm Based on Bivariate Copula Functions
In this subsection, a method for adjusting supervised models to interval variables through bivariate copula functions is proposed. It is important to note that the term "copula" will be used even when the algorithm actually refers to bicopulas, meaning the bidimensional version of this function.
The adjustment process consists of an iterative algorithm that refines a starting basic predictor defined by the average of the values of the dependent variable. This one generates an initial error term. In the first step, the aim is to identify the variable which better explains it through a copula function. The selection of the most appropriated copula to this end is another task to solve in the algorithm. Then, as a result of applying the pair formed by the input variable and the copula function to explain the first error term, a new one is generated, which leads to the start point of step two. This process is repeated until a stopping criterion is satisfied.
In summary, three phases were sequentially applied in each one of the steps of the iterative algorithm:

1.
Adjustment phase: selection of the copula function C * , which best models the relationship between explanatory variables and the residual term obtained in the previous step.

2.
Prediction phase: use of the conditioned copula C * 1 to estimate the value of the error term derived from the previous predictor, which is updated by adding up the mentioned estimated value.

3.
Assessment phase: Evaluation of the goodness of the new predictor over a validation dataset. Mean Absolute Error (MAE) has been the metric used to this end in the proofs presented in Section 3.
On the other hand, three are three possible criteria that can stop the algorithm: 1.
Independence criterion: the best copula C * selected is the product one, indicating the independence between explanatory variables and the error term to be predicted.

2.
Early stopping criterion: the evaluated metric has not significantly improved during the last steps.

3.
Maximum iterations criterion: the maximum number of iterations prefixed is reached.
The methodology of the proposed algorithm, schematically illustrated in Figure 1, is common to any supervising model to build a predictorŶ of the dependent variable Y that optimizes the value of a metric previously established. The fact that each predictor could be generated from simulated values is an advantage. Apart from averaging them to obtain the usual mathematical expectation, it is possible to construct a more robust predictor, such as the median, to avoid the effect of possible outliers. From here, the mean predictor will be considered so as not to favor the results provided by the algorithm proposed against other machine learning techniques tested in Section 3, severely affected by the effect of outliers (such as the GLM model).
Find the pair (X,C) k which lowest AIC over Let Ŷ 0 = Y Let the error ε 1 =Y-Ŷ 0 and a counter k = 1 Use C to calculte an estimator of Vk|U⇒ Vk In any case, the predictor is constructed from the input variables {X 1 , . . , n} being a sample of values corresponding to them. Uppercase and lowercase letters are used to refer to random variables and observed data, respectively. The value in brackets i refers to each one of the n observations.
In the adjustment phase, three different datasets or samples of observations were used: one for training, another one for validation, and the third one for the test, with sizes of n 1 , n 2 and n 3 , respectively. The training sample was used to decide, step by step, the input variable and the copula functions involved in the construction of the predictors. These were evaluated, respect to the prefixed metric, over the validation dataset. The aim of the latter step is to favor the generalization capacity of the predictor, which will be finally tested using the test dataset. As has been previously mentioned, the MAE will be considered as the metric to be minimized: Note that this metric is averaged over the n 2 observations associated with the validation table.
In addition to the use of a validation table, there are two additional parameters taken into account to avoid possible overfitting effects. Both of them also allowed us to reduce the computational time cost of the algorithm: • On the one hand, an early stopping criterion [37] has been established. According to this criterion, the adjustment process stops if the assessment of the metric over the validation sample does not improve their values after several iterations initially prefixed by the earlyStoppingInterations parameter. • On the other hand, different subsamples of the training dataset are considered in each one of the iterations of the process. The size of these subsamples is specified through the subsamplePercent parameter, which specifies the percentage of the n 1 training dataset observations randomly selected in each iteration. This modeling strategy is, for example, used on stochastic gradient boosting algorithm [38] as an improvement of the traditional gradient boosting method [39]. The influence of this parameter is tested in Section 3.2.
Next, the tasks and calculus involved in each one of the steps of the algorithm will be detailed.

ADABOC Algorithm
To start with, a beginning predictor, given by the average value of Y, is considered. This mean value is calculated with the n 1 observations conforming to the training dataset: This predictor generates a first variable associated with the corresponding error term: and through it, a first MAE is obtained, which can be evaluated in the three datasets defined above: The idea is to refine throughout a maximum of iterations, specified by the maxiter parameter, the predictor initialized in Equation (10). This refinement process consists of identifying, in each iteration k, the copula function that better links the sequentially obtained variables ε k with some of the explanatory ones X j . To this end, a set of copula function families to be tested must be previously established: To ensure both X j and ε k are in the definition domain of the copula functions (the unit rectangle [0,1] × [0,1]), these variables must be transformed through their cumulative distribution functions, F X j and G ε k , respectively. It can be observed that only G depends on the subscript k. This is because the variable X j , and therefore the function F X j , will not change from one iteration to another. In contrast, ε k represents the dependent variable to be predicted in the k th step, which has been modified according to the predictor constructed in the previous one (see Equations (10) and (11)). For that reason, the subscript j represents the independent variables (X j or the transformation F X j (X j ) = U j ), while the subscript k will refer to the residuals obtained by predicting the dependent variable (ε k or the transformed one G ε k (ε k ) = V k ). Note that these are the only terms that will be modified in each iteration.
Let {u 1 (i), u 2 (i), . . . , u m (i), v k (i)} be the uniform values obtained from the data {x 1 (i), x 2 (i), . . . x m (i),ε k (i)} through the marginal functions F X 1 , F X 2 , . . . , F X m , G ε k associated with X 1 , X 2 , . . . , X m , ε k variables, respectively. Observe that a hat symbol is used to distinguish between the error variable ε k and its estimatorε k . The corresponding estimated values for the latter, calculated in each one of the iterations, will be referred to asε k (i).
The task of identifying the marginal functions can be completed by trying to find the closer theoretical distribution from a list of the most common ones. This way, for each random variable, it would be necessary to propose hypothesis tests associated with popular distributions and select the one that provides a lower p-value. However, this could be inadequate in case none of them would adjust well enough. Alternatively, kernel estimators have been used in the proposed method. Other authors (see [12]) use the continuous version of the empirical distribution function instead.
Once the uniform sample has been generated, the following task consists of identifying, for each one of the U j variables (associated with each one of the X j input variables), the copula function that best fits the pairs (u j (i), v k (i)). This task is carried out in two stages: • Firstly, one copula associated with each one of the p families is selected. The selection process consists of estimating the value of the parameters that the family depends on. The estimation is carried out using the training dataset, applying the maximum likelihood method as proposed in [12]. Proceeding this way, a total of m · p copula functions, C r j,k , is obtained, one per input variable and copula family. • Secondly, representative values of the fitting goodness of these m · p functions to the pairs (u j (i), v k (i)) are calculated. To this end, some metrics can be acquired, such as the value of the likelihood function, the Akaike information (AIC), or the Bayesian inference criterion (BIC). The first metric has been used due to several authors [12,40,41] considering it as a good criterion. Once these m · p values have been calculated (see Table 1), the smallest of them is considered representative of the best fitting to the training data (u j (i), v k (i)). Let (X * j , C * j,k ) be the pair chosen as the optimal one at the end of the k th step. To simplify the notation in the algorithm, C k * will be used, occasionally, to denote the copula function chosen in this iteration: At this point, the independence stopping criterion is tested. This way, in the case in which C k * = Π, there would be not able to explain the error term through a copula function. As a result, independence between all the variables X j and ε k is concluded and the algorithm stops. Otherwise, a predictor of ε k conditioned to the values of X * j is generated: According to this expression, the expected valuesε (k=1) (i) are estimated for each value x j (i) of the variable X * j . This predictor allows one to obtain a new predictor for the dependent variable Y: Observe that the conditioned mathematical expectation of Equation (14) must be calculated using the density g ε k |X * j , which is unknown. However, the already estimated function C * k contains information about the dependency relationship between X * j and ε k through the transformed variables U j = F X * j (X * j ) and V k = G ε k (ε k ). Hence, it can be used to estimate g ε k |X * j . Specifically, it consists of simulate values of the variable V k |U j = u j , being u j = F X * j (x * j ), detransform them using g −1 ε (k=1) to obtain the corresponding values of the variable ε (k=1) and finally average the latter.
The simulation of the values of the variable V k |U j = u j can be carried out using the inverse transform method [42]. This method requires that the function C * j,k|1 (or C * k|1 to simplify) admits an explicit expression. This function is the conditioned copula derived from the known copula C * k , in which the subscript "1" is used to refer to the conditioning with respect to the first of the variables (U j ) of the pair (U j , V k ). However, there are copula functions for which this explicit expression does not exist and, as a result, the corresponding inverse function can not be expressed in a closed-form [12] to apply the mentioned method.
Alternatively, an approximation is proposed for estimating Equation (14). The question is how to weigh the g −1 ε (k=1) (v) values to average them or, in other words, how many values must be generated for each v ∈ [0, 1], which is the support of the variable V. To answer this question, note that the density function c * j,k (or c * k to simplify) associated with the known copula C * k , contains information about the proportionality relationship that must exist between the v values to be generated, and so, Z it can be used to assign the mentioned weights. According to this consideration, the proposed method consists of: .., t}. The number of values is specified by the parameter numBins (see Algorithm 1).
be the weight associated with eachε (k=1),s value ∀s ∈ {1, 2, ..., t}. Note that the value c * k (u j , v k ) can be calculated because c * k is known and has an explicit form. On the other hand, f U j (u j ) = [0,1] c * k (u j , v)dv is an area that can be easily estimated using numerical methods. Again, equidistant points are generated for approximating these areas using the numBins parameter. • Letε (k=1) = ∑ t s=1 w s ·ε (k=1),s . • Construct the predictorŶ (k=1) =Ȳ +ε (k=1) .
The variable corresponding to the error associated with the latter predictor of the dependent variable Y is: Figure 2 summarizes the construction of the first predictorŶ (k=1) from original data values (X j , Y). (CEMENT, TARGET) values from Concrete dataset (see Section 3) have been used to this end:

•
The graphs in the first row show, respectively, the pairs of points associated with the original variables (X j , Y), those transformed through the corresponding marginal cumulative distribution function ((U j , V) = F X j (X j ), G Y (Y)), and the density c * associated with the copula C * that best fits the latter. • On the other hand, in the first graph of the second row, the value x j of the variable CEMENT, to which the value of the variable TARGET is conditioned, is marked with a red vertical reference. Next, the value u j transformed by the cumulative distribution function F X j is marked as well. Finally, the last graph shows the conditional density copula c * (v|u j ) associated with this value. This function is used to weight the values of variable TARGET to estimateŶ (k=1) = E[Y|X * j = x j ]. The following step (k = 2) starts again with the search of the pair (C * j,2 , X * j ) that provides a lower AIC in the adjustment process to the values (u j (i), v k=2 (i)). These ones are obtained from (x * j (i),ε k=2 (i)), through their respective cumulative distribution functions. Then, the expected value of ε (k=2) must be estimated for each value of X * j : A new predictor for the dependent variable Y will be generated at the end of this step: The algorithm stops when the number of iterations specified by the maxiter parameter is reached, unless the early stopping criterion is satisfied or the product copula π is selected as the best one in some of the iterations.
Observe the fact that the same input variable can be selected repeatedly to explain the resultant error terms from different iterations. Note that some algorithms, such as random forest [43] or gradient boosting [39], allow this kind of circumstance. In fact, they allow the same variable not only to participate in different trees, but to be used several times in the same tree.
Note too that the predictors adjustment process is carried out sequentially, and so each one of them is conditioned by the order of selection of the explanatory variables-that is, it will not obtain the same result by predicting Y as a function of X 1 and the resulting residual from X 2 , by predicting Y as a function of X 2 , and the resulting residual from X 1 . In fact, neither the family of copulas selected for each one of these variables proceeding in one way or another nor the estimation of the parameter they depend on would be the same.
Alternatively, the predictor could be built by using, simultaneously, all the explanatory variables at a time, through a (m + 1) − copula that would link the target variable Y with the m independent ones X j . This way, the same input variable could not enter more than once in the model and the multidimensional parameter estimation the copula depends on would be carried out jointly, as occurs, for example, in a regression model. However, as it has been pointed in Section 1.1 that attempting to capture dependency structures using a single multivariate parametric copula could be an arduous task. The fact that the variety of the studied m-dimensional copula functions, when m is greater than 2, is significantly smaller than those available in the bidimensional case, which is the most referenced in the literature, must be taken into account. This would reduce the spectrum of them available to identify the dependency relationship and, as a consequence, the accuracy of the predictions.
Once the training process has concluded, the predictors generated in the different iterations are compared with respect to the MAE metric. This one is measured over the validation dataset to choose the best estimator.
Thus, the final predictor will present the expression: where k * is the iteration in which the best value of the MAE metric over the validation dataset has been achieved. The described process has been coded in Algorithm 1. The output returned by this algorithm (see Equation (20)) is an object that contains the predictorŶ k * and the information needed to score any dataset that contains the same input variables as those used in the adjustment process. Algorithm 2 details this scoring process.
The object defined by Equation (20) consists of a list of 4-tuples, where the components of the kth one are: 1.
X j k -the kth input variable entered in the kth step to explain ε k . Note that subscripts j k could refer to the same variable for two distinct iterations, meaning that the same variable can participate in different steps; 2.
F X j k -the cumulative distribution function associated with X j k . This function is used to transform an x j k value into a u j k value; 3.
C * k -the copula function adjusted to (u j k , v k ) in the k th iteration; 4. data k = (x j k (i), ε k (i))-the data associated with all the observations "i" of training and validation datasets. This information is useful to avoid repeating the calculus of ε k with copula function C * k (through the corresponding conditional function) for values of X j k used in the adjustment process and for values of X j k contained in training and validation datasets. So, the second component F X j k , will be used only in the case where during the scoring process a value of X j k different to the values registered in these samples appears. This allows one to reduce the computational time cost involved in the scoring process of a new dataset.
So, the predictor is by constructed decomposing the dependent variable in a relationship of error terms that have been estimated from copula functions. This is the reason the proposed method has been named the Additive Decomposition Algorithm Based On Copulas (or ADABOC).Ŷ 4 =Ȳ +ε 1 +ε 2 +ε 3 +ε 4 By way of illustration, Figure 3 shows the MAE variation carried out by Algorithm 1 along the validation sample associated with the Concrete dataset (see Section 3). The pair copula (variable, copula) = (X j k , C * k ) that has been selected in each step k is specified in the boxes. So, for example, the variable named CEMENT and a SurvivalBB1 copula have been selected in the first iteration, the variable named SUPERPLASTICIZER and a Frank copula have been selected in the second one and so on. .., p n = n 1 + n 2 subn 1 =round(subsamplePercent * n 1 ) Num of train observts. randomly selected for iteration counterNoImprovement Cumulative number of iterations without improvement counter y 0 (i) = 1 n 1 ∑ n 1 s=1 y s , ∀i = 1, 2, ..., n ε (1) .., n ∀j = 1, 2..., m Pair in copula dom.
(C k * , X j k ) = argmin j,r {AIC(C j,k r )} The minimum and the variable in which it is reached AIC(C j,k r ) = fitVarCop(C r , subn 1 , (u j k , v k )) AIC by fit copula to subn 1 rand pairs of train if C k * = Π Independence criterion then For each input we store needed data for scoring other setŝ y k (i)=ŷ k−1 (i) + E ε Calculate the new Y estimator ε k+1 (i) = y(i) −ŷ k (i) Calculate the new error variable MAE k+1 = 1 n 2 ∑ n 2 l=1 |ε k+1 (l)| On validation set if it exists (else over training set) if round(MAE k+1 , epsilon) < round(MAE k , epsilon) Early stopping criterion then counterNoImprovement = 0 k * = k Index associated with the iteration with minimum MAE else counterNoImprovement = counterNoImprovement + 1 The algorithm stopped in the 28th iteration as the best copula selected in this step has been the product copula π. It means that the residual process resulting after 28 iterations is independent with respect to all the explanatory variables. So, as the minimum MAE value calculated over the validation table has been achieved in iteration 21 (red point in Figure 3), the final predictor is composed as follows:

Function
ADABOCPredict.R implements Algorithm 2. It depends on the next input parameters: • model (required): name of the object returned by the ADABOC.R function; • scoreDataset (required): name of the dataframe associated with the dataset to be scored. It must contain the input variables involved in the predictor given by the previous parameter; • prediction: output returned by the function. It contains the predictions obtained for each one of the observations i stored in scoreDataset. Observe that the expected values are given byε (k) , which must only be calculated for the values x * j (i) not registered in trainingDataset or tvalidationDataset. So, the estimated ones already generated by the ADABOC function during the adjustment process are used.
score } Dataset to be scored (X j k variables must be in copulaModel)

Results
In this chapter, ADABOC and other competitive machine learning models are compared over some typical datasets.

Datasets Used
The datasets used have been extracted from Knowledge UCI: Machine Learning Repository (archive.ics.uci.edu/ml/datasets, accessed on 25 May 2021), imposing some filters on the search criteria: • The dependent variable must be of the interval type but there must not be a temporary dependence between its values. ADABOC is not oriented to the modeling of time series. • Datasets must be big enough to make it reasonable to apply machine learning methods. Thus, the number of rows of the dataset must be between 1000 and 10,000 and the number of attributes must be higher than 5.
Taking into account these filters, the datasets considered are shown in Table 2. Next, some observations are made: • Dataset 2 has two target variables: next-day maximum and minimum air temperatures.
To consider only one problem associated with this dataset, next-day medium air temperature has been considered. This one has been calculated as the average of the two target variables. • Dataset 8 has two target variables too: motor_UPDRS and total_UPDRS. The relative positions between the different tested models are similar with one another, which is the reason why only one of them (total_UPDRS) has been considered. • Dataset 4 is the same that Dataset 3 but adding two additional input variables and not normalizing data. That is the reason why only the first one has been considered.
• Dataset 7, available in the UCI repository associated with the problem KDD Cup 1998 Data, has been replaced by the one available in www4.stat.ncsu.edu/~dickey/Analytics/ Datamine/data (accessed on 25 May 2021) as the number of input variables of this dataset is considerably lower, allowing the reduction in the computational cost of the proof tested while keeping a good predictive capacity (with this dataset, the SAS Institute achieved the second position in the KDD 1998 competition). • Each one of the datasets has been split into training, validation and test samples, with 40%, 30% and 30% of the observations for each one, respectively. The corresponding sizes are shown. In addition to the filters imposed, some of the variables of the datasets have been dropped: • On the one hand, the ones with missing values were removed. This is justified by the nature of some of the models used in the comparison. For example, random forest and gradient boosting models allow these kinds of values. However, ADABOC as well as some of the other models taken into account in the comparative, such as GLM and neural networks, need the imputation of the mentioned values, with a lot of methods available for this task: to impute by using the mean or the median value, to impute according to the distribution of the variable, predict them, etc. Thus, with the aim of isolating the effect that these methods could have on the final results obtained, these variables have been removed. • On the other hand, the ones of nominal nature (factors) were removed. This is justified by the convenience of not making inferences with copulas using non-continuous variables (see Section 2.1).

Experiments
Algorithm 1 has been applied to the datasets described in the previous section. It could use any type of copula functions. Specifically, taking into account that the algorithm has been implemented depending on BiCopEst (Parameter Estimation for Bivariate Copula Data) R-function of the VineCopula package to select the copula with best AIC value, the ones contained in this function have used (see Table A1 in Appendix A). Only Tawn's copulas have been excluded as they severely increased the computational time cost without any noticeable improvement in the results achieved.
Regarding the parameters of the function itself, all of them take the default values presented in Section 2.2. Due to the fact that there is a validation table, the earlySopping Iterations parameter has been used. Its value has been set to 10 iterations. Additionally, several values of the subsamplesize parameter have been tested, from 10 to 100 with a bystep value equal to 10. Table 3 allows comparing of MAE results over validation table, according to the variation of this parameter. In general, it can be noted that there are no significant differences between them and that higher sizes do not provide better results. This is an important observation as it allows reducing of the computational time cost of the model, making use of small samples to train it. Then, the predictor associated with the best results obtained for this parameter over each validation dataset (highlighted in bold in Table 3) was applied over the test sample. The corresponding results are the ones presented in the row identified as ADABOC in Table 4. It can be observed that theses values are similar to those obtained over the validation table, avoiding overfitting effects. On the other hand, some popular machine learning models have been adjusted to compare the results they provide with ADABOC. To this end, h2o.autoML function, which invokes to H 2 O platform, has been used. This function adjusts the following type of models: Tuning of the parameters associated with each of these models was carried out by h2o.autoML function. For instance, in GBM, tree depth and shrinkage parameters are allowed to vary on a grid of values, while in NN different architectures are tested.
Additionally, the h2o.autoML function allows specification of the metric to be optimized along the adjustment process of each model over the validation table (specified through the validation_ f rame parameter). That metric is specified through the stopping_metric parameter (MAE has been used). Results, obtained with the best model identified for each one of the mentioned types, are compared over the test dataset (specified through the leaderboard_ f rame parameter) in Table 4. The results corresponding to the winner model are remarked in bold.
It should be noted that these results were obtained with the execution of the h2o.autoML function conditioned by the value of a random seed whose value was set to 12345. Fixing the value of this seed allows the results to be replicated. However, the documentation of this function aims for replication to not be possible if NNs are included in the modeling process. For this reason, two executions of this function have been made: • The first of them excludes NN models. In this one, it has been specified that the function adjusts a maximum of 10 models. According to these considerations, in each execution, 5 GBM, 3 XGB, and 1 of each of the other types mentioned (GLM and DRF) are compared (see Table 5). • The second one includes only NN models. The number of them have been fixed to 10 (see Table 6).
A reasonably high value has been specified in the computation time of each of these executions, which have guaranteed the complete execution thereof. This value have been specified through the max_runtime_secs parameter of the h2o.autoML function. It was set to 10,800 s (3 h) in the first execution and to 25,200 (7 h, although they were not spent) in the second. The fact that it has been significantly greater in the latter is justified by the high computational time needed by the neural network models.  In Table 4, results corresponding to the best modeling strategy have been highlighted in bold. A baseline associated with the MAE of predictions is given by the intercept term of a regression model only, and the average of the target variable over the training dataset is shown too.
It is important to remark that the aim is to compare ADABOC with other well known machine learning methods. A previous phase of processing data would allow better results to be achieved with any of them. However, this phase was been carried out as the purpose is to test if the algorithm is competitive versus other techniques, when all of them are trained and tested with common datasets.

Discussion
According to the results shown in Table 4, ADABOC seems to be a competitive machine learning algorithm. In fact: • It provides the best results in 2 of the 7 forecasting problems, CommunitiesU and KDD1998, even improving the ones obtained with NN and GBM, the best models globally. • In the dataset Concrete, it improves the results given by DRF, XGB and GLM models, , which are not far from the other ones. In fact, it is superior to two of the GBM models and five of the adjusted NN models (see Tables 5 and 6, respectively). • It is always better than GLM, which is the worst model, although the predictions of the latest cause the MAE given by only an intercept term to reduce. Thus, in any case, it can be considered as a better alternative to a linear model.
Not in vain, the ADABOC methodology incorporates some of the modeling strategies used by other well known machine learning techniques such as the use of an early stopping criterion or the use of subsamples of the training dataset in each one of the steps of the iterative process. Next, other relevant points in common with some of these methods are presented: • On the one hand, it can be observed that the same input variable can be selected repeatedly to explain the resultant error terms from different iterations. Note that some of the algorithms, such as random forest [43] or gradient boosting [39], allow this kind of circumstances. In fact, they allow the same variable to not only to participate in different trees but to do it several times in the same tree. In addition, in the proposed algorithm, it could be allowed that, in each one of the iterations, p (≤ m) explanatory variables were randomly selected as models, as in random forests. This way, it would avoid those with greater predictive power being repeatedly selected, in favor the rest of them. • On the other hand, it can be observed that the predictor adjustment process was carried out sequentially, and so each one of them was conditioned by the order of selection of the explanatory variables-that is, the same result will not be obtained by predicting Y as a function of X 1 and the resulting residual from X 2 , by predicting Y as a function of X 2 and the resulting residual from X 1 . In fact, neither the family of copulas selected for each one of these variables proceeding in one way or another nort the estimation of the parameter they depend on would be the same. Alternatively, the predictor could be built by using, simultaneously, all the explanatory variables at a time, through a (m + 1) − copula that would link the target variable Y with the m independent ones X j . This way, the same input variable could not be entered more than once in the model and the multidimensional parameter estimation the copula depends on would be carried out jointly, as occurs, for example, in a regression model. However, as it has been pointed in Section 1.1, attempting to capture dependency structures using a single multivariate parametric copula could be an arduous task.
As an intermediate possibility, making an adjustment based on 3-copula functions could be considered,. This means that, in each iteration, the copula would relate each error term, ε k , with a couple of explanatory variables. This way, possible effects associated with interactions between variables could be reflected in the same way that would be achieved with the regression techniques or with the ensemble modeling mentioned when handling decision trees with depth values higher than 1. It must be taken into account that the variety of m-dimensional copula functions studied when m is greater than 2 is significantly smaller than those available in the bidimensional case, which is the most referenced in the literature. This would significantly reduce the spectrum of those available to identify the dependency relationship and, in consequence, the accuracy of the predictions.
Finally, it must be taken into account that the predictor generated by the Algorithm 1 could be combined through a stacking strategy with the ones provided by other techniques, to try to achieve an improvement of the results.

Conclusions
In this paper a novel methodology for forecasting interval variables based on the use of bivariate copula functions is proposed. The core of the methodology is an iterative algorithm that identifies those bicopulas that best fit the dependency relationships between the independent variables and the different error terms obtained during the adjustment process. The method has been tested on several known datasets, establishing a comparison with other popular and competitive machine learning techniques, achieving satisfying results.
Nowadays, the authors are working on several lines of improvement of this algorithm: • First of all, ensuring the ability to handle variables of a discrete nature, in a similar way that the vine copula regression method proposed in [29] do; • Secondly, the possibility of assigning higher weights to higher value errors, in a similar way that boosting methods do to improve the training of the method. • Thirdly, the capacity to include analysis of the interactions between the independent variables use in the model. The proposed method has already shown the concept of interaction since the predictor mixes different explanatory variables in its definition. However, the use of 3-copula functions that would allow relating pairs of independent variables with the successive error terms obtained during the adjustment could be an interesting possibility [44] as it would capture more complex dependency relationships and possibly improve the predictive capacity of the method. • Finally, although the number of problems in which there is not a single dependent variable is significantly lower, the possibility of adapting it to problems in which there is a vector of dependent variables is also being evaluated. In this regard, the concept of conditional copula associated with the variable (X, Y) | W introduced by [45], which would allow the prediction of the pair (X, Y) based on the value of a dependent variable W, is being considered.
The final aim is that ADABOC achieves more competitive results and works regardless of the amount and the nature of exogenous and endogenous variables involved in the model.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.