Regression Models for Symbolic Interval-Valued Variables

This paper presents new approaches to fit regression models for symbolic interval-valued variables, which are shown to improve and extend the center method suggested by Billard and Diday and the center and range method proposed by Lima Neto and De Carvalho. Like the previously mentioned methods, the proposed regression models consider the midpoints and half of the lengths of the intervals as additional variables. We considered various methods to fit the regression models, including tree-based models, K-nearest neighbors, support vector machines, and neural networks. The approaches proposed in this paper were applied to a real dataset and to synthetic datasets generated with linear and nonlinear relations. For the evaluation of the methods, the root-mean-squared error and the correlation coefficient were used. The methods presented herein are available in the RSDA package written in the R language, which can be installed from CRAN.


Introduction
Statistical and data mining methods have been developed mainly for cases in which variables take a single value. In real life, there are many situations in which the use of this type of variable can lead to an important loss of information or result in time-consuming calculations. In the case of quantitative variables, a more complete description can be achieved by describing an ensemble of statistical units in terms of interval data; that is, the value taken by a variable is a closed interval in the real numbers.
It is especially useful when it is convenient to summarize large datasets in such a way that the resulting summary is of a manageable size and still maintains as much information as possible from the original dataset. For example, suppose we want to substitute the information of all transactions made by the owner of a credit card with a unique "transaction" summarizing all the original transactions. This is achieved thanks to the fact that this new transaction will have in its fields not only numbers, but also intervals defined by, for example, the minimum and maximum purchase.
The statistical treatment of interval-valued data was considered in the context of symbolic data analysis (SDA) introduced by E. Diday in [1], the objective of which is to extend the classic statistical methods to the study of more complex data structures that include, among others, interval-valued variables. A complete presentation of symbolic data analysis can be found in [2-4].
Research on SDA has focused primarily on unsupervised learning, with few contributions in the field of regression, which have been made mainly based on linear models. In this work, we explore nonlinear regression algorithms in conjunction with the center method and the center and range method for interval-valued data, aiming to improve on the classical linear regression approaches of the center method and the center and range method proposed in [5,6], respectively, and on the extended lasso and ridge regression for interval-valued data proposed in [7]. In this way, we are able to study how more sophisticated algorithms can improve the traditional models based on linear methods. Furthermore, we extend the tool kit of algorithms available for regression on interval-valued data, taking advantage of the properties and the large number of algorithms for real-valued data. We explore classical linear regression models, tree-based regression models (regression trees, random forest, and boosting), K-nearest neighbors regression, support vector machine regression, and regression using neural networks. Section 2 gives a summarized presentation of the different regression models for real-valued data. Section 3 presents a summary of the center method and the center and range method in the context of each regression model considered. Section 4 presents an experimental evaluation with a real dataset and simulated datasets, which evidences the differences and improvements among the models. Finally, Section 5 gives the concluding remarks based on the results.

Regression Methods
In this section, we provide a short review of the main regression methods and their mathematical formulation.

Classical Linear Regression Methods
We present only a summary of the classical linear regression model; a complete presentation can be found in [8,9]. Linear regression models have for decades been among the most important predictive methods in statistics, and they remain today one of the most important tools in statistics and data mining. The idea is, given an input vector x^t = (x_1, x_2, . . . , x_p), where p is the number of variables and x^t represents the transpose of x, to predict the response variable y through the following linear model:

ŷ = β_0 + Σ_{j=1}^{p} x_j β_j,

where β_0 is called the intercept. If a constant one is included in the vector x and β_0 in the coefficient vector β, the linear model can be written in vectorial form as a product:

ŷ = x^t β.

To fit the linear model on the training data, the most popular estimation method is least squares. In this approach, we pick the coefficients β to minimize the residual sum of squares (RSS):

RSS(β) = Σ_{i=1}^{n} (y_i − x_i^t β)^2,

where n is the number of observations in the dataset. RSS(β) is a quadratic function; therefore, its minimum always exists. Note that it can be written as:

RSS(β) = (y − Xβ)^t (y − Xβ),

where X is an n × p matrix wherein each row is a vector in the training dataset, and y is an n-sized vector (the output vector in the training dataset). It is well known that, if X^t X is a nonsingular matrix, the solution is given by:

β̂ = (X^t X)^{−1} X^t y.

The value approximated by this model for the training example x_i can be estimated as ŷ_i = x_i^t β̂, and the fitted value for a new case x^t = (1, x_1, . . . , x_p) is given by ŷ = x^t β̂. In practice, there are various methods to find the coefficients β̂, but some of the existing methods, and in particular the least squares method, are labor intensive and time consuming with large datasets, while others are not accurate enough with these kinds of datasets.
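As a concrete illustration, the closed-form least-squares solution can be sketched for a single predictor, where β̂ = (X^t X)^{−1} X^t y reduces to the familiar slope and intercept formulas. This is a minimal sketch in Python with hypothetical helper names, not the implementation shipped in the RSDA package.

```python
def fit_ols(x, y):
    # Closed-form least squares for one predictor: the 1-D special case
    # of beta_hat = (X^t X)^{-1} X^t y.
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    beta1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
            sum((xi - xbar) ** 2 for xi in x)
    beta0 = ybar - beta1 * xbar  # intercept
    return beta0, beta1

def predict_ols(beta0, beta1, x_new):
    return beta0 + beta1 * x_new

# Data generated exactly by y = 2x + 1 is recovered exactly.
b0, b1 = fit_ols([1, 2, 3, 4], [3, 5, 7, 9])
```

For p > 1 predictors, the same normal equations are solved with a matrix inverse or, in practice, a QR decomposition.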
A new non-iterative algorithm for identifying multiple regression coefficients, based on the SGTM neural-like structure and intended for processing large volumes of data, was proposed in [10], where the high efficiency of the method in terms of accuracy and speed in comparison with existing methods was established.
Extensions to this model include shrinkage linear regression models, such as the ridge and lasso models. These models impose different types of regularization on the parameters: L2 regularization, used in ridge, and L1, used in lasso. A complete presentation of these models can be found in [8,11].

Tree-Based Regression Models
We now present a summary of the three main tree-based regression models. A complete presentation of these models can be found in [8].

Regression Trees
There are two main steps for regression using decision trees: First, we divide the predictor space in R^p into J non-overlapping regions R_1, . . . , R_J; second, for every testing example that falls in R_j, we predict the response variable as the mean of the response variable over the training examples in R_j.
The regions R_j could be of any geometrical shape, but for the sake of simplicity and interpretation, we restrict ourselves to rectangular regions (boxes). Therefore, we search for a partition of the predictor space into boxes R_j that minimizes the RSS, which can be written as:

RSS = Σ_{j=1}^{J} Σ_{i ∈ R_j} (y_i − ŷ_{R_j})^2,

where ŷ_{R_j} is the mean response of the training observations in R_j. It is computationally unfeasible to consider every possible partition of the predictor space; for this reason, the method uses top-down, greedy recursive binary splitting. It is binary because every split of a predictor variable divides the predictor space into two sets; it is top-down because it builds the tree from the top to the leaves; and it is greedy because, at each step, the best split of the predictor is made without looking ahead to pick a split that might lead to a lower RSS in some future step.
To construct the tree, we first consider all the predictor variables and all the possible binary splits of their values. If the variable is numerical, all the possible split values s in the range of the variable are considered, and if the variable is categorical, we consider all possible partitions of its values into two sets. We then select the variable and the split of that variable that lead to the greatest reduction in RSS. This produces two regions corresponding to two branches of the tree. We then continue looking for the best predictor and the best partition in each of the resulting regions, and the process ends when a stopping criterion is reached. It is common to build a large, complex tree and then prune it in order to reduce overfitting.
Once the regions R_1, . . . , R_J have been created, we can use them to predict the response for new examples as the mean of the response of the training observations in the region to which the new example belongs.
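The greedy recursive binary splitting just described can be sketched for a single numeric predictor; all helper names are ours, and pruning is omitted for brevity.

```python
def rss(ys):
    # Residual sum of squares around the mean of ys.
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(xs, ys):
    # Try every midpoint between consecutive sorted x values and greedily
    # pick the split with the lowest total RSS of the two regions.
    best = None
    pairs = sorted(zip(xs, ys))
    for i in range(1, len(pairs)):
        s = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= s]
        right = [y for x, y in pairs if x > s]
        total = rss(left) + rss(right)
        if best is None or total < best[0]:
            best = (total, s)
    return best[1]

def grow_tree(xs, ys, min_node=2, depth=0, max_depth=3):
    # Stop and emit a leaf (the mean response) when the node is small
    # enough or the tree is deep enough.
    if depth >= max_depth or len(ys) <= min_node:
        return sum(ys) / len(ys)
    s = best_split(xs, ys)
    left = [(x, y) for x, y in zip(xs, ys) if x <= s]
    right = [(x, y) for x, y in zip(xs, ys) if x > s]
    if not left or not right:
        return sum(ys) / len(ys)
    return (s,
            grow_tree([x for x, _ in left], [y for _, y in left],
                      min_node, depth + 1, max_depth),
            grow_tree([x for x, _ in right], [y for _, y in right],
                      min_node, depth + 1, max_depth))

def predict_tree(tree, x):
    # Descend until a leaf (a plain number) is reached.
    while isinstance(tree, tuple):
        s, left, right = tree
        tree = left if x <= s else right
    return tree
```

A query falling in a region is predicted by that region's mean response, exactly as in the two-step description above.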

Random Forest
The idea is to build a given number of regression trees on bootstrap sets (that is, obtain distinct datasets by repeatedly sampling with replacement observations from the original dataset) and use, in each tree, a random subset of m of the original predictor variables. Usually, m is taken to be m = √ p, where p is the total number of variables.
In this case, the prediction of a new example is the mean over the predictions of the individual trees. The idea of random forest is to decorrelate the trees, thereby making the average of the resulting trees less variable.

Boosting
In this method, the idea is to sequentially construct trees on repeatedly modified versions of the training data and the loss function, thereby producing a sequence of regression models, G_j(x), whose predictions are then combined and weighted according to the error they produce.
Initially, all the N training examples (x_i, y_i) have the same weights w_i = 1/N. In each following step of the training process, the data are modified by updating these weights w_i. At step m, the observations with a higher error under the model G_{m−1}(x) induced at the previous step have their weights increased, whereas the weights are decreased for those with a lower error, and those weights are in turn taken into account by the loss function. As a result of this weight update, each successive model in the sequence is forced to concentrate on the training observations for which previous models produced higher errors.

K-Nearest Neighbors
Given a number K and a testing example x_0, we identify the set N_0 of the K training examples nearest to x_0. The prediction ŷ_0 of the response variable for x_0 is the mean of the response variable over the examples in N_0, that is:

ŷ_0 = (1/K) Σ_{i ∈ N_0} y_i.

We can use any distance between examples, but it is recommended to use the distance that minimizes the testing RSS. To select the appropriate number of neighbors K, it is recommended to use cross-validation to compare the RSS of the resulting models for K = 1, . . . , √n, where n is the total number of examples. More details on the K-nearest neighbors regression model are presented in [8].
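A K-nearest-neighbors prediction is just an average over the K closest training points; a one-predictor sketch follows (the function name is ours).

```python
def knn_predict(train_x, train_y, x0, k):
    # Mean response of the K training examples nearest to x0,
    # using absolute distance on a single predictor.
    neighbors = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in neighbors) / k
```

With K = 3 and an outlying training point far from the query, the outlier is simply never among the selected neighbors, so it has no influence on the prediction.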

Support Vector Machines
The support vector machine model for regression is an extension of the linear regression model that uses a metric less sensitive to outliers: by means of an ε-insensitive function in the loss function, a fixed margin is used to ignore the errors that fall within it, and a hyperplane is defined that adjusts the data and defines the prediction.
Given a threshold ε, the idea is to define a margin such that examples with residuals within the margin do not contribute to the regression fit, while examples with residuals outside the margin contribute proportionally to their magnitude. Therefore, outlier observations have a limited effect, and the examples that the model fits well have no effect on the model.
The loss function of this model is given by:

LF(β) = C Σ_{i=1}^{n} L_ε(y_i − ŷ_i) + Σ_{j=1}^{p} β_j^2,

where C is the cost penalty, which penalizes large residuals; ŷ_i = β_0 + β_1 x_{i1} + · · · + β_p x_{ip}; and

L_ε(ξ) = 0 if |ξ| ≤ ε, and L_ε(ξ) = |ξ| − ε otherwise,

is the ε-insensitive function. We search for the parameters β̂_j that minimize LF(β).
A complete presentation of this model can be found in [8].
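The ε-insensitive loss itself is simple to state in code. The sketch below computes the loss for given coefficients; the actual SVM fitting is the minimization of this quantity, which is done by a solver. Names are illustrative.

```python
def eps_insensitive(residual, eps):
    # Zero inside the margin, linear growth outside it.
    return max(0.0, abs(residual) - eps)

def svm_loss(y, y_hat, betas, C, eps):
    # C * sum of eps-insensitive residuals plus an L2 penalty on the slopes.
    data_term = sum(eps_insensitive(yi - yhi, eps) for yi, yhi in zip(y, y_hat))
    return C * data_term + sum(b * b for b in betas)
```

Residuals smaller than ε contribute exactly zero, which is what makes the fitted hyperplane insensitive to small errors and to well-fitted examples.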

Neural Networks
For a neural network model, the function f that approximates the true relation of the data, y = f(x), takes the form of a composition of functions f(x) = f^(n) ∘ f^(n−1) ∘ · · · ∘ f^(1)(x). In this chain structure, f^(i) is called the i-th layer of the network, n is the depth of the network, and f^(n) is the output of the network (which in the regression setup is a real-valued function); the other layers are called hidden layers and are typically vector-valued functions. The neural network model is associated with a directed acyclic graph describing how the functions are composed together, and the idea of using many layers of vector-valued representations is that each one can learn distinct specific patterns in the data.
The training examples specify directly what the output layer must do at each point x; that is, it must produce a value that is close to the true value y. The behavior of the hidden layers is not directly specified by the training data; instead, the learning algorithm must decide how to use these layers to best implement an approximation of the true value y.
Neural networks are usually trained using stochastic gradient descent, which involves computing the gradients of complicated functions, and the back-propagation algorithm is used to efficiently compute these gradients.
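The chain structure f(x) = f^(n)(· · · f^(1)(x)) is, concretely, a sequence of affine maps each followed by a nonlinearity. A minimal forward pass with hand-fixed weights (no training loop; all names illustrative) can be sketched as:

```python
def relu(z):
    return max(0.0, z)

def identity(z):
    return z

def layer(x, W, b, activation):
    # One layer: activation(W x + b), with W given as a list of rows.
    z = [sum(wij * xj for wij, xj in zip(row, x)) + bi
         for row, bi in zip(W, b)]
    return [activation(zi) for zi in z]

def forward(x, layers):
    # Compose the layers; the last one is linear for regression output.
    for W, b, act in layers:
        x = layer(x, W, b, act)
    return x[0]
```

Training would adjust W and b by stochastic gradient descent, with the gradients computed by back-propagation as described above.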
Full details of the mathematical formulation of the neural network model can be found in [8,12].

Regression Models for Symbolic Interval-Valued Variables
In this section, we summarize the center method and the center and range method, complete presentations of which can be found in [5,6,13-15]. We also propose an approach to the center method and the center and range method in the context of the other regression models considered.

Center Method
In the center method, the β parameters are estimated based on the intervals' midpoints. In this method, there are predictors X_1, . . . , X_p and a response Y to be predicted, all of which are interval valued. Therefore, X is an n × p matrix in which each row is a vector of the training dataset, x_i = (x_{i1}, . . . , x_{ip}), with x_{ij} = [a_{ij}, b_{ij}]. We denote by X^c the matrix of interval midpoints of the matrix X, that is, x^c_{ij} = (a_{ij} + b_{ij})/2, and we denote by y^c_i = (y_{Li} + y_{Ui})/2 the midpoints of Y. The idea of the center method is to fit a linear regression model over the pairs (x^c_i, y^c_i) for i = 1, . . . , n, with y^c = (y^c_1, . . . , y^c_n)^t. If (X^c)^t X^c is nonsingular, then from (5), we know that the unique solution for β is given by:

β̂^c = ((X^c)^t X^c)^{−1} (X^c)^t y^c.

The prediction of y = [y_L, y_U] for a new case x = (x_1, . . . , x_p) with x_j = [a_j, b_j] is estimated as follows:

ŷ_L = (x_L)^t β̂^c and ŷ_U = (x_U)^t β̂^c,

where (x_L)^t = (1, a_1, . . . , a_p) and (x_U)^t = (1, b_1, . . . , b_p).
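A sketch of the center method with a single interval-valued predictor, using the closed-form least-squares fit on the midpoints (function names are ours; the paper's implementation lives in the RSDA R package):

```python
def fit_center_method(intervals_X, intervals_y):
    # Fit ordinary least squares on the interval midpoints.
    xc = [(a + b) / 2 for a, b in intervals_X]
    yc = [(lo + hi) / 2 for lo, hi in intervals_y]
    n = len(xc)
    xbar, ybar = sum(xc) / n, sum(yc) / n
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xc, yc)) / \
         sum((x - xbar) ** 2 for x in xc)
    b0 = ybar - b1 * xbar
    return b0, b1

def predict_center_method(b0, b1, interval_x):
    # Apply the midpoint model to the lower and upper bounds.
    a, b = interval_x
    return (b0 + b1 * a, b0 + b1 * b)
```

Note that with a negative slope the two predicted bounds swap order, a symptom of the ŷ_L ≤ ŷ_U problem discussed for these methods.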

Center and Range Method
With the center and range method, Lima Neto and De Carvalho fit the linear regression model for interval-valued variables by using the information contained in the midpoints and in the interval ranges, in order to improve the quality of the prediction of the center method. The idea is to fit two regression models, the first with the midpoints of the intervals and the second with the ranges of those same intervals. Just like in the center method, there are predictors X_1, . . . , X_p and a response Y, all interval valued; therefore, X is an n × p matrix in which each row is a vector of the training dataset. To fit the first regression model, we proceed in the same way as in the center method: if we denote by X^c the midpoint matrix (x^c_{ij} = (a_{ij} + b_{ij})/2) and by y^c_i = (y_{Li} + y_{Ui})/2 the midpoints of Y, the center and range method fits a first linear regression model over the pairs (x^c_i, y^c_i) for i = 1, . . . , n, with y^c = (y^c_1, . . . , y^c_n)^t. In this case, if (X^c)^t X^c is nonsingular, then the unique solution for β^c is given by:

β̂^c = ((X^c)^t X^c)^{−1} (X^c)^t y^c.

To fit the second regression model, half of the range of each interval is used. For this, we denote by X^r the matrix that contains in each component half of the interval ranges of the matrix X, i.e., x^r_{ij} = (b_{ij} − a_{ij})/2, and we denote by y^r_i = (y_{Ui} − y_{Li})/2 half of the ranges of the interval-valued variable Y. The center and range method then fits a second linear regression model over the pairs (x^r_i, y^r_i) for i = 1, . . . , n, with y^r = (y^r_1, . . . , y^r_n)^t. In this case, if (X^r)^t X^r is nonsingular, then from Equation (5), the solution for β^r is given by:

β̂^r = ((X^r)^t X^r)^{−1} (X^r)^t y^r,

so each case in the training dataset is represented by two vectors, w_i = (x^c_i, y^c_i) and z_i = (x^r_i, y^r_i). The prediction of y = [y_L, y_U] for a new case x is then estimated as follows:

ŷ_L = ŷ^c − ŷ^r and ŷ_U = ŷ^c + ŷ^r,

with ŷ^c = (x^c)^t β̂^c and ŷ^r = (x^r)^t β̂^r. This model cannot mathematically guarantee that ŷ_{Li} ≤ ŷ_{Ui} for all i = 1, . . . , n, a problem addressed by Lima Neto and De Carvalho in [6].
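The center-and-range construction is model-agnostic: any fit/predict pair for point data can be trained once on the midpoints and once on the half-ranges, and the two predictions recombined as [ŷ^c − ŷ^r, ŷ^c + ŷ^r]. A sketch with a one-predictor least-squares fit (all names are ours):

```python
def ols_fit(x, y):
    # Closed-form least squares for one predictor.
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b1 = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / \
         sum((xi - xb) ** 2 for xi in x)
    return (yb - b1 * xb, b1)

def ols_predict(model, x):
    b0, b1 = model
    return b0 + b1 * x

def crm_fit_predict(intervals_X, intervals_y, new_interval, fit, predict):
    # Train one model on midpoints and one on half-ranges, then recombine.
    xc = [(a + b) / 2 for a, b in intervals_X]
    xr = [(b - a) / 2 for a, b in intervals_X]
    yc = [(lo + hi) / 2 for lo, hi in intervals_y]
    yr = [(hi - lo) / 2 for lo, hi in intervals_y]
    center_model, range_model = fit(xc, yc), fit(xr, yr)
    a, b = new_interval
    yc_hat = predict(center_model, (a + b) / 2)
    yr_hat = predict(range_model, (b - a) / 2)
    return (yc_hat - yr_hat, yc_hat + yr_hat)
```

Swapping `ols_fit`/`ols_predict` for a tree, KNN, SVM, or neural network fit yields the corresponding center-and-range variant described in the following subsections.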
Extensions to this model include shrinkage linear regression models for symbolic interval-valued data, which involve a generalization of the ridge and lasso models for interval-valued data. A complete presentation of these models can be found in [7].

Regression Trees Center Method
We begin by dividing the predictor space of interval midpoints in R^p into J non-overlapping regions R^c_1, . . . , R^c_J. We search for a partition of the predictor space into boxes R^c_j that minimizes RSS^c, which can be written as:

RSS^c = Σ_{j=1}^{J} Σ_{i ∈ R^c_j} (y^c_i − ŷ_{R^c_j})^2.

To construct the tree, we first consider all the predictor variables and all the possible binary splits of their values. If the variable is numerical, all the possible split values s in the range of the variable are considered, and if the variable is categorical, we consider all possible partitions of its values into two sets. We then select the variable and the split of that variable that lead to the greatest reduction in RSS^c. This produces two regions corresponding to two branches of the tree. We then continue looking for the best predictor and the best partition in each of the resulting regions, and the process ends when a stopping criterion is reached.
Once the regions R^c_1, . . . , R^c_J have been created, the model can be used to predict the response of a new example x = (x_1, ..., x_p) with x_j = [a_j, b_j] as ŷ_L = c_j, where R^c_j is the region containing x_L, and ŷ_U = c_{j'}, where R^c_{j'} is the region containing x_U; here, c_j is the mean of the response centers of the training observations in R^c_j, x_L = (a_1, ..., a_p), and x_U = (b_1, ..., b_p).

Regression Tree Center and Range Method
We begin by dividing the predictor space of interval midpoints in R^p into J non-overlapping regions R^c_1, . . . , R^c_J, and the predictor space of interval ranges in R^p into L non-overlapping regions R^r_1, . . . , R^r_L. In this case, we search for partitions of the predictor spaces of centers and ranges into boxes R^c_j and R^r_l that minimize RSS^c and RSS^r, respectively, which can be written as:

RSS^c = Σ_{j=1}^{J} Σ_{i ∈ R^c_j} (y^c_i − ŷ_{R^c_j})^2 and RSS^r = Σ_{l=1}^{L} Σ_{i ∈ R^r_l} (y^r_i − ŷ_{R^r_l})^2.

To construct each tree, we first consider all the predictor variables and all the possible binary splits of their values. If the variable is numerical, all the possible split values s in the range of the variable are considered, and if the variable is categorical, we consider all possible partitions of its values into two sets. We then select the variable and the split of that variable that lead to the greatest reduction in RSS^c and RSS^r, respectively. This produces two regions corresponding to two branches of the tree. We then continue looking for the best predictor and the best partition in each of the resulting regions, and the process ends when a stopping criterion is reached.
Once the regions R^c_1, . . . , R^c_J and R^r_1, . . . , R^r_L have been created, the model can be used to predict the response of a new example x = (x_1, ..., x_p) with x_j = [a_j, b_j] as:

ŷ = [ŷ^c − ŷ^r, ŷ^c + ŷ^r],

where ŷ^c = c^c_j for the region R^c_j containing x^c and ŷ^r = c^r_l for the region R^r_l containing x^r, with c^c_j and c^r_l the means of the response centers and ranges of the training observations in R^c_j and R^r_l, respectively.

Random Forest Center Method
In this method, the idea is to build a given number M of regression trees, T c j , on bootstrap sets of the center data, using in each tree a random subset of m of the original predictor variables in X c .
In this case, the prediction of a new example x = (x_1, ..., x_p) with x_j = [a_j, b_j] is the mean over the predictions of the individual trees:

ŷ_L = (1/M) Σ_{j=1}^{M} T^c_j(x_L) and ŷ_U = (1/M) Σ_{j=1}^{M} T^c_j(x_U),

where x_L = (a_1, ..., a_p) and x_U = (b_1, ..., b_p).

Random Forest Center and Range Method
With this method, the idea is to build a given number M of regression trees, T^c_j, on bootstrap sets of the center data and a given number L of regression trees, T^r_j, on bootstrap sets of the range data, using in each tree a random subset of m of the original predictor variables in X^c and X^r. In this case, the prediction of a new example x = (x_1, ..., x_p) with x_j = [a_j, b_j] is

ŷ = [ŷ^c − ŷ^r, ŷ^c + ŷ^r], with ŷ^c = (1/M) Σ_{j=1}^{M} T^c_j(x^c) and ŷ^r = (1/L) Σ_{j=1}^{L} T^r_j(x^r).

Boosting Center Method
We construct trees sequentially on repeatedly modified versions of the training center data, thereby producing a sequence of regression models, G^c_j, whose predictions are then combined and weighted according to the error they produce.
Initially, all the N training examples (x^c_i, y^c_i) have the same weights w^c_i = 1/N. In each following step of the training process, the data are modified by updating these weights w^c_i. At step m, the observations with a higher error under the model G^c_{m−1}(x) induced at the previous step have their weights increased, whereas the weights are decreased for those with a lower error.
In this case, the prediction of a new example x = (x_1, ..., x_p) with x_j = [a_j, b_j] is the weighted combination of the predictions of the individual models:

ŷ_L = Σ_j α^c_j G^c_j(x_L) and ŷ_U = Σ_j α^c_j G^c_j(x_U),

where α^c_j measures the error of the j-th model, x_L = (a_1, ..., a_p), and x_U = (b_1, ..., b_p).

Boosting Center and Range Method
We construct trees sequentially on repeatedly modified versions of the training center and range data, thereby producing two sequences of regression models, G^c_j and G^r_j, whose predictions are then combined and weighted according to the error they produce.
Initially, all the N center training examples (x^c_i, y^c_i) have the same weights w^c_i = 1/N, and the same applies to the range training examples. In each following step of the training process, the data are modified by updating the weights w^c_i and w^r_i. At step m, the observations with a higher error under the models G^c_{m−1}(x) and G^r_{m−1}(x) induced at the previous step have their weights increased, whereas the weights are decreased for those with a lower error.
In this case, the prediction of a new example x = (x_1, ..., x_p) with x_j = [a_j, b_j] is

ŷ = [ŷ^c − ŷ^r, ŷ^c + ŷ^r], with ŷ^c = Σ_j α^c_j G^c_j(x^c) and ŷ^r = Σ_j α^r_j G^r_j(x^r),

where α^c_j measures the error of the j-th center model and α^r_j measures the error of the j-th range model.

K-Nearest Neighbors Center Method
Given a number K and a testing example x = (x_1, ..., x_p) with x_j = [a_j, b_j] in the symbolic dataset, we identify the set N_0 of the K training examples nearest to x^c. The prediction ŷ = [ŷ_L, ŷ_U] of the response variable for x is given by:

ŷ_L = (1/K) Σ_{i ∈ N_0} y_{Li} and ŷ_U = (1/K) Σ_{i ∈ N_0} y_{Ui}.

K-Nearest Neighbors Center and Range Method
Given numbers K^c and K^r and a testing example x in the symbolic dataset, we identify the sets N^c and N^r of the K^c and K^r training examples nearest to x^c and x^r, respectively. The prediction ŷ of the response variable for x is given by:

ŷ = [ŷ^c − ŷ^r, ŷ^c + ŷ^r], with ŷ^c = (1/K^c) Σ_{i ∈ N^c} y^c_i and ŷ^r = (1/K^r) Σ_{i ∈ N^r} y^r_i.

Support Vector Machines Center Method
Given a threshold ε_c, the idea is to define a margin such that examples with residuals within the margin do not contribute to the regression fit, while examples with residuals outside the margin contribute proportionally to their magnitude.
The loss function of this model is given by:

LF(β^c) = C Σ_{i=1}^{n} L_{ε_c}(y^c_i − ŷ^c_i) + Σ_{j=1}^{p} (β^c_j)^2,

where L_{ε_c} is the ε-insensitive function with threshold ε_c. We search for the parameters β̂^c that minimize LF(β^c), so the prediction of y = [y_L, y_U] for a new case x = (x_1, . . . , x_p) with x_j = [a_j, b_j] is estimated as follows:

ŷ_L = (x_L)^t β̂^c and ŷ_U = (x_U)^t β̂^c,

where (x_L)^t = (1, a_1, . . . , a_p) and (x_U)^t = (1, b_1, . . . , b_p).

Support Vector Machines Center and Range Method
Given thresholds ε_c and ε_r, we define margins such that examples with residuals within these margins do not contribute to the regression fit, while examples with residuals outside the margins contribute proportionally to their magnitude.
The loss functions of the center and range models are given by:

LF(β^c) = C Σ_{i=1}^{n} L_{ε_c}(y^c_i − ŷ^c_i) + Σ_{j=1}^{p} (β^c_j)^2 and LF(β^r) = C Σ_{i=1}^{n} L_{ε_r}(y^r_i − ŷ^r_i) + Σ_{j=1}^{p} (β^r_j)^2,

where L_{ε_c} and L_{ε_r} are the ε-insensitive functions with thresholds ε_c and ε_r. We search for the parameters β̂^c and β̂^r that minimize LF(β^c) and LF(β^r), so the prediction of y = [y_L, y_U] for a new case x = (x_1, . . . , x_p) with x_j = [a_j, b_j] is estimated as:

ŷ_L = ŷ^c − ŷ^r and ŷ_U = ŷ^c + ŷ^r, with ŷ^c = (x^c)^t β̂^c and ŷ^r = (x^r)^t β̂^r,

where (x^c)^t = (1, x^c_1, . . . , x^c_p) and (x^r)^t = (1, x^r_1, . . . , x^r_p).

Neural Networks Center Method
We consider a neural network model on the center data; the function f^c that approximates the true relation of these data, y = f^c(x), takes the form of a composition f^c(x) = f^c_n ∘ f^c_{n−1} ∘ · · · ∘ f^c_1(x). In this case, the prediction of a new example x = (x_1, ..., x_p) with x_j = [a_j, b_j] is

ŷ_L = f^c(x_L) and ŷ_U = f^c(x_U),

where x_L = (a_1, ..., a_p) and x_U = (b_1, ..., b_p).

Neural Networks Center and Range Method
We consider two neural network models, one for the center data and another for the range data, with f^c approximating the true relation of the center data and f^r approximating the true relation of the range data. In this case, the prediction of a new example x = (x_1, ..., x_p) with x_j = [a_j, b_j] is

ŷ = [ŷ^c − ŷ^r, ŷ^c + ŷ^r], with ŷ^c = f^c(x^c) and ŷ^r = f^r(x^r).

Experimental Evaluation
As done by Lima Neto and De Carvalho in [6], the evaluation of the results of these interval-valued regression models was carried out using the following metrics: the lower boundary root-mean-squared-error RMSE L , the upper boundary root-mean-squared-error RMSE U , the square of the lower boundary correlation coefficient r 2 L , and the square of the upper boundary correlation coefficient r 2 U .
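These metrics are straightforward to compute from the predicted and observed interval bounds; a sketch with our own helper names:

```python
import math

def rmse(actual, predicted):
    # Root-mean-squared error between two equal-length sequences.
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def r_squared(x, y):
    # Square of the Pearson correlation coefficient.
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    cov = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    vx = sum((a - xb) ** 2 for a in x)
    vy = sum((b - yb) ** 2 for b in y)
    return cov * cov / (vx * vy)

def interval_metrics(y_true, y_pred):
    # y_true, y_pred: lists of (lower, upper) intervals.
    yL, yU = [t[0] for t in y_true], [t[1] for t in y_true]
    pL, pU = [p[0] for p in y_pred], [p[1] for p in y_pred]
    return {"RMSE_L": rmse(yL, pL), "RMSE_U": rmse(yU, pU),
            "r2_L": r_squared(yL, pL), "r2_U": r_squared(yU, pU)}
```

Note that r^2 is undefined when either sequence of bounds has zero variance, which is exactly the situation that produces the NA entries discussed later for some models.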
For the experimental evaluation, we used the following hyperparameters for the models.
• Lasso and ridge: number of folds used in K-fold cross-validation to find the best tuning parameter λ: 10.
• Regression trees: minimum number of observations in a node: 20; maximum depth of any node of the final tree: 10.
• Random forest: number of trees to grow: 500; number of variables randomly sampled at each split:

All examples presented in this section were developed using the package RSDA (R for symbolic data analysis), constructed by the authors of this paper for applications in symbolic data analysis (see [16]).

Cardiological Interval Dataset
To illustrate the application of the methods, we considered data based on [17] and taken from [3], shown in Table 1, in which the pulse rate (Pulse), systolic blood pressure (Sys), diastolic blood pressure (Diast), Art1, and Art2 were recorded as intervals for each of the patients, where Art1 and Art2 are artificial variables added to the data table.
The goal was to predict Pulse. The dataset consisted of 44 rows and 5 attributes. Table 1 presents a glimpse of the dataset. To measure the performance of the models, we split the dataset into training and testing sets, using 70% and 30%, respectively. After closer inspection of Table 2, it can be seen that the neural network had the best metrics for RMSE_L, r^2_L, and r^2_U, and had the second-best RMSE_U, the random forest center and range model being the best one. As a result, the neural network model was the best for this dataset.
As mentioned before, the neural network center and range model consists of 2 neural networks with 1 layer and 10 neurons, and this selection was made to avoid overfitting the small dataset. This simple neural network model was able to capture the relation in the data better than the other models, and there was a large difference in performance, even when compared with the second best. In the previous table, the boosting method was omitted because the dataset is too small for the model to be fitted.

Monte Carlo Experiments and Applications
The usefulness of the methods proposed in this paper was evaluated through experiments with synthetic interval-valued datasets with different linear and nonlinear configurations.

Synthetic Linear Symbolic Interval Datasets
We followed the approach of Lima Neto and De Carvalho in [14] and constructed a symbolic interval dataset with the center and range values of the intervals simulated according to a linear relationship.
Our goal was to construct a symbolic dataset with 375 rows and 4 variables. We used 250 rows as the learning set and 125 as the test set. We followed these steps to construct the synthetic dataset.

1. The random variables X^c_j were uniformly distributed in [20, 40].

2. The random variable Y^c was related to the random variables X^c_j through a linear relation with an additive error ε ∼ U[a, b].
3. The random variable Y^r was related to the variable Y^c according to Y^r = Y^c β* + ε*, and in the same manner, X^r_j was related to X^c_j according to X^r_j = X^c_j β* + ε*, where β* ∼ U[0.5, 1.5] and ε* ∼ U[i, j].

Table 3 shows the different configurations for the values of a, b and i, j. As per Lima Neto and De Carvalho, these configurations took into account the error at the midpoints combined with the degree of dependence between the midpoints and ranges, with two levels of variability, a high variability error U[−20, 20] and a low variability error U[−5, 5], and two degrees of dependence, a high degree of dependence U[1, 5] and a low degree of dependence U[10, 20].
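A generator for one such configuration can be sketched as follows. The specific coefficients of the linear relation for Y^c are our own assumption (the text fixes only the error and dependence distributions), so this is illustrative rather than a reproduction of any row of Table 3.

```python
import random

def make_linear_interval_data(n, a, b, i, j, seed=42):
    # Sketch of one synthetic configuration: X^c ~ U[20, 40]; Y^c linear in
    # X^c plus an error U[a, b]; half-ranges follow X^r = X^c * beta_star + eps
    # and Y^r = Y^c * beta_star + eps, with beta_star ~ U[0.5, 1.5] and
    # eps ~ U[i, j].  The coefficients 2.0 and 5.0 below are assumptions.
    rng = random.Random(seed)
    beta_star = rng.uniform(0.5, 1.5)
    rows = []
    for _ in range(n):
        xc = rng.uniform(20, 40)
        yc = 2.0 * xc + 5.0 + rng.uniform(a, b)   # assumed linear relation
        xr = abs(xc * beta_star + rng.uniform(i, j))  # half-range, kept >= 0
        yr = abs(yc * beta_star + rng.uniform(i, j))
        rows.append(((xc - xr, xc + xr), (yc - yr, yc + yr)))
    return rows
```

Each row is a pair of (lower, upper) intervals, one for the predictor and one for the response, built from the simulated center and half-range.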
We present the results for all the datasets. For dataset D1, we were able to see that the lasso, ridge, and LM center and range models were the best.
For dataset D2, the lasso, ridge, and LM center and range models were the best, and the KNN showed the next best performance.
Once again, for dataset D3, the lasso, ridge, and LM center and range models were the best, without any close competitors.
For dataset D4, the lasso, ridge, LM, and the boost center and range model were the best. Therefore, the boost model was able to perform well in this dataset.
Note that in Tables 4-7, some models have NA for r^2_L or r^2_U; this is because, for those models, either the prediction ŷ_L or ŷ_U had zero standard deviation, and thus the square of the correlation coefficient could not be calculated.

Synthetic Nonlinear Symbolic Interval Datasets
We considered a synthetic interval-valued dataset with a nonlinear relation between the response variable and the explanatory variables for the midpoint and range of the intervals.
In order to test the predictive power of the proposed methods, we followed the approach of Lima Neto and De Carvalho in [18] and constructed a symbolic interval dataset with the values of the centers and ranges of the intervals simulated according to a nonlinear relationship between the response variable and the explanatory variables. Our goal was to construct a symbolic dataset with 3000 rows and 4 variables and to compare the methods using a K-fold cross-validation scheme with 10 folds. We followed these steps to construct the synthetic dataset.

2. The center random variable Y^c was related to the random variables X^c_j according to a logistic function.
3. The range random variables X^r_j were uniformly distributed in the interval [1, 4]; that is, X^r_j ∼ U[1, 4].

4. The range random variable Y^r was related to the random variables X^r_j according to an exponential function.

As per Lima Neto and De Carvalho, this configuration had a high nonlinearity degree at the midpoints and a high nonlinearity degree in the ranges, as defined by the error components.
We present here the results for the synthetic dataset. We note from Table 8 that every CRM method outperformed all of the CM methods. On closer inspection, we noticed that the neural network CRM model had the best cross-validation performance among all methods, and we observed that the proposed nonlinear models generally had better evaluation metrics. The second-best method was the KNN CRM model, followed by the SVM CRM model. Table 9 shows the standard deviation of the metrics of the methods among the 10 folds.
As we can see from Table 9, the standard deviation was low for all methods among the folds, which gave us confidence in our results.

Conclusions and Future Work
In this paper, we proposed 12 new methods to fit regression models to interval-valued variables, all based on the central idea of fitting regression models to the centers and ranges of the intervals and extending the ideas of the nonlinear methods for real-valued data. We presented new approaches to fit regression models for symbolic interval-valued variables, which were shown to improve and extend the center method and the center and range method proposed by Lima Neto and De Carvalho in [6,13,14,18].
In the experimental evaluation, we found that the use of nonlinear methods greatly improved the prediction results in the regression problems. With the cardiological dataset, a simple neural network was able to radically improve the predictions in comparison with the other methods, especially those based on linear methods. In the Monte Carlo experiments, as expected, the linear models were the best on the synthetic linear symbolic interval datasets, and only the boosting center and range model came close in performance on one of the datasets. On the synthetic nonlinear symbolic interval data, we observed the real power and advantages of the nonlinear framework for regression with interval-valued data; in particular, we saw the benefit of using neural networks, as this was once again the model that best captured the underlying structure of the data when combined with the center and range approach. Compared with the linear models, the neural network center and range model had an RMSE_L and an RMSE_U of 0.051, less than half the root-mean-squared error of 0.141 of the center and range models based on linear methods; similarly, the neural network had an r^2_L and an r^2_U of 0.97, much higher than the squared correlation coefficients of 0.77 of those based on linear methods. In addition, we note that almost all of the proposed models outperformed the classical models based on linear approaches and that these results did not involve hyperparameter tuning, which in turn represents an opportunity to further improve the results.
Based on the results found, we not only achieved our goal of extending the tool kit of regression models for interval-valued datasets, which had focused on linear methods, but also demonstrated the predictive advantages of making the extension to nonlinear methods. This is relevant because, in real-life applications, data rarely follow a linear structure.
The methods proposed, just like the original center method, have the problem explained by Lima Neto and De Carvalho in [6]: it cannot be mathematically guaranteed that ŷ_L ≤ ŷ_U. In future work, we will apply the idea proposed in [6], which consists of generating the regression models under certain restrictions that guarantee this constraint is satisfied. This was not included in this paper because it could have confounded the results, since it would not have been clear whether improvements in predictions were due to the application of shrinkage or to the application of restrictions in the regression methods.
Author Contributions: The authors contributed equally to this work. Both the authors have read and agreed to the published version of the manuscript.