Prediction of Fuel Poverty Potential Risk Index Using Six Regression Algorithms: A Case-Study of Chilean Social Dwellings

In recent times, studies about the accuracy of algorithms to predict different aspects of energy use in the building sector have flourished, being energy poverty one of the issues that has received considerable critical attention. Previous studies in this field have characterized it using different indicators, but they have failed to develop instruments to predict the risk of low-income households falling into energy poverty. This research explores the way in which six regression algorithms can accurately forecast the risk of energy poverty by means of the fuel poverty potential risk index. Using data from the national survey of socioeconomic conditions of Chilean households and generating data for different typologies of social dwellings (e.g., form ratio or roof surface area), this study simulated 38,880 cases and compared the accuracy of six algorithms. Multilayer perceptron, M5P and support vector regression delivered the best accuracy, with correlation coefficients over 99.5%. In terms of computing time, M5P outperforms the rest. Although these results suggest that energy poverty can be accurately predicted using simulated data, it remains necessary to test the algorithms against real data. These results can be useful in devising policies to tackle energy poverty


Energy Poverty: Definition and Conceptualization
Energy poverty, also called fuel poverty or energy vulnerability, takes place when households cannot keep comfortable temperatures inside their homes or cannot access energy services at a reasonable cost [1,2]. This phenomenon has received the attention of the scientific community and society in recent years due to its political and social implications [3,4]. Previous research has shown that energy poverty is a driver for the physical deterioration of residents [5,6] and is even responsible for a higher death rate in winter due to the poor thermal conditions inside buildings [7,8]. There is a wide consensus about the fact that energy poverty stems from a combination of high-energy prices, low family income, inefficient buildings, and outdated electrical household appliances [9,10]. Additionally, occupant behavior can contribute to new cases of energy poverty [11,12].
The measurement of energy poverty is a challenge because its condition is culturally sensitive and private, as well as temporal and dynamic [2]. It is a relative concept influenced by different variables, whose importance may vary depending on the context [13]. The limited availability of data and indicators, together with a lack of consensus on how energy poverty should be conceptualized and measured, makes the matter even worse. Energy poverty indicators are thus a pressing need in this research and political landscape [14]. The United Kingdom pioneered the study of fuel poverty in 1991, with the Table 1. Variation of ranges of Chilean deciles in recent years [38]. Second, the energy expenditure of households can be modeled according to the gap between the indoor and outdoor temperature. In this regard, a much-debated question is whether to use a static or an adaptive thermal comfort model. The static comfort model, which is adopted by many national standards, considers static setpoint temperatures. However, it fails to explain people's ability to adapt themselves to the thermal variations inside their homes within certain limits. Current Chilean standards use this approach ( Table 2). The adaptive comfort model builds upon the fact that inhabitants can adapt to thermal variations taking certain actions: Adapting their clothing and ventilating the building. These adaptive measures are feasible when indoor temperature, which is a function of the outdoor thermal variations [39], is within a certain range. Previous research has shown that adaptive thermal models can better represent the real energy consumption of buildings, which can vary between 10% and 18% when compared with the theoretical approach of the static model [40].

Year
Since evidence suggests that the adaptive comfort model can better explain the real energy consumption of buildings, the authors have recently examined the influence of the application of such a model on the quantification of energy poverty and have developed a new index called the fuel poverty potential risk index (FPPRI). The theoretical basis is detailed in previous publications by the authors [41] and also has been tested against different scenarios of climate change in Chile [42], also using advanced simulation techniques, such as multiple linear regression [37].
Evidence from these previous studies suggests that these techniques can predict the risk of fuel poverty with acceptable accuracy, but it still remains unknown whether other advanced techniques may outperform the former. There is still uncertainty whether fuel poverty can be predicted with different regression algorithms. The specific objective of this study is to test the accuracy of six regression algorithms to predict the risk of fuel poverty, expressed by the FPPRI (i) multilayer perceptron (MLP); (ii) K-nearest neighbors (K-NN); (iii) classification and regression tree (CART); (iv) random forest (RF); (v) M5P; and (vi) support vector regression (SVR). These algorithms were chosen because they are commonly used in the prediction of energy consumption in the building industry, as pointed out by a Sustainability 2021, 13, 2426 4 of 30 recent study [43], which concluded that interplay between the architecture of the models and the predicted phenomenon exerted a strong influence on the final accuracy. Debate continues about, which is the algorithm that best-fits each problem, and therefore this exploratory study aims to unravel the accuracy of these six advanced regression techniques in predicting the risk of falling into fuel poverty of financially deprived households in Chile by means of the FPPRI index.
It is expected that this research will contribute to a better understanding of how to predict energy poverty in advance, therefore making it possible to detect and tackle the problem in a swift and effective way. This new understanding should help designers and policymakers in building and delivering affordable and efficient social dwellings and in allocating financially deprived families in new houses whose energy expenditure would be suitable for their income level.
This paper is organized in the following way: First, an explanation of the predictive models is provided. Second, the predictor variables for the models are determined, and afterward, the data processing and results are presented. Finally, the results obtained by all models are compared and evaluated, which leads to the discussion of results and conclusions on the feasibility of such models.

Regression Algorithms
As previously mentioned, six regression algorithms were considered in this study. For each one of them, a brief explanation of its architecture and its main characteristics is provided. This information will be used later on to discuss the results of each one of them.

Multilayer Perceptron
A neural network is a statistical model simulating the neurological brain structure to solve linear and nonlinear problems [44]. This model is a computation paradigm, which allows complex problems, both classification [45,46] and regression problems [47,48], to be tackled. Neural networks have been widely used in the field of energy analysis over the last years: (i) Deb et al. [49] developed a neural network model to predict the energy saving associated with HVAC in office buildings and compared the results to multiple linear regression. The results determined a better performance for the neural network than linear regression, with a correlation coefficient higher than 37.25%; (ii) in a later study also by Deb et al. [50], a model was developed to predict the cooling load of the next day using data of the previous five days. The methodology showed an adequate accuracy, with a correlation higher than 94% in the days analyzed; (iii) Magalhães et al. [51] developed an artificial neural network model to predict the heating energy demand based on the occupants' behavior. The results obtained models with a correlation coefficient higher than 93%; (iv) Kljajic et al. [52] developed a neural network to predict the efficiency of boilers according to the operating performance as well as to predict possible improvement measures; and (v) Kialashaki and Reisel [53] conducted a large-scale application of the algorithm, designing a model to estimate the energy demand in the residential sector of the United States using historical data from the last years.
From the various architectures of artificial neural networks, MLPs are those providing the best features because they are supervised models with universal approximation capacities [54,55]. The MLPs are characterized by presenting an architecture of three or more layers (see Figure 1): an input layer, one or several intermediate layers, and an output layer. There are a series of nodes in each layer. The output value of each neuron is obtained by the sum of values of the input neurons weighted by synaptic weights and by applying an activation function (Equation (1)). These connections spread to the output layer (Equation (2)), obtaining the model's response (Ŷ MLP ).
where x j are the values of the input layer, w (1) k0 and x 0 are the weight and the input value of the bias neuron of the input layer, respectively, w (1) kj are the weights of the hidden layer, w (2) l0 and y 0 are the weight and the input value of the bias neuron of the hidden layer, respectively, w (2) lk are the weights of the output layer, y k is the output of a neuron of the hidden layer, and σ is the activation function. For this study, a sigmoidal activation function was used both in the hidden and the output layer (Equation (3)). The advantage of this kind of function is the possibility of comprising an infinite input set into a finite output set.
nability 2021, 13, x FOR PEER REVIEW where are the values of the input layer, ( ) and are th value of the bias neuron of the input layer, respectively, ( ) are den layer, ( ) and are the weight and the input value of the den layer, respectively, ( ) are the weights of the output laye neuron of the hidden layer, and is the activation function. For activation function was used both in the hidden and the output l advantage of this kind of function is the possibility of comprising a finite output set. The main objective of the algorithm is the adjustment of synaptic weights guaranteeing that the predicted output value for each input vector is near to the actual output value. For this purpose, a learning algorithm is applied to a training dataset. The models were trained by backpropagation [56], using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm [57], which belongs to the set of quasi-Newton methods. The number of nodes and the number of hidden layers is analyzed to determine the best model [47]. For this study, architectures of 1 hidden layer were considered because they had better performance than more complex structures [58], but the number of nodes in such layer was analyzed.

K-Nearest Neighbors
The K-NN algorithm (also known as instance-based learning with parameter K) classifies observations based on the majority class between the nearest observations [59]. These observations, or neighbors, are chosen from the training dataset in which the classification value is known. The K-NN algorithm is used for class label problems [60] and regression analysis [61], implying their use in the energy analysis of installations and buildings: (i) Ramakrishna Madeti and Singh [62] developed a K-NN model to detect failures in photovoltaic devices. The percentage of instances correctly classified by this model was higher than 98%, obtaining better performance than other existing models; (ii) Rodger [63] developed a K-NN model to predict the heating energy demand. The values predicted by the model presented a correlation higher than 94% with respect to the actual values; and (iii) a model to determine the number of occupants inside the building during the monitoring was used [64]. The results obtained by the model designed in this study presented a higher accuracy that those obtained by linear discriminant functions.
In the regression analysis, the K-NN algorithm determines the value of the observation analyzed as an average of the nearest neighbors' values (K) of the training dataset. To deter-Sustainability 2021, 13, 2426 6 of 30 mine the closeness between observations, a metric distance is used. This distance strongly affects the performance of the model [65], so it is necessary to study which distance presents a better performance in the model. In the present study, the Minkowski distance was used (Equation (4)). This distance is characterized by the fact that the Minkowski distance is equal to other metric distances depending on the value assigned to q [66]: (i) the Manhattan distance (q equal to 1); (ii) the Euclidean distance (q equal to 2); and (iii) the Chebyshev distance (when q is near to the positive infinity). After determining the distances, the output value of the model is given by Equation (5). The output value (Ŷ kNN ) is obtained as the average of values (y m ) of the K nearest points. As can be seen, K is a key parameter in the model's response [67]. Both q and K were therefore analyzed due to their influence on the performance of the model.

Classification and Regression Tree
The CART algorithm, developed by Breiman et al. [68], constructs models in a reverse tree shape. Such models divide the input space into subregions, simplifying complex with simple problems [69]. Trees are composed of internal nodes corresponding to variables, arches to values of the root node, and leaves to the value of the dependent variable. Trees are easy to interpret because a scheme of nodes and leaves is available, thus easing the understanding of the solution adopted for the problem [70]. For this reason, trees are used in several works: (i) Tso and Yau [71] developed models to predict energy consumption by using three different algorithms: CART, MLP, and linear regression. The results revealed that both the CART and the MLP model obtained better performance than the linear regression; (ii) Mousa et al. [72] analyzed the use of a CART model to estimate the air change rate in buildings; and (iii) in another study [73], a CART model was developed to predict the monthly energy consumption in residential buildings. The results showed that the tree model obtained better performance than linear regression and multivariate adaptive regression splines.
The optimal structure is obtained by the binary recursive partitioning process during the development of the model: each node of the model has a partitioning rule. This rule is established by decreasing the total residual sum square. After carrying out the induction, the application of the pruning (that is, the removal of the inefficient nodes) allows the complexity of the model to be generalized and reduced [74]. The depth of the tree and the minimum number of instances per node is needed to be established in the configuration of the algorithm. These parameters affect the complexity and performance of the model, so they were analyzed.

Random Forest
As indicated above, the simplicity and interpretation of CART models make their use easier. However, many studies reflect the limitations of such models [75,76]. RF allows a solution to these problems to be adopted by creating a set of CART models generated in parallel to reduce the variance and the bias of the RF model [77,78]. RF is an ensemble learning algorithm, which obtains a better behavior than an individual model [79]. Another advantage of RF is the possibility of using large datasets, as well as not being affected by atypical values [80]. This aspect allows this model to be used to predict energy consumption. In this sense, Li et al. [81] developed an RF model to predict the daily electricity consumption in companies. The model had a more accurate prediction than artificial neural networks and support vector machines. Likewise, its monthly [82] and hourly [83] use as a predictive control system of HVAC systems has been analyzed in some studies.
For the RF training, N bootstrapped sample sets are drawn from the training dataset [78]. Each bootstrapped sample allows a forest regression tree to be generated. Moreover, each node of each tree is divided using a subset of predictors m randomly selected. This reduces the influence of the strongest predictors. After generating the set of T-trees and following the RF model structure (see Figure 2), the regression estimator (Ŷ RF ) is as follows: where T is the number of trees, andŶ t is the output of the t-th tree.
consumption. In this sense, Li et al. [81] developed an RF model to predict the daily electricity consumption in companies. The model had a more accurate prediction than artificial neural networks and support vector machines. Likewise, its monthly [82] and hourly [83] use as a predictive control system of HVAC systems has been analyzed in some studies.
For the RF training, bootstrapped sample sets are drawn from the training dataset [78]. Each bootstrapped sample allows a forest regression tree to be generated. Moreover, each node of each tree is divided using a subset of predictors m randomly selected. This reduces the influence of the strongest predictors. After generating the set of -trees and following the RF model structure (see Figure 2), the regression estimator ( ) is as follows: where is the number of trees, and is the output of the -th tree. The number of trees used by the model influences its performance [84]. For this reason, different numbers of trees were analyzed to obtain the model with the most efficient behavior.

M5P
The M5P algorithm (also known as M5) is an evolution of the CART algorithm [85,86]. Unlike the CART model, M5P combines decision trees with multiple regression: a decision tree is constructed following the same structure of reverse tree from the CART model, but a multiple linear regression (MLR) model is adjusted in each leaf of the model (Equation (7)). The algorithm, therefore, develops an MLR model in each subregion. In the development of the M5P tree, the internal variation of subsets for the class values of each branch is minimized instead of maximizing the information gain. After constructing the model, the pruning allows the overfitting to be reduced. The advantages of the models generated by this algorithm are the efficient use of huge amounts of numeric variables and their robustness because of the lack of values in the instances of the dataset analyzed [87,88]. The use of this model for the energy characterization of buildings has therefore increased in recent years. In this sense, Li et al. Afsarian et al. [89] developed an M5P model to predict the total energy consumption in a reference building. The results obtained a greater than 90% in most datasets. In another study by Jeffrey Kuo et al. [90], a comparative study of regression models was conducted to estimate the energy consumption in convenience stores of Taiwan. The use of four different models (M5P, Gaussian processes, linear regression, and support vector regression) was compared. The results obtained by M5P obtained a more accurate estimation than the other models. Likewise, M5P has a good performance in determining oscillations in energy prices. In a study by The number of trees used by the model influences its performance [84]. For this reason, different numbers of trees were analyzed to obtain the model with the most efficient behavior.

M5P
The M5P algorithm (also known as M5) is an evolution of the CART algorithm [85,86]. Unlike the CART model, M5P combines decision trees with multiple regression: a decision tree is constructed following the same structure of reverse tree from the CART model, but a multiple linear regression (MLR) model is adjusted in each leaf of the model (Equation (7)). The algorithm, therefore, develops an MLR model in each subregion. In the development of the M5P tree, the internal variation of subsets for the class values of each branch is minimized instead of maximizing the information gain. After constructing the model, the pruning allows the overfitting to be reduced. The advantages of the models generated by this algorithm are the efficient use of huge amounts of numeric variables and their robustness because of the lack of values in the instances of the dataset analyzed [87,88]. The use of this model for the energy characterization of buildings has therefore increased in recent years. In this sense, Li et al. Afsarian et al. [89] developed an M5P model to predict the total energy consumption in a reference building. The results obtained a R 2 greater than 90% in most datasets. In another study by Jeffrey Kuo et al. [90], a comparative study of regression models was conducted to estimate the energy consumption in convenience stores of Taiwan. The use of four different models (M5P, Gaussian processes, linear regression, and support vector regression) was compared. The results obtained by M5P obtained a more accurate estimation than the other models. Likewise, M5P has a good performance in determining oscillations in energy prices. In a study by Azofra et al. [91], a model was developed to determine the variations of the cost of electricity tariffs because of the generation of photovoltaic and wind power in Spain. The model was used to analyze various scenarios, thus determining the possible saving in the cost of electricity tariff.
Sustainability 2021, 13, 2426 where β 0 is the independent term, β i are the regression coefficients, x i are the predictor variables, and ε is the error.

Support Vector Regression
SVR is an application for regression analysis of the support vector machines [92]. SVR consists of transforming the root input variable in the space of higher dimension where two classes are separated by an optimal hyperplane [93,94]. Some advantages of SVR are the facts that they are less subject to overfitting [95] and their good behavior when tackling nonlinear problems [96]. It is an algorithm widely developed in the energy control of installations, such as lighting [97] and HVAC systems [98]. Likewise, SVR has been used in some studies to estimate the energy consumption in residential buildings [99,100] and offices [101,102].
For their mathematical approach, support vector machines approach the relationship between the output and input parameters by using the following equation: where ϕ(x) is representative of a nonlinear mapping function, w is a weight vector, and b is the bias term. w and b can be predicted by minimizing the regularized risk function (Equation (9)), guaranteeing the restrictions of Equation (10).
where N is the sampling number, C is penalty parameter to control the compensation between the regularization term ( w 2 ) and the training error, ψ is the maximum error allowed, ξ i and ξ * i represent the distance of the actual values from the upper and lower limit of the error allowed, respectively.
Finally, the output of the SVR (Equation (11)) is obtained by introducing Lagrange multipliers (δ i and δ * i ) and a Kernel function (K(x, x i )). In this regard, it is essential to determine which kernel function is used. There are different kernel functions: linear, polynomial, sigmoidal, and radial basis function (RBF). For this study, the RBF kernel is used (Equation (12)) because models with the best performance are obtained, and a lower number of parameters should be considered to improve the behavior of the model [74,103].
The parameters that should be considered to improve the performance of the model are C and γ. To determine the model with the best performance, different combinations of C and γ were used.

Description of the Dataset
Based on the calculation methodology by Pérez-Fargallo et al. [31,36], the procedure to determine the dataset was described in a previous study [37]. The case studies were configured using different parameters of the A1 and B1 typology of the social dwelling of the Chilean Ministry of Housing and Urbanism (MINVU in Spanish) [104]. A total of 7776 case studies were configured. Likewise, each case study is considered to belong to each of the 5 poorest deciles [105]. Thus, 5 different datasets per decile, composed of 7776 instances, were obtained. In total, 38,880 cases were considered.
The predictor variables were as follows: (i) orientation (OR), (ii) form ratio (FR), (iii) volume (V), (iv) surface in contact with the ground (S G ), (v) horizontal surface area in contact with another dwelling (S H ), (vi) roof surface area (S R ), (vii) vertical surface area in contact with another dwelling (S V ), (viii) shadow distance (D), (ix) shadow height (H), (x) energy price (EP), and (xi) income (IN). Figure 3 shows Pearson's correlation coefficients and the p-values. For nearly all of them, the null hypothesis can be rejected, and the Pearson's coefficient shows no evident correlation. A more surprising correlation is found between the group of variables related to the design of the dwelling (S G , S R , S H , S V and V), for, which the null hypothesis cannot be rejected, and moderate correlations are found in some cases. This is because those are interweaved in the design of the dwelling, but this does not compromise the accuracy of the model. The dependent variable was the FPPRI, which was logarithmically transformed to improve the accuracy of the final model.

Description of the Dataset
Based on the calculation methodology by Pérez-Fargallo et al. [31,36], the procedure to determine the dataset was described in a previous study [37]. The case studies were configured using different parameters of the A1 and B1 typology of the social dwelling of the Chilean Ministry of Housing and Urbanism (MINVU in Spanish) [104]. A total of 7776 case studies were configured. Likewise, each case study is considered to belong to each of the 5 poorest deciles [105]. Thus, 5 different datasets per decile, composed of 7776 instances, were obtained. In total, 38,880 cases were considered.
The predictor variables were as follows: (i) orientation (OR), (ii) form ratio (FR), (iii) volume (V), (iv) surface in contact with the ground (SG), (v) horizontal surface area in contact with another dwelling (SH), (vi) roof surface area (SR), (vii) vertical surface area in contact with another dwelling (SV), (viii) shadow distance (D), (ix) shadow height (H), (x) energy price (EP), and (xi) income (IN). Figure 3 shows Pearson's correlation coefficients and the P-values. For nearly all of them, the null hypothesis can be rejected, and the Pearson's coefficient shows no evident correlation. A more surprising correlation is found between the group of variables related to the design of the dwelling (SG, SR, SH, SV and V), for, which the null hypothesis cannot be rejected, and moderate correlations are found in some cases. This is because those are interweaved in the design of the dwelling, but this does not compromise the accuracy of the model. The dependent variable was the FPPRI, which was logarithmically transformed to improve the accuracy of the final model.

Training and Validation Procedure
As indicated above, five datasets containing 7776 instances were obtained per decile. Likewise, a full dataset with the case studies of the five deciles was generated (38,880 observations). These datasets were used to generate 6 different models for each regression algorithm: one per each decile (D1, D2, D3, D4, and D5) and another per all deciles (T). The datasets were randomly divided into two sets: 75% of the observations corresponded to the training dataset, and the remaining 25% to the testing dataset.
For the models' training, a 10-fold cross-validation was carried out. Cross-validation allows the bias and the variance of the model to be reduced [106]. All training datasets were randomly divided into 10 subsets. For each fold, 9 subsets were used for the training, and the remaining for the testing (Figure 4). The performance of the model is obtained by the average value of the 10-fold.

Training and Validation Procedure
As indicated above, five datasets containing 7776 instances were obtained per decile. Likewise, a full dataset with the case studies of the five deciles was generated (38,880 observations). These datasets were used to generate 6 different models for each regression algorithm: one per each decile (D1, D2, D3, D4, and D5) and another per all deciles (T). The datasets were randomly divided into two sets: 75% of the observations corresponded to the training dataset, and the remaining 25% to the testing dataset.
For the models' training, a 10-fold cross-validation was carried out. Cross-validation allows the bias and the variance of the model to be reduced [106]. All training datasets were randomly divided into 10 subsets. For each fold, 9 subsets were used for the training, and the remaining for the testing (Figure 4). The performance of the model is obtained by the average value of the 10-fold.  Three statistical parameters were used to evaluate the performance of the models: (i) the correlation coefficient (R 2 ) (Equation (13)), the root mean square error (RMSE) (Equation (14)), and the mean absolute error (MAE) (Equation (15)). The use of these parameters allows the performance of the models to be correctly defined [107] and also are amongst the most used when comparing the performance of algorithms with different architecture [43]. The lower the value of RMSE and MAE, the greater the accuracy. The threshold value of R2 was set at 0.95 to consider the predictions sufficiently accurate [37,48].
where m i is the model's prediction, t i is the actual value, and n is the number of instances in the dataset.

Results and Discussion
This section is organized as follows: First, the results from each algorithm are presented, along with a discussion of the particular features of their architecture. Finally, all algorithms are compared on the basis of the three parameters: R 2 , MAE, and RMSE.

MLP Models
The performance of the MLPs model was analyzed on the basis of the number of neurons in the hidden layer. As mentioned in Section 2.1, only one hidden layer was considered because it usually delivers better performance than more complex architectures [58]. Table 3 indicates the optimal number of neurons obtained for each model and the performance obtained in both the training and testing phases. What stands out in this table is that the optimal number of neurons in the hidden layer was very similar between the different models (between 12 and 14 nodes). Regarding the accuracy, R 2 was greater than 99.8% for all models in the training and testing phases, MAE was lower than 0.010, and RMSE lower than 0.012. The accuracy did not vary when considering the full dataset or each decile separately. For K-NN models, two parameters, q and K, have an influence on its accuracy. In this case, values of K ranged between 1 and 25, and values of q between 1 and 4 ( Figure 5). A different behavior was observed depending on the type of dataset used: for models per deciles, a value of q of 2 (corresponding to the Euclidean distance) obtained the best performance, whereas, for K-NN T , a value of q of 4 obtained the best statistical parameters. In this regard, the number of close instances used by the model varied: for models per deciles, values of K between 9 and 11 obtained acceptable results (in the testing, MAE was between 0.059 and 0.072, and RMSE between 0.076 and 0.094), whereas, for K-NN T , the number of K was 2. The performance of K-NN T was lower than models per deciles because a low number of instances was used to obtain a representative result (in the testing, MAE was 0.082 and RMSE was 0.110). In this sense, both in the training and testing phases, R 2 was always below 0.95, except for one case, MAE and RMSE were very similar for all sets of data (Table 4).
parameters. In this regard, the number of close instances used by the model varied: for models per deciles, values of between 9 and 11 obtained acceptable results (in the testing, was between 0.059 and 0.072, and between 0.076 and 0.094), whereas, for -NNT, the number of was 2. The performance of -NNT was lower than models per deciles because a low number of instances was used to obtain a representative result (in the testing, was 0.082 and was 0.110). In this sense, both in the training and testing phases, was always below 0.95, except for one case, and were very similar for all sets of data (Table 2).

CART Models
In this model, the parameters of depth and instances per node were analyzed: the depth of the tree oscillated between 2 and 12 (the maximum value was established according to the number of variables of the dataset), and the number of observations per node ranged between 2 and 30. Figure 6 is a sample of the analysis carried out. The performance of the model is strongly influenced by the depth of the tree: the higher the depth, the greater the correlation (values of greater than 97.80%) and the lower the MAE and RMSE. The most optimal depth for all models was, therefore, the maximum suggested (12 levels). Although this could mean overfitting to training data, the performance obtained in the testing phase was similar to that obtained in training (Table 3); therefore, it was not

CART Models
In this model, the parameters of depth and instances per node were analyzed: the depth of the tree oscillated between 2 and 12 (the maximum value was established according to the number of variables of the dataset), and the number of observations per node ranged between 2 and 30. Figure 6 is a sample of the analysis carried out. The performance of the model is strongly influenced by the depth of the tree: the higher the depth, the greater the correlation (values of R 2 greater than 97.80%) and the lower the MAE and RMSE. The most optimal depth for all models was, therefore, the maximum suggested (12 levels). Although this could mean overfitting to training data, the performance obtained in the testing phase was similar to that obtained in training (Table 5); therefore, it was not considered necessary to generalize the tree (i.e., to simplify it) to obtain a better performance in new observations. When comparing the accuracy for the full dataset and for each decile, the full model (CART T ) delivered better results than each decile separately in the testing phase. R 2 increased between 1.0 and 1.3%, and MAE and RMSE had values lower than 0.008 and 0.010, respectively. This is because the structure obtained is better adjusted to the full dataset than to the models per deciles. considered necessary to generalize the tree (i.e., to simplify it) to obtain a better performance in new observations. When comparing the accuracy for the full dataset and for each decile, the full model (CARTT) delivered better results than each decile separately in the testing phase.
increased between 1.0 and 1.3%, and and had values lower than 0.008 and 0.010, respectively. This is because the structure obtained is better adjusted to the full dataset than to the models per deciles.

RF Models
In this model, the number of trees influences the performance of the model. For this reason, the behavior of the model with a number of trees between 1 and 200 was analyzed by using both the same depth and number of instances used in the CART models. Figure 7 represents the performance obtained for the RFD1 model, whereas Table 4 indicates the values obtained for all models. As can be seen in Figure 7, the performance of the model improved as the number of trees increased. However, when the optimal value was reached (149 in the case of D1), the performance of the model did not improve: the number of trees only increased the time required to train the model. Likewise, the optimal number obtained for each model was very similar, varying between 143 and 153. With respect to the models obtained, the performance was very similar for models per deciles and the full model. In this sense, obtained values between 99.3 and 99.6%, whereas and oscillated between 0.021 and 0.023 and between 0.027 and 0.030, respectively.

RF Models
In this model, the number of trees influences the performance of the model. For this reason, the behavior of the model with a number of trees between 1 and 200 was analyzed by using both the same depth and number of instances used in the CART models. Figure 7 represents the performance obtained for the RF D1 model, whereas Table 6 indicates the values obtained for all models. As can be seen in Figure 7, the performance of the model improved as the number of trees increased. However, when the optimal value was reached (149 in the case of D1), the performance of the model did not improve: the number of trees only increased the time required to train the model. Likewise, the optimal number obtained for each model was very similar, varying between 143 and 153. With respect to the models obtained, the performance was very similar for models per deciles and the full model. In this sense, R 2 obtained values between 99.3 and 99.6%, whereas MAE and RMSE oscillated between 0.021 and 0.023 and between 0.027 and 0.030, respectively.  In M5P models, no parameter was analyzed to optimize their performance. As can be seen in Figure 8, M5P determines the optimal depth of the model. Therefore, it is not necessary to analyze this aspect in such performances. Regarding the values of the statistical parameters, the performance was very similar (Table 5): was between 0.995 and 0.998, between 0.014 and 0.016, and between 0.018 and 0.020.  In M5P models, no parameter was analyzed to optimize their performance. As can be seen in Figure 8, M5P determines the optimal depth of the model. Therefore, it is not necessary to analyze this aspect in such performances. Regarding the values of the statistical parameters, the performance was very similar (Table 7): R 2 was between 0.995 and 0.998, MAE between 0.014 and 0.016, and RMSE between 0.018 and 0.020.

SVR Models
As indicated in Section 2.1.6, both C and γ affect the performance of the SVR models. Different combinations of these two parameters were therefore analyzed. In this sense, C adopted values between 0.1 and 4, whereas γ was analyzed using values between 0.05 and 0.2. Figure 9 represents the results of the various combinations in SVR D1 , and Table 8 indicates the optimal configuration and the results obtained by each model. Several aspects can be appreciated from the analysis: (i) optimal values of the models were obtained from values of γ of 0.15. Although best values in the statistical parameters were obtained as the value of γ increased, the value of C obtained the same values for lower values of γ. Values γ lower than 0.15 did not obtain the same optimal performance; (ii) high values of γ and C influenced the training time. In this regard, the time difference between different values of γ was greater than 1000 s, slowing down the training with high γ. Thus, the optimal value of γ obtained for all models was 0.15, whereas the value of C varied between 1.6 and 2.1 (Table 8). Likewise, the performance obtained both in the training and testing phases was similar. SVR T presents R 2 quite similar to models per deciles in the testing, MAE is greater than 0.002, and RMSE ranges between 0.004 and 0.006. Sustainability 2021, 13, x FOR PEER REVIEW 15 of 31 Figure 8. Structure of the M5PD1 model. As indicated in Section 2.1.6, both and affect the performance of the SVR models. Different combinations of these two parameters were therefore analyzed. In this sense, adopted values between 0.1 and 4, whereas was analyzed using values between 0.05 and 0.2. Figure 9 represents the results of the various combinations in SVRD1, and Table 6 indicates the optimal configuration and the results obtained by each model. Several aspects can be appreciated from the analysis: (i) optimal values of the models were obtained from values of of 0.15. Although best values in the statistical parameters were obtained as the value of increased, the value of obtained the same values for lower values of . Values lower than 0.15 did not obtain the same optimal performance; (ii) high values of and influenced the training time. In this regard, the time difference between different values of was greater than 1000 s, slowing down the training with high . Thus, the optimal value of obtained for all models was 0.15, whereas the value of varied between 1.6 and 2.1 (Table 6). Likewise, the performance obtained both in the training and testing phases was similar. SVRT presents quite similar to models per deciles in the testing, is greater than 0.002, and ranges between 0.004 and 0.006.

Comparison of Models
The algorithms were compared on the basis of two parameters: (I) the individual performance obtained by each decile and (ii) the performance obtained in the estimation of the 5 deciles together. Figure 10 shows the scatter plots comparing the actual and predicted values of the total dataset (as well as the values of the statistical parameters), and Table 9 is a compilation of the statistical parameters obtained per deciles separately. Likewise, the point obtained by each model per decile is represented in Appendix A ( Figures A2-A6). Several aspects regarding the comparative analysis are worth noting. First, the estimation conducted by each model was merged in a combined estimation (D1-5) with the corresponding values of R 2 , MAE, and RMSE. Then, the full model of each algorithm was used to evaluate the estimation conducted in each decile.
Regarding the performance obtained by each decile, there were different results. The algorithm that delivered with the worst performance was K-NN: R 2 was lower than 95% for almost all decile models (only D1 obtained R 2 of 95%), whereas K-NN T had a different behavior depending on the decile analyzed, obtaining the best behavior in deciles 3, 4, and 5, and the worst in deciles 1 and 2 (0.880 and 0.939). Therefore, the performance obtained using the K-NN algorithm was not considered adequate (see Figure 10).
The use of the other algorithms gave optimal correlation coefficients (higher than 98% in all cases). The analysis of the error parameters revealed that CART and RF models obtained error values higher than MLP, M5P, and SVR: MAE and RMSE increased regarding the highest value of M5P of 21.43% and 23.53%, respectively. Therefore, the performance obtained by these two models, although being acceptable, did not present the adjustment degree of MLP, M5P, and SVR. Regarding these three algorithms, the results were quite similar for all the cases analyzed: R 2 oscillated between 0.996 and 0.999, MAE between 0.008 and 0.018, and RMSE between 0.010 and 0.026. Despite the performance obtained by the full models in each decile was optimal, SVR T had the worst performance in deciles 1 and 2. This aspect influenced the estimation conducted by the full model, which was worse than that conducted by the set of decile models (MAE decreased by 0.002, and RMSE by 0.005). On the other hand, the scatter plots obtained by MLP and M5P were similar, both with an adequate adjustment.
Thus, these two algorithms obtained the best performance, particularly if the time required to train the SVR models is considered (Table 10). The time needed to carry out the training of these models was more than 1990s, and SVR T needed more than 18 h for the training. The remaining algorithms needed quite short times: MLP and K-NN

Conclusions
The present exploratory study set out with the aim of clarifying the best algorithm to predict FPPRI of financially deprived households in Chile. Considering their performance, it can be concluded that multilayer perceptron, support vector regression and M5P are, in this order, the best option. Correlation coefficients were higher than 99.5%, and the values of MAE and RMSE were lower than 0.018 and 0.026, respectively. Including computing time as a variable, SVR should be discarded in favor of M5P and MLP. Finally, since M5P delivers the shortest computing time for all of them, it could be labeled as the best option in terms of speed and performance.
Comparison of the findings with those of a previous study by the authors that assessed the accuracy of multiple linear regression (MLR) models also confirms that M5P outperforms the former [37]. The MLR delivered values of R2 between 0.807 and 0.963 and values of RMSE between 0.013 and 0.091. A possible explanation for this might be the architecture of the M5P model itself, which combines a CART algorithm with an MLR model in each subregion; that is, each leaf of the tree has one MLR model, which is the regression algorithm used in the previous study. As pointed out by other authors, the particular architecture of this model seems to be suited for the prediction of the variations of energy prices [91] and the energy consumption [89], as in our case. In terms of computing time, the CART algorithm was the fastest, and adding an MLR model to each leaf doubles computing time but still places this algorithm as the second-fastest. Additionally, this study has clarified that the design of the M5P model should follow some recommendations. Even though the model automatically determines the depth of the tree, caution should be exercised so as not to consider more than 150 trees, as the results from the original CART model suggest.
These results also have two main methodological implications. First, the M5P models can be easily programmed in different languages because it is composed by if-then rules, and also because it divides the input space into a linear regression model, delivering an equation whose mathematical meaning is easier to grasp. Second, the FPPRI was tested against a nationally representative sample size in the Chilean context, but the methodology can be extrapolated to other countries by making the necessary adjustments in the predictor variables.
The most important limitation of the present study lies in the fact that the tested algorithms are data-driven or black-box models; that is, all data were artificially generated. Further studies need to be carried out using real data from the energy expenditure of Chilean households in order to validate the results from the present study. The biggest challenge would be compiling a statistically representative pool of data about the energy expenditure of households. The CASEN survey still does not cover this aspect, which should be considered in the future as part of the so-called "multidimensional poverty".
Overall, and considering its scientific contribution and also its limitation, this study has provided a deeper insight into the prediction of fuel poverty by using different datadriven algorithms. These findings may have a number of important implications for the design of future policies tackling energy poverty before reallocating families into a new house, thus providing the basis towards a sustainable development of social housing that considers the economic situation of financially deprived households. Funding: This research has been funded by the "Consejería de Economía, Conocimiento, Empresas y Universidad de la Junta de Andalucía" through a postdoctoral research programme at University of Sevilla.
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.

Acknowledgments:
The authors would also like to acknowledge that this paper is part of the project "Conicyt Fondecyt Regular 1200551-Energy poverty prediction based on social housing architectural design in the central and central-southern zones of Chile: an innovative index to analyze and reduce the risk of energy poverty" funded by the National Agency for Research and Development (ANID).
In addition, we would like to acknowledge to the research group "Confort ambiental y pobreza energética (+CO-PE)" of the University of the Bío-Bío for supporting this research.

Conflicts of Interest:
The authors declare no conflict of interest.

Symbols b
Bias term (support vector regression) C Penalty parameter (support vector regression) d q Minkowski