Life Expectancy at Birth in Europe: An Econometric Approach Based on Random Forests Methodology

The objective of this work is to identify and classify the relative importance of several socioeconomic factors which explain life expectancy at birth in the European Union (EU) countries in the period 2008–2017, paying special attention to greenhouse gas emissions and public environmental expenditures. Methods: The Random Forests methodology was employed, which allows classification of the socioeconomic variables considered in the analysis according to their relative importance to explain health outcomes. Results: Per capita income, the educational level of the population, and the variable AREA (which reflects the subdivision of Europe into four relatively homogeneous areas), followed by the public expenditures on environmental and social protection, are the variables with the highest relevance in explaining life expectancy at birth in Europe over the period 2008–2017. Conclusions: We have identified seven sectors as the main sources of greenhouse gas emissions: Electricity, gas, steam, and air conditioning supply; manufacturing; transportation and storage; agriculture, forestry, and fishing; construction; wholesale and retail trade, repair of motor vehicles and motorcycles; and mining and quarrying. Therefore, any public intervention related to environmental policy should be aimed at these economic sectors. Furthermore, it will be more effective to focus on public programs with higher relevance to the health status of the population, such as environmental and social protection expenditures.


Introduction
The impact of water, soil, noise, and air pollution on health has long been recognized [1]. Specifically, air pollution continues to be one of the main environmental risk factors, with an estimated impact of 5.5 million deaths per year worldwide [2]. Air pollutants differ not only in their chemical composition, reaction properties, and emission, but also in their time of disintegration and their ability to diffuse over long or short distances [3].
However, in this paper, the main greenhouse gases (CO2, N2O, and CH4) will be analyzed instead of air pollutants. Specifically, it is estimated that CO2 emissions represent about three quarters of global greenhouse gas emissions, whereby CO2 is seen as a major contributor to climate change [4]. This explains the steadily growing number of studies on CO2 emissions and climate change [5][6][7][8][9][10].
From a clinical or epidemiological approach, the influence of air pollution on health status has been widely documented by using microdata [11][12][13][14][15][16][17]. Furthermore, air pollution is a socioeconomic factor which is receiving growing interest in studies referring to the determinants of health outcomes from a macroeconomic approach in developed countries [18][19][20]. The major advantage of a macro-level approach is that it allows researchers to analyze health outcomes at a country level (by using state, regional, or local data) and to obtain suitable economic policy implications. For this reason, this work follows the second perspective of analysis.
Traditionally, public health expenditure has been considered as a health resource in the literature on the determinants of the health status of the population from a macro-level perspective [21][22][23][24][25]. Recently, there has been a movement towards including new categories of public expenditure related to health, such as social expenditures, in this type of literature [26], but government environmental expenditure has not been included in the existing literature so far.
Therefore, the objective of this study is to identify and classify the relative importance of the socioeconomic factors in explaining life expectancy at birth in the European Union (EU) countries throughout the period 2008-2017, paying special attention to greenhouse gas emissions and public environmental expenditure. For this purpose, and according to previous studies, some "traditional" socioeconomic indicators are also included in the analysis, such as per capita income and education level of the population, in addition to some components of public budgets, such as health care and social protection expenditures.

Literature Review
Previous literature, starting from Auster et al. [27], has grouped the determinants of health status in developed countries into three categories: socio-economic factors (gross domestic product, per capita income, education, income distribution, unemployment and poverty, among others), health care resources (total, public, or private health care expenditure, expenditure on pharmaceutical products, doctors, nurses, and hospital beds), and lifestyle-related factors (such as consumption of alcohol and tobacco, and some proxies for diet such as consumption of sugar, calories, and vegetables) [28,29].
Focused on socioeconomic factors, these preliminary studies, conducted at a macro level and referring to developed countries, have also included pollution, generally represented by emissions of polluting substances. Garbaccio et al. [56] concentrated on PM10 and SO2 to estimate the local effect of their emissions on health outcomes in China in 1992. In [21], NOx emissions per capita were selected as an approximation of air pollution in order to find the determinants of premature mortality in 21 OECD countries between 1970 and 1992. The authors of [22] also considered NOx emissions per capita to explore the effect of variations in the volume of health care and in certain characteristics of health systems on mortality across 21 OECD countries for the period 1970-1995 after controlling for certain other determinants of health status, such as the education level of the population. Nixon and Ullman [18] analyzed the relationship between total health care expenditure and health outcomes in European countries over the period 1980-1995 by using pollution and other factors affecting health outcomes. Joumard et al. [28] employed a panel of 23 OECD countries over the period 1981-2003 to assess the impact of health care resources on the health status of the population once due account is taken of other determinants of population health status, such as NOx emissions. Mariani et al. [50] studied the two-way causality between life expectancy at birth and an environmental quality indicator and the resulting dynamic implications over a sample of 132 countries with different levels of development in 2006. Halicioglu [57] studied the determinants of life expectancy in Turkey for the period 1965-2005 by selecting social, economic, and environmental factors such as urbanization, considered as a proxy of pollution. De Keijzer et al. [58] investigated the association of exposure to air pollution (PM10, PM2.5, NO2, and O3) and greenness with mortality and life expectancy in Spain from an ecological perspective for the period 2009-2013. Jiang et al. [20] selected SO2 emissions to examine the effects of healthcare, education, environment, and social harmony development on life expectancy in 31 Chinese provinces between 2000 and 2010. Cheng et al. [59] analyzed the response of air pollutant emissions to climate change and the potential effects of these emissions on human health in China during 1970-2010.
However, other studies prefer to analyze the effect of greenhouse gas emissions, mainly CO2, on health outcomes. Thus, Monsef and Mehrjardi [19] investigated the factors affecting life expectancy in 136 (developed and developing) countries for the period 2002-2010, focusing on CO2 emissions as an environmental factor. Jorgenson and Givens [60] compared CO2 emissions with the changes in average life expectancy from 1990 to 2008 for OECD and non-OECD countries. Mohmmed et al. [61] studied the drivers and variations of CO2 emissions by using data from the top 10 emitting countries (China, USA, India, Russian Federation, Japan, Germany, South Korea, Iran, Canada, and Saudi Arabia), as well as the impact of CO2 emissions on healthy life expectancy.
The current study refers to EU countries, due to the scarce available literature on the socioeconomic determinants of health outcomes in this group of countries. Unlike previous research, this study aims not only to examine the relationship between some socioeconomic indicators (such as greenhouse gas emissions) and life expectancy at birth, but also to classify these variables according to their relative importance as classification criteria. For this purpose, a Random-Forests-based methodology has been employed.

Variables
As indicated in the Introduction, the aim of this paper is to determine the relative importance of five socioeconomic factors and two other ad hoc variables in explaining life expectancy at birth in 28 EU countries for the period 2008-2017. In the following subsections, we describe the variables used in this paper and the reasons why we consider that they could explain life expectancy at birth.

Socio-Economic Variables
Four socio-economic variables (per capita income, the educational level of the population, public social protection expenditure, and public health care expenditure) were selected because, according to the specialized literature on this topic [18,41,42,55], they are related to health status. Observe that the third and fourth variables are two categories of public expenditures. However, to the best of our knowledge, environmental protection expenditure has never been considered in this type of analysis. This justifies its inclusion as the fifth socio-economic variable used in this manuscript.
For the sake of clarity, these socioeconomic variables have been defined (and numbered from 1 to 5) in Table 1.
Table 1. Variable definitions.
1. Gross National Income per capita of population at current prices (in thousands of Euro).
2. Percentage of population with upper secondary, post-secondary non-tertiary, and tertiary education: Levels 3-8.
3. General government expenditure on environmental protection (percentage of Gross Domestic Product, GDP).
4. General government expenditure on social protection (percentage of GDP).
5. General government expenditure on health (percentage of GDP).

Ad hoc Variables
Two other ad hoc variables were considered in this paper, viz. the total greenhouse gas emissions (denoted by GHG) and the area (represented by AREA) in which the analyzed country is included, according to the four European areas predefined following the spatial taxonomy performed by Ustaoglu and Williams [62]. In the following paragraphs of this subsection, we will justify the inclusion of these two variables.
Greenhouse gas emissions have been considered by the existing literature as an environmental determinant of health status [19,60,61]. With respect to the greenhouse gas emissions data in the EU, we have to take into account that not all productive sectors in the European economy are equally polluting [63]. Therefore, in order to quantify the greenhouse gas emissions in the EU, we have taken into account the different incidence of emissions from each of the productive sectors. This could be of relevance for implementing specific public policies oriented to restrict gas emissions. To do this, we employed Eurostat data on three of the main greenhouse gases (CO2, N2O, and CH4, all of them expressed in kilograms per capita), as indicated by [64], for the 21 economic sectors in the 28 EU member states.
More specifically, the procedure used to determine the variable GHG corresponding to each country was the following:
1. The subscript $i = 1, 2, 3$ represents the greenhouse gases included in the analysis: CO2, N2O, and CH4.
2. The superscript $j = 1, 2, \ldots, 21$ represents the economic sectors included in the analysis.
3. $\mathrm{AVERAGE}_i^j$ denotes the annual average of the greenhouse gas $i$ emitted by the economic sector $j$.
4. $\sum_{j=1}^{21} \sum_{i=1}^{3} \mathrm{AVERAGE}_i^j$ represents the aggregate average of the 3 greenhouse gases emitted by the 21 economic sectors.
5. $k_j = \dfrac{\sum_{i=1}^{3} \mathrm{AVERAGE}_i^j}{\sum_{j=1}^{21} \sum_{i=1}^{3} \mathrm{AVERAGE}_i^j}$ denotes the percentage of greenhouse gas emissions attributable to sector $j$ with respect to the average sum.
6. $\mathrm{GHG} = \sum_{k_j \geq 0.01} k_j$ reflects the total percentage of greenhouse gas emitted by sectors whose $k_j$ are greater than 1%. A justification of this choice will be displayed in Section 3.2.
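The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sector names and emission values are hypothetical (they are not Eurostat figures), three sectors stand in for the 21, and the function names `sector_shares` and `ghg` are introduced here for convenience.

```python
# Hypothetical per-capita average emissions (kg) by sector and gas.
# The values are illustrative only and are not taken from Eurostat.
averages = {
    "energy supply": {"CO2": 2500.0, "N2O": 1.0, "CH4": 10.0},
    "manufacturing": {"CO2": 1500.0, "N2O": 0.5, "CH4": 5.0},
    "education": {"CO2": 20.0, "N2O": 0.0, "CH4": 0.1},
}


def sector_shares(averages):
    """Return k_j: each sector's share of the aggregate average over all gases."""
    sector_totals = {j: sum(gases.values()) for j, gases in averages.items()}
    grand_total = sum(sector_totals.values())
    return {j: total / grand_total for j, total in sector_totals.items()}


def ghg(averages, threshold=0.01):
    """Sum the shares k_j of the sectors whose share is at least `threshold`."""
    return sum(k for k in sector_shares(averages).values() if k >= threshold)


shares = sector_shares(averages)
ghg_value = ghg(averages)
```

With these toy numbers, the "education" sector falls below the 1% threshold and is excluded, so `ghg_value` is the combined share of the two remaining sectors.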
On the other hand, the variable AREA is defined according to the categorization proposed by Ustaoglu and Williams [62], inspired by UNO (United Nations Organization) [65]. The importance of this variable lies in the fact that it follows not a strictly geographical [64] but a socio-political criterion. Specifically, these areas are:
Therefore, the categorical variable AREA is a "collecting variable" which could be related to previous works which have analyzed life expectancy at birth from very different points of view: from a genetic point of view [66], or by relating it to the degree of sun exposure [67], to the predominant traditional type of diet [68], to religiosity (according to the practice of a certain religious confession [69] or the frequency with which it is practiced, including atheism [70,71]), or to the degree of perception of "health equality" [72].

Other Potential Variables
We would have liked to consider in this study other variables corresponding to lifestyle factors, such as smoking, obesity, and drinking, but unfortunately their respective data are not available for the 28 EU countries during the period 2008-2017.

Data
The socio-economic data used in this paper were obtained from Eurostat for the 28 countries that are currently members of the EU throughout the period 2008-2017. The main descriptive statistics have been compiled in Table 2. The inclusion of the greenhouse gas emissions in our study is justified by taking into account that this factor undoubtedly contributes to climate change [73].
On the other hand, the average emission by economic sector and greenhouse gas was calculated by using an initial database of 1764 observations. Subsequently, an average of the overall greenhouse gas emissions was calculated for each sector in Europe, together with its share of the total emission amount (6804.91 kg per capita). This is represented in Figure 1.
It can be observed that, of the 21 economic sectors analyzed, only seven have a share greater than 1% of the total average, and together they account for 94.38% of the greenhouse gas emissions. These seven sectors are: (1) Electricity, gas, steam, and air conditioning supply, (2) manufacturing, (3) transportation and storage, (4) agriculture, forestry and fishing, (5) construction, (6) wholesale and retail trade, repair of motor vehicles and motorcycles, and (7) mining and quarrying. This justifies defining the variable GHG as the sum of average emissions of CO2, N2O, and CH4 coming from these sectors (see item #6, Section 3.1.2). It is worth highlighting the predominance of CO2 in the total average of the period (99.41%) versus the low levels of CH4 and N2O (0.56% and 0.03%, respectively).

Random Forests Methodology
In this paper, we used the Random Forests methodology by Breiman [74,75] (or Breiman-Cutler's algorithm) due to its versatility. This model is especially suited to analyze the incidence or response, in terms of "relative importance", of a certain number of variables on another variable. This methodology, an extension of the "bagging" method [76], involves the iterative generation and selection of n random trees and builds, among others, on the observations by Ho [77,78], who determined that, as more decision trees were randomly added to a previously created set, the predictive performance of the final model improved (in most cases monotonically). At each split, the Random Forests methodology randomly draws $m_i$ of the $p$ available predictors, so that, on average, a fraction $(p - m_i)/p$ of the splits does not consider a given predictor (bagging corresponds to the special case $m_i = p$). This characteristic avoids the preponderance of a certain predictor or variable, reducing the correlation between the trees.
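As a worked illustration of the $(p - m_i)/p$ term, assume that the $m_i$ candidate predictors at each split are drawn without replacement from the $p$ available ones (an assumption about the sampling scheme, not a statement taken from the paper). The probability that a given predictor is left out of the candidate set of a split is then:

```latex
P(\text{predictor not among the candidates})
  = \frac{\binom{p-1}{m_i}}{\binom{p}{m_i}}
  = \frac{p - m_i}{p},
\qquad
\text{e.g. } p = 7,\ m_i = 6:\quad \frac{7 - 6}{7} = \frac{1}{7} \approx 0.143 .
```

The values $p = 7$ and $m_i = 6$ are taken here as a reading of the seven explanatory variables and the six variables per split reported for the model in this paper; with $m_i = p$ the fraction is zero and every split may use every predictor, as in bagging.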
The algorithm of Random Forest, derived from bagging predictors [76], can be described in the following way. We start from a p-dimensional random vector X = (X 1 , X 2 , . . . , X p ) T which represents the predictor variables and a random variable Y (called the real-valued response). An unknown joint distribution, denoted by P XY (X, Y), is assumed.
The aim of this algorithm is to find a prediction function f (X) for predicting Y. To do this, we minimize the expected value E XY (L(Y, f (X))), where L(Y, f (X)) is the so-called loss function which is an indicator of closeness between f (X) and Y.
In this case, by denoting Ψ the set of possible values of Y, the minimization of $E_{XY}(L(Y, f(X)))$ under zero-one loss gives
$$f(x) = \arg\max_{y \in \Psi} P(Y = y \mid X = x).$$
For the k-th tree (k = 1, 2, . . . , K), we generate a random vector $\Theta_k$, independent of the former random vectors $\Theta_1, \Theta_2, \ldots, \Theta_{k-1}$ but identically distributed, which results in a classifier, denoted by $h(x, \Theta_k)$, where x is an input vector. Thus, f(x) is the most frequently predicted class:
$$f(x) = \arg\max_{y \in \Psi} \sum_{k=1}^{K} I\big(y = h(x, \Theta_k)\big), \qquad (1)$$
where $I(\cdot)$ is the indicator function.
In a first step, the variable LE (life expectancy at birth) was categorized according to the following dichotomous specification, in which the increases or decreases in life expectancy in the 28 European countries included in the database are taken into account:
$$LE = \begin{cases} \text{Yes} & \text{if life expectancy at birth increases} \\ \text{No} & \text{if life expectancy at birth decreases} \end{cases}$$
Table 3 displays the results derived from the constructed model. It collects the main characteristics of the model, following the specifications suggested by [79], and uses the area under a Receiver Operating Characteristic (ROC) curve [80] as a measure of its reliability. Finally, in the implementation of the Random Forests model, 196 observations and six variables in each of the sub-divisions ("splits") were used, obtaining 10,000 random decision trees.
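The majority-vote rule for the most frequently predicted class can be illustrated with a minimal sketch. It assumes three hypothetical one-split trees with arbitrary thresholds; the helper names `make_stump` and `forest_predict` are introduced here for illustration and are not part of the authors' implementation.

```python
from collections import Counter


def make_stump(feature, threshold):
    """Return a hypothetical one-split classifier playing the role of h(x, theta_k)."""
    def h(x):
        return "Yes" if x[feature] > threshold else "No"
    return h


def forest_predict(trees, x):
    """Return the class that receives the largest number of tree votes."""
    votes = Counter(h(x) for h in trees)
    return votes.most_common(1)[0][0]


# Three toy trees; the features and thresholds are arbitrary illustrations.
trees = [
    make_stump("INCO", 20.0),
    make_stump("LEDU", 70.0),
    make_stump("INCO", 30.0),
]
x = {"INCO": 25.0, "LEDU": 75.0}
prediction = forest_predict(trees, x)  # two "Yes" votes against one "No"
```

An odd number of trees is used so that a two-class vote cannot tie; with an even number, a tie-breaking rule would have to be chosen.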

Results
With respect to the performance evaluation, the error rate ("out-of-bag", denoted by OOB) was estimated, based on those observations which have not been included in the "bag", that is, the subset of the initial set of learning data used in the iterative construction of each decision tree. Starting from this "unbiased" estimated error, it can be explained that, once the model has been constructed and new observations have been iteratively incorporated, the responses will lead to this residual error. Taking into account the generated model, the estimated OOB error is very low (1.02%). So, it is evident that this model is suitable to explain the behavior of life expectancy at birth, given that its accuracy is very high (98.98%).
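To make the notion of the "bag" concrete, the following sketch draws one bootstrap sample of the 196 learning observations and collects the observations left out of it. It is a simplified illustration under stated assumptions (a single tree, an arbitrary seed, and a hypothetical helper named `bootstrap_split`), not the authors' code.

```python
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement; return (in-bag list, out-of-bag list)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in drawn]
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
n = 196                 # number of learning observations reported in the paper
in_bag, out_of_bag = bootstrap_split(n, rng)

oob_fraction = len(out_of_bag) / n
# Probability that a given observation is never drawn in n draws.
expected_fraction = (1 - 1 / n) ** n
```

For each tree, roughly a third of the observations end up out of the bag, and the OOB error is estimated by classifying each observation only with the trees that did not see it during training.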
Another proof of the degree of reliability of this model can be detected in Figure 2, since usually the increase in the number of random trees is associated with a growing decrease in the error rate (OOB). However, this particular model reaches a minimum error rate which is constant from a small number of random trees (around 300).
Additionally, the confusion matrix (or error matrix) provides the level of agreement-disagreement between the predictions obtained by the implemented model and the data generated by the learning observations, in this case 196. In other words, it reflects the actual results vs. the predicted ones by calculating the true and false positive cases. With respect to the generated model, there is almost a coincidence between the joint value predicted by the model and the learning data: Both predict "Yes" (increase in life expectancy at birth) in 135 cases and "No" (decrease in life expectancy at birth) in 59 cases, showing a very low level of disparity or disagreement (no-yes = 1.667% and yes-no = 0.735%). Likewise, it should be noted that the reliability of the generated model is also high according to the procedure of the area under an ROC (Receiver Operating Characteristic) curve once the DeLong test [80] has been implemented (with a result very close to 1).
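The percentages quoted above can be checked with a short calculation. The sketch assumes a confusion matrix with one misclassified observation in each class; these off-diagonal counts are inferred from the reported figures (135 and 59 agreements out of 196 observations) and are not stated explicitly in the text.

```python
# Keys are (actual class, predicted class). The off-diagonal counts of 1 are
# an assumption consistent with the reported 0.735% and 1.667% class errors.
confusion = {
    ("Yes", "Yes"): 135,
    ("Yes", "No"): 1,
    ("No", "Yes"): 1,
    ("No", "No"): 59,
}


def class_error(confusion, label):
    """Share of observations with actual class `label` that were misclassified."""
    row = {pred: n for (actual, pred), n in confusion.items() if actual == label}
    return 1 - row[label] / sum(row.values())


total = sum(confusion.values())
correct = sum(n for (actual, pred), n in confusion.items() if actual == pred)
overall_error = 1 - correct / total  # share of misclassified observations
accuracy = correct / total
```

Under this assumption the overall error is 2/196, which matches the OOB error of 1.02% and the accuracy of 98.98% reported for the model, and the class errors are 1/136 and 1/60.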
Finally, we will analyze the relative importance of variables. In effect, the increments (LE = "Yes") or decreases (LE = "No") are represented in Figure 3, jointly associated with the rest of the variables included in the Random Forests model. More specifically, the mean decrease Gini and mean decrease accuracy measures indicate the order of relative importance of each variable in relation to LE. With respect to the first measure (mean decrease Gini), only the variables INCO (per capita income) and AREA would be representative, whilst the rest of variables only reach values denoting a very low relative importance. On the contrary, the mean decrease in the accuracy measure provides the following order regarding its relative representativeness: INCO, LEDU (educational level), AREA, ENVIRO (environmental protection), SOPRO (social protection), GHG, and HEALTH.
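The idea behind the mean decrease in accuracy can be sketched as a permutation test: the values of one variable are shuffled and the resulting loss of accuracy is recorded. The example below is a simplified illustration under stated assumptions. It uses a hypothetical fixed classifier and toy data, and it permutes the whole data set once, whereas a Random Forests implementation would average this quantity over the out-of-bag observations of each tree.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, feature, rng):
    """Accuracy lost when the values of `feature` are shuffled across rows."""
    baseline = accuracy(model, rows, labels)
    shuffled = [r[feature] for r in rows]
    rng.shuffle(shuffled)
    permuted = [{**r, feature: v} for r, v in zip(rows, shuffled)]
    return baseline - accuracy(model, permuted, labels)


def model(row):
    """Hypothetical fitted classifier: it looks at INCO only."""
    return "Yes" if row["INCO"] > 20.0 else "No"


# Toy data in which the label depends on INCO and not on HEALTH.
rows = [{"INCO": float(10 + 2 * i), "HEALTH": float(i % 3)} for i in range(12)]
labels = [model(r) for r in rows]

rng = random.Random(1)  # arbitrary seed for reproducibility
importance = {
    feature: permutation_importance(model, rows, labels, feature, rng)
    for feature in ("INCO", "HEALTH")
}
```

Shuffling a variable that the classifier ignores (HEALTH in this toy case) leaves the accuracy unchanged, so its importance is exactly zero, while shuffling a variable the classifier relies on can only lower the accuracy.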

Discussion
According to the obtained results, per capita income, the educational level of the population, and the variable AREA (representative of belonging to a specific group of countries), followed by the public expenditure on environmental and social protection, are the variables with the highest relevance in explaining life expectancy at birth in Europe over the period 2008-2017. On the other hand, the least important variables are greenhouse gas emissions and public healthcare expenditure.
With respect to per capita income, our findings are consistent with previous studies about the major determinant of the population health status. Higher per capita income (or Gross Domestic Product) is associated with a healthier population and greater longevity [28,81,82]. In addition, our findings suggest that the educational level of a population has a strong influence in determining health outcomes, as confirmed by previous empirical studies [41,42,83]. Belonging to a specific group of countries, as indicated by Álvarez-Gálvez et al. [84], is another factor with high influence on health outcomes.
This work shows a lower relative importance of public environmental expenditure for health status. To the best of our knowledge, this is the first time that this type of public expenditure has been included in the analysis of the determinants of health outcomes. Consequently, more research is needed on this topic. In contrast, the impact of government intervention in the field of environmental protection through fiscal policy has received more attention [85]. Governments often transform environmental costs into internal costs for enterprises through taxes. However, the impact of fiscal expenditures on environmental pollution is uncertain, because this instrument reduces environmental pollution through structural and substitution effects, whilst increasing it through the growth effect [85].
Additional components of the government budget included in this study are social protection and health care expenditures. Public programs of social protection register more relative importance than public health care expenditure in explaining life expectancy at birth. The positive influence of social expenditure on health outcomes has been identified previously by Stuckler et al. [86], Bradley et al. [40,87], Vavken et al. [88], and McCullough and Leider [89]. On the other hand, health expenditure financed by the government is in the last position in terms of relative importance. Although some works reveal a positive significant contribution of this type of public expenditure on health outcomes [22,24,25,54], there is no consensus about its final impact [23,46,48]. Moreover, Kim and Wang [90] recently showed that the quality of government (measured by corruption control, government effectiveness, regulatory quality, voice and accountability, and rule of law) has a greater impact on public health than the quantity of government (measured by public expenditure on health).
The available evidence on the impact of greenhouse gas emissions on life expectancy from a macro approach is scarce and less conclusive. For this reason, this work has included this factor in this type of research by using macro data. Our results indicate that greenhouse gas emissions are one of the least important socioeconomic factors related to life expectancy at birth; this does not mean that these types of gases have no effect on health status, but that this item has less relevance than other factors such as per capita income and education. Previous research has shown no significant relationship between greenhouse gas emissions and health outcomes [19]. On the contrary, Mohmmed et al. [61] show, for the top ten CO2-emitting countries, a strong relationship between some sectors of CO2 emission and healthy life expectancy which, according to these authors, indicates that an increase in the former factor will lead to a corresponding increase in the latter variable.

Conclusions
To the best of our knowledge, the analysis of greenhouse gas emissions in different economic sectors as a determinant of the health status of a population has not been extensively summarized and discussed. In order to overcome this limitation, this paper employed disaggregated data on the main greenhouse gas emissions (CO2, N2O, and CH4) by economic sector. In this way, we identified seven sectors as the main sources of this type of emission: Electricity, gas, steam, and air conditioning supply; manufacturing; transportation and storage; agriculture, forestry, and fishing; construction; wholesale and retail trade, repair of motor vehicles and motorcycles; and mining and quarrying. Therefore, any public intervention related to environmental policy should be aimed at these economic sectors. At this point, our findings confirm part of Gao et al.'s [8] results, since they consider energy generation, transport, food and agriculture, household, and industry as the most GHG-emitting sectors.
This paper contributes to the existing literature in four ways: First, it extends previous literature about the socioeconomic determinants of health focused on European countries. Second, existing studies consider greenhouse gas emissions from the perspective of the economy as a whole; however, in this study, data on these types of gases are disaggregated by economic sector. Third, there is scarce evidence so far demonstrating the extent to which government expenditure on environmental protection contributes to the improvement of a population's health. Consequently, one of the main contributions of this study is the inclusion of this public expenditure in the empirical analysis. Fourth, we employ a Random Forests methodology which, to the best of our knowledge, has not been applied before to study the variables related to health outcomes in Europe. As an advantage, this method provides a classification of the socioeconomic factors according to their relative importance to explain life expectancy at birth.
However, a limitation of this study is that lifestyle factors such as smoking, obesity, and drinking, among others, have not been considered, because their respective data are not available for the European countries during a consistent period. So, future research should incorporate this type of information if data is available.
In addition, based on the results of this Random Forests analysis, policy makers should concentrate efforts on improving per capita income and education. A novel contribution of this research is to show that it is crucial to evaluate the composition of the public budget, because certain public expenditure functions have been found to be more important than others in explaining life expectancy at birth in European countries. Consequently, it will be more effective to focus on public programs with a higher relevance to the health status of a population, such as environmental and social protection expenditures. In order to work towards the target of improving or maintaining population health, joint actions and collaborations from different governmental departments are required at a national level.