1. Introduction
Modern cities currently face numerous issues arising from the increasing number of people living in them, environmental impacts, limited resources, and the intricate dynamics of social and economic factors. The rising population density strains infrastructure, resulting in traffic jams, strained public services, and heightened pollution levels. Climate change further threatens cities’ resilience as increasing temperatures impact sustainability and liveability. Moreover, socio-economic gaps worsen disparities in accessing education, healthcare, and job opportunities. In addressing these challenges, smart cities are transitioning towards a strategy that utilises technology- and data-driven solutions to enhance efficiency, sustainability, and overall well-being. As countries worldwide embark on urbanisation endeavours, the influence of these advancements on citizens’ quality of life gains significance. This study aims to investigate the correlation between socio-economic and environmental factors and life expectancy at a European Union (EU) scale.
From the urban science perspective, cities can be perceived through various lenses, including spatial, socio-economic, and environmental dimensions. Spatially, different methodologies and approaches can be used to delineate city boundaries. Core-based statistical areas, urban areas, or administrative boundaries can lead to different city characteristics and spatial distributions [
1]. However, Sahasranaman and Bettencourt [
2] and Lobo et al. [
3] suggest that cities are not only geographical entities but complex systems that facilitate economic and social interactions. With growing recognition of environmental challenges such as the consequences of climate change or limited natural resources, ecological footprints and environmental sustainability aspects have become integral components in defining cities. Dias et al. [
4] emphasise the holistic approach that seeks a balance between economic development, quality of the environment, and citizens’ quality of life. Ensuring the efficient and sustainable management of resources and consideration of environmental and social aspects in decision-making processes is necessary to incorporate these aspects into urban governance [
5]. In order to enhance multifaceted urban sustainability and improve citizens’ quality of life, many cities are turning to the concept of smart cities.
Smart city programmes and initiatives incorporate information and communication technologies (ICT) to improve energy efficiency and the use of renewable energy, optimise transportation systems, enhance recycling and circularity, and improve public services. These efforts not only enhance efficiency in using resources and supporting environmental sustainability but also encourage inclusivity, incentivise citizen involvement, and strengthen overall city resilience. However, the relationship between smart cities’ development and public health on a national scale is an aspect that warrants further attention from decision-makers, researchers, and practitioners. As a result, global organisations like the United Nations (UN) and the European Union (EU) have stressed the importance of plans that focus on promoting health and quality of life within the larger framework of sustainable development.
The global importance of resilient and sustainable urban environments and inclusive communities is highlighted in the UN’s key documents and initiatives, such as Agenda 2030 and the Sustainable Development Goals, where SDG 11 in particular addresses this issue. This area is of utmost importance at the EU level. For instance, the EU supports smart, green, and integrated urban development through the EU Smart Cities programme, under Horizon 2020 [
6]. The role of digital transformation as a vehicle for improving the lives of citizens is further underlined in the EU’s Digital Agenda [
7]. Moreover, the transformative potential of ICT in creating more sustainable and efficient cities is recognised in the European Commission’s Smart Cities and Communities initiative [
8], which enhances and supports the overall objectives of the European Urban Agenda.
This study aligns well with the international commitments set by the UN and the EU to leverage technological advances to improve public health and well-being. As the UN emphasises, sustainable development can only be achieved when there is a balance between the ways we develop and manage urban areas [
9].
The cross-national focus of this study resonates with the wider global community aiming towards the SDGs. As the UN points out, “cities are hubs for ideas, commerce, social development and more” [
10]. In addition, smart cities have the potential to improve the living conditions of their citizens by helping to reduce emissions, increase efficiency, promote sustainable economic growth, and improve overall urban life [
11,
12]. The aim of this study is to explore the relationships among a diverse set of variables, such as GDP, CO2 emissions, renewable energy use, education, recycling rates, and transport-related emissions, and to examine how they relate to life expectancy in the European Union. While our research focuses on city issues, we opted to use national data from EU member states because reliable and comparable city-level data are unavailable. This approach allows us to explore wider patterns that may reflect urban dynamics at a national scale. Given the complexity and heterogeneity of cities within EU countries, national-level analysis helps us understand the overall patterns and identify commonalities and disparities in urban issues across EU countries. We employ regression clustering analysis to segment EU countries into clusters based on common characteristics, which allows us to account for regional specificities and heterogeneities. This methodology is expected not only to enhance the accuracy of our findings but also to enable more tailored and effective urban management recommendations that reflect the distinct contexts of different EU member states.
The rest of the paper is structured as follows: Section 2 reviews the current literature and provides a rationale for the choice of variables. In Section 3, we formulate our research hypothesis and present the data. Results are summarized in Section 4 and discussed in Section 5. The final section presents the conclusions and outlines the policy implications. The paper aims to extend academic knowledge and to contribute to the ongoing international dialogue on urban development and its impact on the health and well-being of populations under different country-specific conditions.
2. Literature Review and Rationale for Selecting Variables
As cities become centrepieces of human development, smart city initiatives focused on improving the urban living conditions and quality of life of citizens have gained increasing attention in recent years. In this context, life expectancy has become a critical indicator of the population’s well-being. The literature consistently supports the assertion that life expectancy is influenced by various interconnected socio-economic, environmental, and health factors that collectively shape the quality of life within society.
Socio-economic factors, such as income and education, play a crucial role in determining life expectancy. Studies by Gilligan & Skrepnek [
13] and Miladinov [
14] suggest that higher income levels correlate positively with increased life expectancy, as wealthier individuals typically have better access to healthcare, nutrition, and living conditions. Research has also indicated that greater economic prosperity can result in increased investments in environmental protection and programs aimed at improving public health [
15,
16]. Education also plays a critical role as individuals with higher educational attainment tend to engage in healthier lifestyles and have better access to health resources, which in turn enhances their longevity [
17,
18]. Additionally, education is a strong predictor of health literacy, enabling individuals to make informed health choices that can enhance their longevity [
19]. Furthermore, social policies that allocate resources towards education and healthcare have been linked to improved life expectancy outcomes, particularly in high-income countries [
20]. Higher educational attainment is also associated with higher income and better employment opportunities, which can consequently lead to improved health and well-being [
21]. On the other hand, the disparities in life expectancy among different socio-economic groups highlight the importance of addressing social determinants of health to improve overall population well-being [
19].
CO2 emissions are a primary contributor to climate change and are linked to negative health impacts, which can in turn reduce life expectancy. Higher CO2 emissions and long-term exposure to pollutants are associated with respiratory diseases and other negative health effects [22,23], which further reduce life expectancy. For example, research by Chang [24] indicates that the negative impact of CO2 emissions on life expectancy is particularly pronounced in countries where industrial activities and urbanization contribute significantly to air quality deterioration. Smart city initiatives such as green mobility [25,26] and the use of renewable energy [27,28] have the potential to reduce these emissions and hence the related health risks [
29]. Renewables as a percentage of equivalent primary energy consumption is an indicator representing a green energy transition, reducing dependence on fossil fuels and mitigating the effects of climate change [
30,
31]. Fossil fuel combustion is a main source of air pollution, emitting harmful pollutants such as particulate matter (PM2.5 and PM10), sulphur dioxide (SO2), nitrogen oxides (NOx), and volatile organic compounds (VOCs). These pollutants can negatively affect respiratory health, the cardiovascular system, and general well-being [
32,
33]. In contrast, renewable energy sources do not emit pollutants during energy production and thus significantly reduce air pollution levels in urban environments [
28,
34]. Studies have shown the health co-benefits of using renewable energy, including reducing air pollution-related diseases and premature mortality [
35]. Similar findings were reported by Zhang et al. [32], who investigated the impact of foreign direct investment inflow, carbon emissions, and renewables on health quality in China. Their results suggest that CO2 emissions harm health quality, whereas renewable energy consumption has a significant positive impact. The study showed that higher shares of renewable energy tend to be associated with better health outcomes and longer life expectancies due to reduced exposure to harmful pollutants associated with fossil fuel combustion.
Another critical factor influencing life expectancy is urbanization. As urban areas are often characterized by higher population density and higher pollution levels, this tends to cause lifestyle-related health risks [
36,
37]. In particular, rapid urbanization and increased population density have been linked to high levels of pollution, inadequate infrastructure, and limited access to health services [
38,
39]. However, urbanization also promotes economic growth and innovation [
40]. In this respect, urbanization can provide opportunities for improved healthcare access and social services, enhancing life expectancy if managed effectively. Highlighting the dual nature of urbanization, Zeng et al. [
38] call for sustainable urban planning that prioritizes health and well-being, such as creating green spaces and improving public transportation to reduce emissions.
Finally, transport-related emissions, particularly from fossil-fuelled vehicles, are a significant source of air pollution in urban areas [
41]. These emissions release harmful pollutants such as NOx, PM, and VOCs [
25,
42]. Exposure to emissions has been linked to a range of adverse health outcomes, including aggravation of asthma, lung cancer, and cardiovascular disease. Vulnerable populations are particularly susceptible to the health effects of transport-related pollution [
43]. Health issues related to high concentrations and long exposure to these pollutants can reduce life expectancy [
44].
The concept of smart cities concentrates on a wide spectrum of aspects. For instance, Abu-Rayash and Dincer [
12] address smart energy systems. They propose a model for evaluating smart cities based on a complex system of smart city characteristics. The study, in particular, highlights the positive role of smart energy systems and their correlation with economic performance and smart city governance. Similarly, Ji et al. [
45] performed a case study among citizens in Taiwan, investigating their perceptions and preferences for smart city services. The study underlines the need to align technological advances with the needs of residents. From citizens’ perspectives, smart energy systems and smart mobility were the highly preferred areas, with the potential to improve the quality of life and well-being of citizens. This perspective is complemented by Macke et al. [
46], who examine citizens’ perceptions of quality of life in smart cities and identify the factors that influence these perceptions. The study advocates an integrated approach that combines both objective and subjective aspects of quality of life, and it provides valuable insight for policymakers and urban planners concerned with improving the living conditions of citizens in smart cities.
The social and cultural aspects of smart cities are explored by Han and Hawken [
47], who highlight the importance of a more holistic approach and the necessity to take cultural nuances, social identity and human behaviour into consideration when designing smart cities. They criticize the common metrics for assessing smart cities and advocate for balancing urban innovations and technology integration with cities’ unique historical and cultural contexts and the needs of citizens.
The ample evidence from the literature provides significant insights into the multifaceted nature of smart cities, particularly regarding sustainability and quality of life. However, there remains a critical gap in understanding the specific mechanisms through which smart city characteristics influence life expectancy, an important indicator of health outcomes. While previous studies have often focused on single aspects such as city governance, environmental sustainability, energy issues and urban liveability, they have not fully explored the multidimensional relationships between smart city variables and life expectancy. Also, although some research has linked environmental factors like CO2 emissions and renewable energy with health impacts, the interplay between these and other socio-economic factors, such as urbanization, education, and economic growth, remains underexplored.
Focusing on a comprehensive set of variables derived from theoretical frameworks and empirical studies, this research seeks to address these gaps and contribute to a deeper understanding of the complex interplay between smart city indicators and public health outcomes. The novelty of this study lies in its holistic approach to examining the relationship between smart city characteristics and life expectancy. Through rigorous analysis and interpretation of national-level data, we aim to provide actionable insights for policymakers, urban planners and public health professionals aiming to create healthier and more resilient cities.
These variables were selected as input for our analysis to help us better understand complex relations between smart city indicators and life expectancy and identify key drivers of public health in a smart city context. Each variable represents a different aspect of urban development, environmental sustainability or socio-economic well-being, with the potential to influence life expectancy through different pathways.
By examining these smart city indicators in conjunction with life expectancy, this research aims to elucidate how urbanisation, economic activities, environmental quality and socio-economic inequalities jointly shape public health outcomes.
4. Data and Methods
Using annual data from 2000 to 2021, we examine the impact of various smart city indicators, including GDP per capita in PPP, annual CO2 emissions per capita, share of urban population, renewable energy (% primary energy equivalent), tertiary education attainment, municipal waste recycling rate and transportation emissions, on life expectancy at birth. The analysis focuses on twenty-five European countries, namely Austria (AUT), Belgium (BEL), Bulgaria (BGR), Croatia (HRV), the Czech Republic (CZE), Denmark (DNK), Estonia (EST), France (FRA), Finland (FIN), Germany (GER), Greece (GRC), Hungary (HUN), Ireland (IRL), Italy (ITA), Lithuania (LTU), Latvia (LVA), Luxembourg (LUX), the Netherlands (NLD), Poland (POL), Portugal (PRT), Romania (ROU), Slovakia (SVK), Slovenia (SVN), Spain (ESP) and Sweden (SWE).
Based on the literature review, the eight variables were selected as input into the econometric models, as shown in
Table 1. The models are developed using Stata 15.1.
Table 2 shows descriptive statistics. The average CO2 emissions per capita are 8.07 tons, with a standard deviation of 3.57 tons, indicating relatively high variability in CO2 emissions between countries. The wide range among countries indicates different levels of industrial activity, compositions of energy mixes and policy regulations. Life expectancy (LIFE) ranges from a minimum of 69.5 years in Romania (ROU) to a maximum of 83.55 years in Italy (ITA). This 14-year difference reflects possible variations in healthcare quality, socio-economic and environmental conditions, as well as lifestyle factors.
Table 3 shows the correlation matrix. Most of the correlation coefficients are statistically significant. Zooming in on life expectancy, we observe a strong positive correlation between GDP and life expectancy, suggesting that economic welfare positively correlates with longer life expectancy. Similarly, a strong positive correlation exists between recycling rate and life expectancy, suggesting positive impacts of efficient waste management and cleaner environment. A positive correlation can also be observed in the case of urban population rate and education. In contrast, there is a negative correlation between life expectancy and transport-related emissions.
First, a multiple regression analysis is performed with life expectancy at birth (LIFE) as the dependent variable and all other variables as input variables.
The resulting regression coefficients are shown in
Table 4. This analytical approach allows us to comprehensively examine the relationships between LIFE and the different input variables. By assessing the magnitude and direction of these coefficients, the study gains insight into how factors such as GDP, urbanisation, CO2 emissions, education and renewable energy influence LIFE in the context of smart cities.
The coefficient of determination (R-squared = 0.693) indicates that about 69.3% of the variability of the dependent variable can be explained by the independent variables included in the model. To ensure the robustness of the analysis, it is important to check for multicollinearity between the independent variables. In the study, a thorough check of multicollinearity is conducted using the variance inflation factor (VIF). Identifying and dealing with multicollinearity is crucial to ensure the reliability and interpretability of the regression results.
The next two columns show models in which Log_CO2 is treated as an endogenous variable. In the Ivreg (2-IV) column, the instrumental variables are Log_RES and Log_TRANS. In the Ivreg (3-IV) column, three instrumental variables are considered: Log_RES, Log_TRANS and Log_GDP. However, neither specification yields satisfactory results in the test of overidentifying restrictions, which suggests that these models are not correctly specified. In addition, the R-squared is very low in the case of Ivreg (3-IV).
To examine the presence of multicollinearity among the variables, we calculated the variance inflation factor (VIF). The results are presented in
Table 5.
There are different recommendations for acceptable values of the VIF in the literature. While a maximum VIF value of 10 is generally advocated [
53], some researchers suggest stricter thresholds, e.g., VIF values that do not exceed 5 [
54] or even 4 [
55]. The analysis shows that the VIF values obtained are below these recommended thresholds. Moreover, Zuur et al. [56] suggest that a VIF of 3 or more may no longer be considered suitable. Since the VIF values obtained are below the recommended values, we can work with all variables in the following procedure.
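To make the VIF check concrete, the following is a minimal sketch in plain Python rather than the Stata routine used in the study. It assumes the textbook definition VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing regressor j on the remaining regressors. The helper names and the small data set are made up for illustration and are not the study's data.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def r_squared(y, columns):
    """R-squared from an OLS regression of y on an intercept and the given columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    xtx = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)] for p in range(k)]
    xty = [sum(X[i][p] * y[i] for i in range(n)) for p in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - rss / tss


def vif(columns):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on all other columns."""
    out = []
    for j, col in enumerate(columns):
        others = columns[:j] + columns[j + 1:]
        out.append(1.0 / (1.0 - r_squared(col, others)))
    return out


# Made-up illustrative data (not the study's): x3 is close to x1 + x2,
# so its VIF should be far above the thresholds discussed in the text.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
x3 = [3.1, 2.9, 7.2, 6.8, 11.1, 10.9, 15.2, 14.8]
vifs = vif([x1, x2, x3])
```

In this toy example the near-collinear column produces a very large VIF, whereas uncorrelated regressors would give values close to 1, which is the pattern the thresholds of 3, 4, 5 and 10 are meant to screen for.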
Moreover, we use the LASSO (Least Absolute Shrinkage and Selection Operator) method to confirm this conclusion. LASSO reduces the number of explanatory variables and can serve as a tool for selecting appropriate regressors. Lambda (λ) is a penalty parameter that determines the magnitude of regularisation, and knots are the values of λ at which the set of variables in the model changes. The method adds variables step by step (see the Action column) while minimising the Akaike Information Criterion (AIC) and taking into account model goodness-of-fit (R-squared), as captured in
Table 6.
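The role of λ can be illustrated with a small coordinate-descent sketch. It is not the Stata implementation used in the study. It assumes the common objective (1/2n)·||y − Xb||² + λ·||b||₁ on standardised regressors, and the function names and data are hypothetical. As λ falls from its maximum, coefficients leave zero one by one, which is what the knots record.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def standardise(col):
    """Centre a column and scale it to unit (population) variance."""
    n = len(col)
    mean = sum(col) / n
    sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
    return [(v - mean) / sd for v in col]


def lasso(y, columns, lam, sweeps=500):
    """Coordinate descent for (1/2n)*||y - Xb||^2 + lam*||b||_1 on standardised data."""
    n = len(y)
    mean_y = sum(y) / n
    yc = [v - mean_y for v in y]
    X = [standardise(c) for c in columns]
    beta = [0.0] * len(X)
    for _ in range(sweeps):
        for j, xj in enumerate(X):
            # Partial residual: remove the fit of every regressor except j.
            partial = [yc[i] - sum(beta[k] * X[k][i] for k in range(len(X)) if k != j)
                       for i in range(n)]
            rho = sum(xj[i] * partial[i] for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam)
    return beta


def lambda_max(y, columns):
    """Smallest penalty at which every coefficient is shrunk to zero."""
    n = len(y)
    mean_y = sum(y) / n
    return max(abs(sum(x * (v - mean_y) for x, v in zip(standardise(c), y)) / n)
               for c in columns)


# Made-up illustrative data (not the study's): y is driven mainly by x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 2.0, 1.0, 2.0, 1.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
path = [lasso(y, [x1, x2], lam) for lam in (lambda_max(y, [x1, x2]), 1.0, 0.1)]
```

Each entry of `path` holds the standardised coefficients at one penalty level. In this toy case x1 enters the model as soon as λ drops below its maximum, while x2 stays at zero for both smaller penalties.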
The regression coefficients obtained by the LASSO method are shown in
Table 7. According to the regression coefficients, Log_GDP, Log_TRANS, Log_RECY, and Log_EDU are the most impactful variables, while others (Log_CO2, Log_RES, and Log_Urban) have minor effects.
Comparing the LASSO and Post-est OLS, the magnitude and direction of relations are consistent, with only slight variations.
Consequently, the study continues with all variables in the subsequent analysis as their intercorrelations are within acceptable limits. This ensures the reliability and validity of the regression model and facilitates a comprehensive examination of the relationships between the variables of interest. Subsequently, a wide range of mathematical and statistical techniques are used in the analysis. The methodological procedure is illustrated in
Figure 1.
5. Results and Discussion
The Fisher ADF, Fisher PP and CIPS [57] unit root tests are used to investigate the stationarity of the variables. The results indicate that the series are stationary in first differences, i.e., integrated of order one, I(1) (Table 8).
The p-values are reported for most tests, except for the CIPS test, where the test statistic is reported. The critical values for the CIPS test at the 1%, 5%, and 10% significance levels are −2.81, −2.66, and −2.58, respectively. The majority of tests, except Fisher (ADF) for the Log_TRANS variable, show that the variables are integrated of order one, I(1). Some tests even classify the variables as stationary.
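As a rough illustration of how a Fisher-type panel test turns unit-by-unit results into one p-value, the sketch below assumes the inverse chi-square combination P = −2 Σ ln p_i. Under the null of a unit root in every panel this follows a chi-square distribution with 2N degrees of freedom. The country-level p-values are hypothetical and the function name is ours.

```python
import math


def fisher_combined(p_values):
    """Fisher-type statistic P = -2 * sum(ln p_i), chi-square with 2N df under H0."""
    n = len(p_values)
    stat = -2.0 * sum(math.log(p) for p in p_values)
    # For an even number of degrees of freedom (2n) the chi-square survival
    # function has the closed form exp(-x/2) * sum_{k<n} (x/2)^k / k!.
    half = stat / 2.0
    p_combined = math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(n))
    return stat, p_combined


# Hypothetical p-values from unit-by-unit ADF or PP regressions (not the study's results).
stat, p_combined = fisher_combined([0.04, 0.20, 0.01, 0.35])
```

A small combined p-value rejects the null that all panels contain a unit root. Applying the same combination to levels and to first differences is how a variable would be classified as I(1).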
Moving on to cointegration tests, these tests aim to determine whether there exists a long-term stable relationship between variables. Cointegration implies that economic time series, after short-term fluctuations, converge back to an equilibrium state.
The cointegration tests proposed by [
58,
59,
60] are applied to panel data in this study. The results, which are presented in
Table 9, predominantly support the hypothesis of cointegration among the variables.
In the following analysis, the study estimates the coefficients for both fixed-effects (FE) and random-effects (RE) models. The coefficients derived from the FE model are shown in Table 10. Compared with the multiple regression analysis, more coefficients are statistically significant. In particular, overall CO2 emissions and transport emissions negatively affect life expectancy, while GDP, renewable energy sources, education and recycling rates have a positive effect on life expectancy. However, the coefficient for the Urban variable is not statistically significant in our models.
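The idea behind the FE estimator can be sketched with the within transformation: each variable is demeaned by unit, which removes time-invariant country effects before a pooled regression is run. The example below is a single-regressor illustration on made-up data, not the study's multi-regressor Stata model.

```python
def within_slope(panel):
    """Fixed-effects (within) slope for one regressor.

    `panel` maps each unit to a list of (x, y) observations. Demeaning by unit
    removes the time-invariant unit effect before pooled OLS.
    """
    num = den = 0.0
    for obs in panel.values():
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            num += (x - mean_x) * (y - mean_y)
            den += (x - mean_x) ** 2
    return num / den


# Made-up illustrative panel (not the study's data): y = unit effect + 2 * x,
# with unit effects of 10 for "A" and 50 for "B".
panel = {
    "A": [(1.0, 12.0), (2.0, 14.0), (3.0, 16.0)],
    "B": [(1.0, 52.0), (2.0, 54.0), (4.0, 58.0)],
}
beta_fe = within_slope(panel)
```

Because the unit effects drop out after demeaning, the estimated slope recovers the common coefficient of 2 even though the two units have very different levels.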
The next two columns show models in which Log_CO2 is treated as an endogenous variable. In the xtIvreg (2-IV) column, the instrumental variables are Log_RES and Log_TRANS. In the xtIvreg (3-IV) column, three instrumental variables are considered: Log_RES, Log_TRANS and Log_GDP.
The choice of the FE model over the random effects model is supported by the results of the Hausman test and the Breusch-Pagan Lagrange multiplier test for random effects, indicating the suitability of the FE model for the analysis. The results of these diagnostic tests are presented with the residual tests for the FE model in
Table 11.
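For reference, the Hausman statistic in its standard textbook form can be written as follows. The notation is ours rather than the paper's, with k denoting the number of coefficients compared.

```latex
H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})'
    \left[ \widehat{\mathrm{Var}}(\hat{\beta}_{FE}) - \widehat{\mathrm{Var}}(\hat{\beta}_{RE}) \right]^{-1}
    (\hat{\beta}_{FE} - \hat{\beta}_{RE})
    \;\overset{H_0}{\sim}\; \chi^2_k
```

Under the null hypothesis the RE estimator is consistent and efficient, so the two sets of coefficients should differ only by sampling error. A large value of H indicates systematic differences and therefore favours the FE specification.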
The test statistic for Frees’ test of cross-sectional independence is 5.749, significantly exceeding the critical values derived from the Frees’ Q distribution at different significance levels: for α = 0.10, a critical value of 0.2559, for α = 0.05, a critical value of 0.3429 and for α = 0.01, a critical value of 0.5198. These results of the analysis of the residuals of the FE model indicate the presence of cross-sectional dependence (CSD) and heteroscedasticity in the data.
The study uses two advanced estimation methods to mitigate these econometric problems: the Common Correlated Effects (CCE) estimator and the instrumental variable estimator with common factors (2SIV). The CCE estimator is specifically designed to address the problem of cross-sectional dependence as it accounts for unobserved common factors that may influence the dependent variable across different cross-sections. Meanwhile, the 2SIV method addresses endogeneity concerns by using common factors as instrumental variables. This helps us deal with the correlation between the predictors and the error term while also considering heteroscedasticity.
Evaluating the analysed variables for CSD is an essential step in the analysis. For this purpose, the study uses the estimation of the exponent of CSD alpha (α) and the CD test developed by Pesaran [
60]. In addition, Chudik et al. [
61], cited in Ditzen [
62], distinguish four categories of CSD: weak (α = 0), semi-weak (0 < α < 0.5), semi-strong (0.5 ≤ α < 1) and strong (α = 1). The hypotheses for the CD test are formulated as follows: the null hypothesis (H0) states that a variable has weak CSD, whereas the alternative hypothesis (H1) states that a variable has strong CSD. The CSD outcomes are shown in
Table 12.
The estimated alpha coefficients (α) for all variables analysed are significantly above the established threshold of 0.5, indicating the prevalence of a strong CSD. In addition, the p-values associated with the null hypothesis in the CD test are sufficiently low to support the conclusion that the variables analysed have a strong CSD. This observation suggests that the variables are influenced by common factors or common shocks that affect multiple units within the panel, implying that the individual observations are not completely independent of each other.
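A minimal sketch of the CD statistic for a balanced panel is given below. It assumes the usual form CD = sqrt(2T / (N(N − 1))) · Σ_{i<j} ρ_ij, where ρ_ij is the pairwise correlation between units i and j. The classification helper simply encodes the four categories listed above. All names and data are illustrative.

```python
import math


def correlation(a, b):
    """Pearson correlation between two equally long series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))


def cd_statistic(series):
    """CD = sqrt(2T / (N(N-1))) * sum of pairwise correlations over i < j (balanced panel)."""
    n, t = len(series), len(series[0])
    total = sum(correlation(series[i], series[j])
                for i in range(n) for j in range(i + 1, n))
    return math.sqrt(2.0 * t / (n * (n - 1))) * total


def csd_category(alpha):
    """Classify the exponent of cross-sectional dependence into the four categories."""
    if alpha == 0:
        return "weak"
    if alpha < 0.5:
        return "semi-weak"
    if alpha < 1:
        return "semi-strong"
    return "strong"


# Made-up series for three units that move perfectly together (not the study's data).
base = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
series = [base, [2.0 * v for v in base], [v - 1.0 for v in base]]
cd = cd_statistic(series)
```

The statistic is compared with the standard normal distribution, so perfectly co-moving units such as these give a value far in the upper tail.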
Determining whether the slope coefficients are homogeneous (uniform across units) or heterogeneous (varying across units) is fundamental to econometric analysis as it directly influences the choice of estimation methods. In scenarios where the slope coefficients are homogeneous, several econometric techniques are applicable, including FE, RE, the generalised method of moments, and methods that account for structural breaks. Conversely, alternative methods are justified for models with heterogeneous slope coefficients. Neglecting to account for the heterogeneity in the slope, if it exists, can result in unreliable outcomes, as pointed out by Pesaran and Smith [
63]. Therefore, accurately identifying whether there is heterogeneity or homogeneity in the slope is essential for choosing the econometric model.
To determine whether there is heterogeneity or homogeneity in the slope within the dataset, this study utilises the Blomquist and Westerlund test [
64], as explained by Bersvendsen and Ditzen [
65]. This method enables us to thoroughly examine slope homogeneity while adjusting for CSD, heteroscedasticity and autocorrelation by applying the CR and HAC options for error estimation. The test yields a delta statistic of 15.994 with a p-value of 0.000, strongly suggesting that the slope coefficients are heterogeneous across units.
Given the rejection of the null hypothesis (H0) of slope homogeneity, the analysis requires estimation techniques that can account for slope heterogeneity. For this purpose, a heterogeneous panel estimation technique is required, in particular mean-group (MG) type models. This technique is designed to address the variability of coefficients across panel units.
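The mean-group idea can be sketched in a few lines: a separate regression is estimated for each unit and the unit-specific slopes are then averaged. The single-regressor example below uses made-up data and hypothetical function names. It is meant only to show the principle, not the estimators applied in the study.

```python
def ols_slope(obs):
    """OLS slope of y on x (with intercept) for one unit's (x, y) observations."""
    mean_x = sum(x for x, _ in obs) / len(obs)
    mean_y = sum(y for _, y in obs) / len(obs)
    num = sum((x - mean_x) * (y - mean_y) for x, y in obs)
    den = sum((x - mean_x) ** 2 for x, _ in obs)
    return num / den


def mean_group(panel):
    """Mean-group estimate: the simple average of unit-specific slopes."""
    slopes = {unit: ols_slope(obs) for unit, obs in panel.items()}
    return sum(slopes.values()) / len(slopes), slopes


# Made-up panel with heterogeneous slopes (1 for "A", 3 for "B"); not the study's data.
panel = {
    "A": [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)],
    "B": [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)],
}
beta_mg, unit_slopes = mean_group(panel)
```

A pooled regression would force one slope on both units, whereas the mean-group estimate keeps the unit-level slopes and reports their average.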
A statistical procedure for factor estimation was employed to estimate the number of common factors. The estimation results, which show the calculated number of factors when up to eight possible factors are considered, are detailed in
Table 13.
The initial estimates, as given in the first three rows of
Table 13, are based on different penalty functions: PC_{p1}, IC_{p1}, through IC_{p3}, as introduced by Bai and Ng [
66]. These researchers developed three different penalty functions to estimate the number of factors in factor models, namely PC_{p1}, PC_{p2}, and PC_{p3}. The basic idea behind these functions is to achieve an optimal balance between the goodness of fit of the model and its complexity and thus determine an appropriate number of factors. However, it should be noted that, as suggested by Ditzen [
67], these estimates tend to overestimate the true number of factors. In addition, the GOL estimator is not included in the analysis, although it is well suited for testing residuals. Therefore, the study primarily focuses on identifying a more conservative estimate of the number of factors between one and three.
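For orientation, the three PC criteria are commonly written as below. The notation is ours and may differ from the software output: V(k) is the average squared residual when k factors are extracted, \hat{\sigma}^2 is V evaluated at the maximum number of factors considered, and N and T are the cross-sectional and time dimensions.

```latex
\begin{aligned}
PC_{p1}(k) &= V(k) + k\,\hat{\sigma}^2\,\frac{N+T}{NT}\,\ln\!\left(\frac{NT}{N+T}\right),\\
PC_{p2}(k) &= V(k) + k\,\hat{\sigma}^2\,\frac{N+T}{NT}\,\ln C_{NT}^2,\\
PC_{p3}(k) &= V(k) + k\,\hat{\sigma}^2\,\frac{\ln C_{NT}^2}{C_{NT}^2},
\end{aligned}
\qquad C_{NT} = \min\{\sqrt{N}, \sqrt{T}\}
```

The estimated number of factors is the k that minimises the chosen criterion. The first term rewards fit and the second penalises each additional factor, which is the trade-off described above.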
To meet the challenge of a strong CSD within panel data, the study employs a CCE estimator as developed and refined by Ditzen [
68]. This approach operationalises the procedures described in the seminal work of Chudik and Pesaran [
69] and the basic principles outlined by Pesaran et al. [
70]. The CCE estimator requires the inclusion of a sufficient number of lags of the cross-sectional averages in each panel equation. Importantly, the number of cross-sectional averages must be at least as large as the number of unobserved common factors affecting the panel. A key advantage of the CCE estimator is its ability to effectively account for unobserved common factors without requiring precise information about their total number.
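In generic notation, which is ours and not taken from the paper, a CCE regression with lagged cross-sectional averages and its mean-group version can be sketched as follows. Here p_T is the assumed number of lags and \bar{z}_t collects the cross-sectional averages of the dependent variable and the regressors.

```latex
y_{it} = \alpha_i + \beta_i' x_{it} + \sum_{l=0}^{p_T} \delta_{il}' \bar{z}_{t-l} + e_{it},
\qquad
\bar{z}_t = \frac{1}{N} \sum_{i=1}^{N} (y_{it}, x_{it}')',
\qquad
\hat{\beta}_{MG} = \frac{1}{N} \sum_{i=1}^{N} \hat{\beta}_i
```

The averages \bar{z}_t act as observable proxies for the unobserved common factors, which is why the number of factors does not need to be known exactly.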
In parallel, the study employs the 2SIV method, a recent development described in detail in the works of Norkutė et al. [
71] and Cui et al. [
72]. The 2SIV method separates common factors from the error term and the regression variables in two steps. The 2SIV algorithm, introduced by Kripfganz and Sarafidis [
73], first uses PCA to remove common factors from the regressors. In the second stage, the model residuals are then calculated using the first-stage parameters. The model is further adjusted for the factor residuals before another IV regression is run with the obtained instrumental variables. The comparative effectiveness of these approaches (CCE and 2SIV) is shown in
Table 14.
In the first column, all the regressors were chosen as mean group variables (7), and all the variables were chosen as cross-sectional mean variables (8). The second column contains the coefficients of the CCE model, where Log_CO2 has been selected as the endogenous variable and Log_RES and Log_TRANS as the two instrumental variables. The fourth column presents the parameters of the 2SIV mean group estimator, where factmax = 3, the number of instruments is 24, and there is 1 factor in X. Lags (0–2) of all regressors were used to calculate the instrumental variables in the fourth column. All models have a small number of significant coefficients, and the third model has no significant coefficient. The CD test and the alpha coefficient were used to test the cross-sectional dependence of the CCE model residuals. Acceptable results in terms of weak cross-sectional dependence are obtained in the first and second columns. Since the results obtained from the CCE and 2SIV models are not satisfactory, a partially heterogeneous approach is used to address the issue.
The second column presents the parameters of the mean group estimator, where factmax = 3, the number of instruments is 24, and there is 1 factor in X. Lags 0–2 of all regressors are used to calculate the instrumental variables in the second column.
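For reference, a mean group estimate is the simple average of the unit-specific coefficients, with a standard error based on their dispersion across units. The sketch below assumes the unit-level coefficients are already available; the function name and the input values are hypothetical.

```python
import math
from typing import List, Tuple


def mean_group(unit_coefficients: List[float]) -> Tuple[float, float]:
    """Return the mean group estimate and its nonparametric standard error."""
    n = len(unit_coefficients)
    if n < 2:
        raise ValueError("at least two units are needed for a standard error")
    estimate = sum(unit_coefficients) / n
    # Variance of the average: sum of squared deviations divided by N(N - 1).
    variance = sum((b - estimate) ** 2 for b in unit_coefficients) / (n * (n - 1))
    return estimate, math.sqrt(variance)


# Hypothetical unit-specific slopes for three units.
mg_estimate, mg_se = mean_group([1.0, 2.0, 3.0])
```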
Despite the rigorous application of these models, only a limited number of coefficients were found to be statistically significant. Within the CCE framework, only the coefficient for Log_Urban is significant, while the 2SIV approach identified significant coefficients for Log_CO2 and Log_Trans; all of these coefficients indicate a negative effect on life expectancy.
Further examination of CSD within the CCE model residuals, using the CD test and estimation of the alpha coefficient, indicated only weak CSD. Despite these nuanced approaches, the results obtained from the CCE and 2SIV models are not entirely satisfactory, leading us to consider a partially heterogeneous approach as a possible solution.
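The CD statistic used for this check can be sketched for a balanced panel as the scaled sum of all pairwise residual correlations, CD = sqrt(2T / (N(N − 1))) Σ ρ_ij over i < j, which is approximately standard normal under the null hypothesis. The code below is an illustrative implementation with hypothetical residuals; it does not reproduce the alpha (exponent) estimation.

```python
import math
from itertools import combinations
from typing import List


def correlation(a: List[float], b: List[float]) -> float:
    """Sample correlation between two equally long residual series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def pesaran_cd(residuals: List[List[float]]) -> float:
    """CD statistic for a balanced panel; one residual series per unit."""
    n = len(residuals)
    t = len(residuals[0])
    total = sum(correlation(residuals[i], residuals[j])
                for i, j in combinations(range(n), 2))
    return math.sqrt(2.0 * t / (n * (n - 1))) * total


# Hypothetical residuals: two units moving in exactly opposite directions.
cd_toy = pesaran_cd([[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]])
```

Values of the statistic far from zero point to cross-sectional dependence that the model has not absorbed.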
The traditional assumption that model coefficients are homogeneous across entities in a panel data model is often difficult to justify, both theoretically and practically. Conversely, the assumption of complete heterogeneity, where each entity has its own regression coefficients, may not always be appropriate, as it overlooks the potential for generalisable patterns across entities. Sarafidis and Weber [
74] critique these conventional approaches, suggesting that both the homogeneous (pooling) and fully heterogeneous models represent polar extremes. They argue for exploring intermediate solutions, which are likely to provide more realistic and nuanced insights in many practical scenarios.
In this vein, Sarafidis and Weber [
74] propose a partial heterogeneity framework for panel data analysis. This framework allows for varying degrees of heterogeneity across units, recognising that while differences exist, they may not be as pronounced or uniform as the extremes of complete homogeneity or heterogeneity suggest. Christodoulou and Sarafidis [
75] further operationalise this concept by introducing the command (xtregcluster) in Stata, which implements a panel regression clustering approach. This method segments entities into Ω clusters, within which regression coefficients are assumed to be homogeneous. The residual sum of squares (RSS) for each cluster ω is denoted as $RSS_{\omega}$, and the total RSS is calculated as the sum of the $RSS_{\omega}$ for all clusters, i.e., $RSS = \sum_{\omega=1}^{\Omega} RSS_{\omega}$. The optimal number of clusters, Ω, is determined by minimising the Model Information Criterion (MIC), which can be shown in Equation (1):
$$MIC = N \log\left(\frac{RSS}{N\bar{T}}\right) + f(\Omega)\,\Theta_N \qquad (1)$$
where $\bar{T}$ represents the average length of the time series for the panel. For panels with equal time-series length, it simplifies to $\bar{T} = T$. The term $f(\Omega)\,\Theta_N$ serves as a penalty function to prevent overfitting by penalising the inclusion of excessive clusters. Here, $f(\Omega)$ is a strictly increasing function of the number of clusters, with the default setting $f(\Omega) = \Omega$ and $\Theta_N = \frac{1}{3}\log N + \frac{2}{3}\sqrt{N}$. Other common values for $\Theta_N$ are $\ln(N)$ and $\sqrt{N}$.
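The selection rule can be sketched numerically. The code assumes that the criterion takes the form N log(RSS / (N T̄)) + f(Ω) Θ_N with f(Ω) = Ω and the default penalty Θ_N = (1/3) log N + (2/3) √N; the RSS values are hypothetical and are not the study's results.

```python
import math
from typing import Dict, Optional


def default_theta(n_units: int) -> float:
    """Assumed default penalty weight: (1/3) log N + (2/3) sqrt(N)."""
    return math.log(n_units) / 3.0 + 2.0 * math.sqrt(n_units) / 3.0


def mic(total_rss: float, n_units: int, avg_periods: float,
        n_clusters: int, theta: Optional[float] = None) -> float:
    """Information criterion for one clustering configuration, with f(Omega) = Omega."""
    if theta is None:
        theta = default_theta(n_units)
    fit = n_units * math.log(total_rss / (n_units * avg_periods))
    return fit + n_clusters * theta


def best_cluster_count(rss_by_count: Dict[int, float],
                       n_units: int, avg_periods: float) -> int:
    """Number of clusters with the lowest criterion value."""
    scores = {k: mic(rss, n_units, avg_periods, k)
              for k, rss in rss_by_count.items()}
    return min(scores, key=scores.get)


# Hypothetical total RSS for 1, 2 and 3 clusters in a panel of 25 units.
toy_rss = {1: 100.0, 2: 40.0, 3: 38.0}
chosen = best_cluster_count(toy_rss, n_units=25, avg_periods=10.0)
```

In this toy case, moving from two to three clusters lowers the RSS only slightly, so the penalty term dominates and two clusters are selected.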
The next step is identifying the most appropriate regression clustering model for the panel data. The MIC values for potential clustering configurations, ranging from 1 to 10 clusters, are presented in
Table 15.
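The partitioning step behind each candidate configuration can be illustrated with a simplified sketch. It is not the algorithm implemented in xtregcluster: it assumes a single regressor, removes unit fixed effects by demeaning, and alternates between estimating one slope per cluster and reassigning each unit to the cluster that fits it best. All names and data are hypothetical.

```python
from typing import Dict, List, Tuple

Series = Tuple[List[float], List[float]]  # (x, y) observations for one unit


def demean(values: List[float]) -> List[float]:
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def slope(units: List[Series]) -> float:
    """Pooled within-slope; assumes x varies over time in the given units."""
    sxy = sum(a * b for x, y in units for a, b in zip(x, y))
    sxx = sum(a * a for x, _ in units for a in x)
    return sxy / sxx


def unit_rss(unit: Series, beta: float) -> float:
    x, y = unit
    return sum((b - beta * a) ** 2 for a, b in zip(x, y))


def regression_clusters(panel: Dict[str, Series], k: int,
                        max_iter: int = 50) -> Dict[str, int]:
    within = {u: (demean(x), demean(y)) for u, (x, y) in panel.items()}
    # Initial partition: rank units by their own slope, cut into k blocks.
    ranked = sorted(within, key=lambda u: slope([within[u]]))
    labels = {u: i * k // len(ranked) for i, u in enumerate(ranked)}
    betas = [0.0] * k
    for _ in range(max_iter):
        for c in range(k):
            members = [within[u] for u in within if labels[u] == c]
            if members:  # an empty cluster keeps its previous slope
                betas[c] = slope(members)
        new_labels = {u: min(range(k), key=lambda c: unit_rss(within[u], betas[c]))
                      for u in within}
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Hypothetical panel: two units with slope 2 and two units with slope -1.
xs = [0.0, 1.0, 2.0, 3.0]
toy = {
    "u1": (xs, [1 + 2 * v for v in xs]),
    "u2": (xs, [5 + 2 * v for v in xs]),
    "u3": (xs, [-v for v in xs]),
    "u4": (xs, [3 - v for v in xs]),
}
toy_labels = regression_clusters(toy, k=2)
```

The total RSS of the resulting partition is the quantity that enters the information criterion for that number of clusters.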
The lowest MIC value (−297.344) is obtained for 3 clusters. The values of the model coefficients for each cluster and for the overall model (Pooled), together with other model characteristics and tests, are shown in
Table 16.
The composition of the clusters is as follows:
First cluster: Comprising only three EU countries, BGR, EST and LVA, this cluster is characterised by the significant influence of urbanisation and transport (specifically the average CO2 emissions per km of new passenger cars) on life expectancy, both of which have a negative effect. Conversely, the use of renewable energy sources has a significant positive effect. GDP and education also contribute positively, but their effects are not statistically significant in this cluster. CO2 emissions and recycling rates show negative tendencies, which are not significant either.
Second cluster: Comprising AUT, BEL, CYP, DEU, GRC, HUN, LTU, NLD, IRL, and SWE, this cluster shows the largest number of statistically significant effects. All variables, except urbanisation, have a significant impact on life expectancy. In particular, GDP, recycling rate, renewable energy consumption and education level increase life expectancy. On the other hand, CO2 emissions and transport emissions contribute to decreased life expectancy. Urbanisation shows a positive trend but does not reach statistical significance.
Third cluster: This cluster is the largest, comprising 12 countries, namely CZE, DNK, ESP, FIN, FRA, ITA, LUX, POL, PRT, ROU, SVK, and SVN. Within this cluster, GDP and education have significant positive effects on life expectancy. As in the previous clusters, transport emissions have a negative effect on life expectancy. A visualisation of the cluster analysis results is presented in
Figure 2.
The impact of smart city indicators depicted in
Figure 2 is distinguished by different colours. In particular, green signifies that the variable leads to an increase in life expectancy at birth, red indicates that the variable decreases life expectancy at birth and grey denotes variables whose impact is statistically insignificant.
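The colour coding amounts to a simple rule on the sign and significance of each coefficient, sketched below. The 5% significance threshold is an assumption made for illustration, as is the function name.

```python
def effect_colour(coefficient: float, p_value: float, alpha: float = 0.05) -> str:
    """Map a coefficient to the colour used for its effect on life expectancy."""
    if p_value >= alpha:
        return "grey"  # statistically insignificant
    return "green" if coefficient > 0 else "red"
```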
The scatter plot of the dependent variable against the linear prediction is shown in
Figure 3, with different clusters indicated by different colours and each cluster associated with a line segment.
Figure 4 shows the geographical distribution of countries into the three clusters based on the performed analysis.
The coefficients of determination are remarkably high in all cases. In particular, the coefficients of determination for individual clusters exceed the coefficient of determination for the pooled model. This means that the regression models fitted to each cluster explain a greater proportion of the variance in the dependent variable than the model that includes all clusters together. These findings suggest that commonly used panel data approaches can yield overly general results that reflect neither the specificities of the analysed units nor the heterogeneity in the relationships between variables across clusters. By contrast, cluster analysis provides more accurate results, as it considers cluster-specific characteristics that better reflect the context of the analysed countries.
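A stylised example shows why cluster-specific fits can outperform a pooled fit. It uses simple OLS on hypothetical data with opposite slopes in two groups, rather than the fixed-effects panel models estimated in this study.

```python
from typing import List


def ols_r_squared(x: List[float], y: List[float]) -> float:
    """Coefficient of determination of a simple OLS fit with an intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    rss = sum((b - intercept - beta * a) ** 2 for a, b in zip(x, y))
    tss = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - rss / tss


# Hypothetical groups with opposite relationships between x and y.
x_a, y_a = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
x_b, y_b = [1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0]

r2_group_a = ols_r_squared(x_a, y_a)
r2_group_b = ols_r_squared(x_b, y_b)
r2_pooled = ols_r_squared(x_a + x_b, y_a + y_b)
```

Each group is fitted perfectly on its own, whereas the pooled slope averages the two opposite relationships and explains none of the variance.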