Classification of Regional Healthy Environment and Public Health in China

Environmental pollution has become a hot topic of concern for the government, academia and the public. The evaluation of environmental health should not only relate to environmental quality and exposure channels but also the level of economic development, social environmental protection responsibility and public awareness. We put forward the concept of the “healthy environment” and introduced 27 environmental indicators to evaluate and classify the healthy environment of 31 provinces and cities in China. Seven common factors were extracted and divided into economic, medical, ecological and humanistic environment factors. Based on the four environmental factors, we classify the healthy environment into five categories—economic leading healthy environment, robust healthy environment, developmental healthy environment, economic and medical disadvantageous healthy environment and completely disadvantageous healthy environment. The population health differences among the five healthy environment categories show that economic environment plays a major role in population health. Public health in regions with sound economic environment is significantly better than that in other areas. Our classification result of healthy environment can provide scientific support for optimizing environmental countermeasures and realizing environmental protection.


Introduction
Climate change and other interdependent environmental disruptions caused by human activity have been recognized as the greatest threats ever faced by us [1]. The most direct and easily felt consequence of environmental pollution is the degradation of the quality of the human environment, affecting human physical health and production activities. For example, long-term exposure to PM 2.5 might harm residential health and can increase premature death from specific diseases [2], as well as significantly increase the population's risk of heart disease, stroke, chronic obstructive pulmonary disease and lung cancer [3,4]. Improvements in air quality are associated with greater life expectancy, reduced infant mortality, higher property values, increased productivity, higher earnings and other benefits [5]. Water pollution worsens the quality of the water environment and generally degrades the quality of drinking water sources, threatening people's health and causing infertility and fetal malformations. In addition, exposure to bioaccumulated antibiotics poses serious health risks to ecosystems and humans [6].
This relationship between the environment and human health has been prominent among the concerns of international organizations, including the World Health Organization (WHO), the World Bank, the United Nations, the Organization for Economic Co-operation and Development and the European Union (EU) [7]. Current research on how the environment affects residents' health mainly falls into two categories. The first category is the outdoor environment, which is divided into the natural environment, such as air quality and water pollution, and the social environment. For example, the indicators selected to link aquatic environments to human health have included child mortality rates due to acute diarrhea and acute respiratory tract infections.
Previous studies show the following. First, scholars mainly discuss the impact of a single type of environment, such as the natural environment or the indoor and outdoor environment, on residents' health, and few of them measure the environment from multiple aspects. Second, the existing indicators of residents' health focus on the death rate or prevalence rate of certain diseases, and so lack completeness and are not representative of the population. Third, assessments concentrate on the level of regional environmental health risk, report results as high or low risk, and fail to clarify the characteristics of environmental risk through further segmentation. Generally, the environment includes a variety of categories such as the natural environment, the cultural environment and the economic environment, but the "healthy environment" is rarely put forward. What is a "healthy environment"? We define a "healthy environment" as a healthy, safe and comfortable social environment that can secure for the public a good physical and mental state by meeting the public's physiological, psychological and social demands. Therefore, the evaluation of a "healthy environment" should not only be linked to ecological factors but also be integrated into an overall evaluation system along with the social and economic environment, the medical environment and other aspects of human life. The study of the healthy environment index system provides theoretical guidance and a qualitative and quantitative basis for measuring and judging healthy environments.
Based on various indicators such as the economic and social environment, we evaluated and classified the healthy environment of 31 provinces and cities in China, described the characteristics of the different kinds of healthy environment, and examined their relationship with public health. The main possible innovation in our work lies in the concept of the "healthy environment" and its classification. Using four types of environmental factor, we evaluated and classified regional healthy environments. We believe that this classification matters because public health can differ across types of healthy environment. For example, Mitchell [28] concludes that physical activity in natural environments is more strongly associated with a reduction in the risk of poor mental health than activity in other environments; activity in different types of environment may promote different kinds of positive psychological responses.

Variable Selection and Data Description
Indicators are measurements selected to represent a larger phenomenon of interest to the researcher [26], in our case, the relationships between population health and a healthy environment. However, the public's living environment is not a self-contained, homogeneous entity, but a complex system characterized by a number of different structures (e.g., educational, economic, mobility and political structures) that have their own dynamics and interact with each other in a complex grid [29]. Therefore, the proposed index system of the healthy environment consists of four parts: medical environment, economic environment, ecological environment and humanistic environment (Table 1).
The medical environment mainly reflects medical infrastructure and public health awareness, including the proportion of maternity insurance, the number of practicing (assistant) physicians per 1000 persons, the number of health institutions and so on. The economic environment not only takes basic economic indicators into account, but also includes consumption, expenditure and disposable income per capita to reflect residents' income and consumption levels. The ecological environment includes measures of the energy intensity and emission intensity of polluting gases, which represent the environmental factors that may affect regional air quality and population health. In addition, urban environmental protection investment represents governments' emphasis on the environment, and the water resources and green park area per capita are used to represent the regional environmental quality. The humanistic environment covers educational attainment level, internet penetration rate and so on. Finally, 27 indicators were selected to build the whole healthy environment system (Table 1). The data cover 31 provinces and municipalities in China from 2010 to 2020, and mainly come from the China Statistical Yearbook, China Environmental Statistical Yearbook and China Health Statistical Yearbook.

Factor analysis can be crudely described as an extension of the correlational method. When several variables are found to be rather highly correlated, it may be inferred that they are connected in some way, perhaps by a common underlying variable which is not immediately present in the measurements, but which nevertheless would account for them to a major extent [30]. Therefore, factor analysis is a method to find a few random variables that can synthesize the main information of all variables by studying the internal dependence of the correlation matrix among multiple variables.
These random variables cannot be measured directly, so they are usually called common factors, and the common factors are uncorrelated with one another. All variables can be expressed as linear combinations of the common factors. The purpose of factor analysis is to reduce the number of variables by replacing them with a few common factors. Given n samples and p indicators, $X = (X_1, X_2, \ldots, X_p)^T$ is the random vector of observed variables and $F = (F_1, F_2, \ldots, F_m)^T$ is the vector of common factors, so that

$$X_j = a_{j1}F_1 + a_{j2}F_2 + \cdots + a_{jm}F_m + \varepsilon_j, \quad j = 1, 2, \ldots, p.$$

The above model is called the factor model. The matrix $A = (a_{ji})$ is called the factor loading matrix; $a_{ji}$ is the factor loading, which is essentially the correlation coefficient between common factor $F_i$ and variable $X_j$ and indicates how much variable $X_j$ depends on $F_i$. $\varepsilon_j$ is a special factor, representing the variation in the variable caused by influences other than the common factors. The task of factor analysis is to analyze the internal correlation structure between variables.

Larger samples outperform smaller samples because they reduce the probability of errors. Various recommendations pertaining to sample size can be found in the literature. While some authors highlight the importance of absolute sample size, most researchers focus on the ratio between subjects and variables, and recommendations frequently include ratios of 5:1 or 10:1 [31]. In addition, the variables should be correlated. If the variables are independent of each other, common factors cannot be extracted. This can be checked with Bartlett's test of sphericity and with the KMO test statistic, whose value lies between 0 and 1. The closer the KMO value is to 1, the smaller the partial correlations are relative to the simple correlations between variables, and the better suited the data are to factor analysis. In practice, when the KMO test statistic is greater than 0.7, factor analysis will generally give a good result.
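The two suitability checks described above can be illustrated with a minimal sketch. It assumes NumPy is available; the function name `kmo_and_bartlett` and the simulated data are hypothetical and are not the study's indicators. The p-value for Bartlett's statistic is left out, since it would come from a chi-square distribution supplied by a statistics library.

```python
import numpy as np

def kmo_and_bartlett(data):
    """Return (KMO statistic, Bartlett chi-square, degrees of freedom).

    data: array of shape (n_samples, n_variables).
    """
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)

    # Partial correlations are obtained from the inverse correlation matrix.
    inv = np.linalg.inv(corr)
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale

    off = ~np.eye(p, dtype=bool)          # mask for off-diagonal entries
    r2 = np.sum(corr[off] ** 2)           # squared simple correlations
    q2 = np.sum(partial[off] ** 2)        # squared partial correlations
    kmo = r2 / (r2 + q2)

    # Bartlett's test of sphericity: H0 says corr is the identity matrix.
    _, logdet = np.linalg.slogdet(corr)
    chi2 = -(n - 1 - (2 * p + 5) / 6.0) * logdet
    dof = p * (p - 1) // 2
    return kmo, chi2, dof

# Hypothetical data: six indicators driven by one shared latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 1))
demo = latent + 0.5 * rng.normal(size=(300, 6))
kmo, chi2, dof = kmo_and_bartlett(demo)
```

Because the simulated indicators share one latent factor, their partial correlations are small relative to their simple correlations, so the KMO value in this toy case is high.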
The principal component method is often used to extract common factors. This method treats each variable as a linear combination of the common factors, so that the principal components explain as much of the variables' variance as possible, with the proportion explained by each successive common factor decreasing. In the total variance explained table, common factors are extracted according to the default criterion of eigenvalues greater than 1. In addition, each common factor in factor analysis should have practical significance. To make the coefficients in the factor loading matrix easier to interpret, the initial loading matrix can be rotated to redistribute the relationship between the common factors and the original variables, so that the absolute values of the loadings are pushed toward either 0 or 1 and each common factor has a clearer interpretation. The commonly used rotation method is maximum variance orthogonal rotation (varimax), which maximizes the variance of the squared loadings within each common factor to facilitate interpretation.
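As an illustration of the extraction and rotation steps, the following is a minimal sketch assuming NumPy. The function names `extract_loadings` and `varimax` and the simulated data are hypothetical; the varimax routine is a commonly used SVD-based iteration and is not necessarily the exact procedure of the statistical package used in the study.

```python
import numpy as np

def extract_loadings(data, min_eigenvalue=1.0):
    """Principal-component loadings for factors with eigenvalue > min_eigenvalue."""
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]            # sort eigenvalues descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > min_eigenvalue              # eigenvalue-greater-than-1 rule
    return eigvecs[:, keep] * np.sqrt(eigvals[keep]), eigvals

def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonal varimax rotation of a (variables x factors) loading matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        col_ss = (rotated ** 2).sum(axis=0)      # sum of squares per factor
        grad = loadings.T @ (rotated ** 3 - rotated * col_ss / p)
        u, s, vt = np.linalg.svd(grad)
        rotation = u @ vt                        # stays orthogonal
        if s.sum() < criterion * (1 + tol):
            break
        criterion = s.sum()
    return loadings @ rotation

# Hypothetical data: two latent factors, three indicators each.
rng = np.random.default_rng(1)
latent = rng.normal(size=(500, 2))
demo = np.hstack([latent[:, [0]] + 0.5 * rng.normal(size=(500, 3)),
                  latent[:, [1]] + 0.5 * rng.normal(size=(500, 3))])
loadings, eigvals = extract_loadings(demo)
rotated = varimax(loadings)
```

Since the rotation is orthogonal, each variable's communality (the row sum of squared loadings) and the total explained variance are the same before and after rotation; only the split across factors changes.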

K-Means Clustering Method
We use a non-hierarchical clustering method, K-means, to divide the cases into five categories. The purpose of non-hierarchical clustering is to quickly divide cases into K categories. Generally, the number of categories needs to be determined before classification, and the entire analysis is carried out iteratively. The K-means clustering steps are as follows:

1. We determine the cluster number and divide the cases into five categories according to the different environmental factors.
2. We set the initial clustering centers, either as specified in advance or according to the structure of the data itself.
3. We calculate the distance between each case and the initial clustering centers, assign each case to a category according to the principle of minimum distance, and calculate the new clustering center of each category. Euclidean distance is commonly used to measure the distance between samples $X_i = (X_{i1}, X_{i2}, \ldots, X_{ip})^T$ and $X_j = (X_{j1}, X_{j2}, \ldots, X_{jp})^T$; the formula is:

   $$d(X_i, X_j) = \sqrt{\sum_{k=1}^{p} (X_{ik} - X_{jk})^2}$$

4. According to the new clustering centers, we recalculate the distance between each case and the new clustering centers, reclassify the cases and update the clustering center of each category.
5. Step (4) is repeated until the moving distance of all clustering centers is less than 2% of the minimum distance between the initial clustering centers, or the specified maximum number of iterations is reached.
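The steps above can be sketched as follows. This is a minimal illustration assuming NumPy, not the statistical package's implementation; the function names, the default of 10 iterations and the toy points are hypothetical. The stopping rule uses 2% of the smallest distance between the initial centers.

```python
import numpy as np

def nearest_center(points, centers):
    """Index of the closest center (Euclidean distance) for every point."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def k_means(points, initial_centers, rel_tol=0.02, max_iter=10):
    points = np.asarray(points, dtype=float)
    centers = np.asarray(initial_centers, dtype=float).copy()
    k = len(centers)
    # Threshold: rel_tol times the smallest distance between initial centers.
    pairwise = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    threshold = rel_tol * pairwise[~np.eye(k, dtype=bool)].min()
    for _ in range(max_iter):
        labels = nearest_center(points, centers)      # assign by minimum distance
        new_centers = centers.copy()
        for c in range(k):
            members = points[labels == c]
            if len(members):                          # empty cluster keeps its center
                new_centers[c] = members.mean(axis=0)
        shift = np.linalg.norm(new_centers - centers, axis=1).max()
        centers = new_centers
        if shift < threshold:                         # no center moved enough
            break
    return nearest_center(points, centers), centers

# Hypothetical toy data: two well-separated groups of three cases.
toy = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = k_means(toy, initial_centers=[[0, 0], [10, 10]])
```

In this toy case the centers settle on the group means after the first update, and the second pass moves no center, so the loop stops.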

Factor Analysis Result
The exploratory factor analysis model was used. Bartlett's test of sphericity was significant (p < 0.01), and the KMO test statistic is 0.828 (Table A1), indicating substantial shared information among the variables. Table A2 shows the communalities (common factor variances), and the proportion of information extracted is above 80% for most variables, indicating that the extracted common factors have a strong explanatory ability for most variables. The common factors were extracted according to the default criterion of an eigenvalue greater than one. Combined with the scree plot (Figure 1), which shows the importance of each common factor, seven common factors are finally extracted. The horizontal axis of the scree plot is the number of common factors, and the vertical axis shows the eigenvalues in descending order. The cumulative variance contribution rate of the seven common factors was 85.909% (Table A3). After rotation, the information was redistributed with an unchanged cumulative variance contribution rate. The variance contribution rate of each of the seven common factors changed, and the gap between them decreased. The component matrix after rotation is shown in Table 2.

Figure 1. Scree plot.

In the rotated component matrix, the variables are sorted according to their coefficients, and coefficients with an absolute value less than 0.3 are not displayed. Combining the results in Table 2 and the indicator classification in Table 1, we try to assign each factor to one of the economic, medical, ecological and humanistic environments. It can be seen that the first common factor F 1 has large loadings on indicators reflecting the overall economic situation, such as X13 household consumption expenditure per capita, X14 disposable income per capita and X11 GDP per capita, so common factor F 1 can be named the economic environment factor.
The second common factor F 2 has large loadings on social, medical and health infrastructure, including X5 the number of health institutions and X6 community-level medical institutions, so we name F 2 the medical environment factor. F 3 to F 6 have large loadings on the indicators reflecting production and ecological investment, such as X20 water resources per capita, X15 energy intensity and X17 NO emission intensity, so they are named ecological environment factors. F 7 mainly includes population density and is named the humanistic environment factor. Each variable can be expressed as a linear combination of the seven common factors:

$$X_j = a_{j1}F_1 + a_{j2}F_2 + \cdots + a_{j7}F_7 + \varepsilon_j, \quad j = 1, 2, \ldots, 27,$$

where $a_{ji}$ is the loading of variable $X_j$ on common factor $F_i$ given in Table 2 and $\varepsilon_j$ is the special factor.

It should be pointed out that the variables above are standardized variables. The above functions represent each variable as a linear combination of common factors, but it is also necessary to express each common factor as a linear function of the variables, which is called the factor score function. Usually, the regression method is adopted to estimate the factor scores. The essence of the regression method is to establish a regression equation between the common factors and the original variables. The component score coefficient matrix is shown in Table A4. The expression of each common factor is:

$$F_i = b_{i1}X_1 + b_{i2}X_2 + \cdots + b_{i,27}X_{27}, \quad i = 1, 2, \ldots, 7,$$

where $b_{ij}$ is the score coefficient of variable $X_j$ for common factor $F_i$ in Table A4. Through the above functions, the environment scores of the seven common factors can be calculated. Since the seven common factors reflect the level of the local healthy environment in different aspects, we take the proportion of the variance contribution rate (in Table A3) corresponding to each common factor as the weight to calculate the comprehensive score, namely:

$$S = \sum_{i=1}^{7} \frac{\lambda_i}{\sum_{k=1}^{7} \lambda_k} F_i,$$

where $\lambda_i$ is the variance contribution rate of common factor $F_i$.
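The weighting described above amounts to a weighted average of the factor scores. A minimal sketch follows; the function name `composite_score` and all numbers are hypothetical placeholders, not the values from Tables A3 and A4.

```python
def composite_score(factor_scores, variance_contributions):
    """Weight each factor score by its share of the total variance contribution."""
    total = sum(variance_contributions)
    weights = [c / total for c in variance_contributions]
    return sum(w * f for w, f in zip(weights, factor_scores))

# Hypothetical example with three factors instead of seven.
# Contributions of 30, 20 and 10 give weights of 1/2, 1/3 and 1/6.
example = composite_score([1.0, 0.0, -1.0], [30.0, 20.0, 10.0])
```

Because the weights sum to one, a region that scores the same value on every factor receives exactly that value as its comprehensive score.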

Cluster Analysis Results
Based on the common factor scores, we obtain four environmental factor scores, namely the economic environment factor score $S_{\text{Economic}}$, the medical environment factor score $S_{\text{Medical}}$, the ecological environment factor score $S_{\text{Ecological}}$ and the humanistic environment factor score $S_{\text{Humanistic}}$, as well as the total score S. In order to explore the characteristics of different categories of healthy environment, we classified the scores of the four environmental factors through the K-means clustering method. Based on the distance between each case and its clustering center after classification, and considering the practical meaning of each category, we finally divide the samples into five categories. The initial clustering centers are shown in Table A5; each is the set of environmental factor scores of one of five cases in the sample. Table A6 shows the iteration history: the change in the cluster centers shrinks at each step until it approaches zero. We set the convergence criterion to 0.02, so the iteration stops when a full iteration is unable to move any cluster center by 2% of the minimum distance between any initial cluster centers. The whole iteration process terminates at the seventh step, achieving convergence. The final clustering centers are shown in Table 3; each is the average score within a category, and they can be used to describe the characteristics of the five types of healthy environment in terms of the four environmental factors. Figure 2 shows the distribution of environmental factors in each healthy environment category. Category C has the largest number of cases, accounting for 43.7%, while Category E has the smallest number. Sometimes the number of classified cases can play an auxiliary role in determining the final category characteristics. The most obvious weakness of cluster analysis is that it can always produce several groups regardless of whether distinct categories actually exist in the data.
Therefore, it is very important to verify the validity of clustering results. The ANOVA result in Table A7 shows that the environmental factor scores differ significantly across the healthy environment categories. Figure 3 shows the mean scores of the five healthy environment categories. The distribution of the four environmental factor scores in the various categories is shown in Figure 4. Table A8 and Figure A1 present the environmental factor scores and classification results of each region in 2020. In terms of the four environmental factor scores, the five healthy environments are differentiated. Category A is significantly ahead of other regions in economy, followed by Category B, while the others are relatively backward in economic environment, especially Category D and Category E, which have significant economic disadvantages. None of them scored well in the medical environment factor. Relatively speaking, Category B has the greatest advantage in the medical environment, followed by Category C. Categories A, D and E lag behind in the medical environment. In terms of ecological environment factors, Categories B, C and D have comparative advantages, while the other categories have obvious disadvantages, especially Category E, which shows significant ecological disadvantages. In the humanistic environment factor, all regions scored around 0.
Category B and Category E have relative disadvantages.

The heterogeneity of cities in different healthy environment categories indicates that the healthy environment in each region follows its own development pattern and presents distinct characteristics. Category A is the economic leading healthy environment, with strong economic advantages, and its economic environment score is much higher than that of other areas. Although it has certain disadvantages in the medical and ecological environments, its superior economic environment still makes its average total score far higher than that of other categories, with an average value of 0.9917. Category B is a robust healthy environment with a total score of 0.3774, lower than Category A.
A robust healthy environment has great comparative advantages in the environmental factors, especially the economic one. Despite its disadvantage in the humanistic environment, it is the most stable type of healthy environment. Category C is a developmental healthy environment with a mean total score of 0.0324, showing a weak comparative advantage. A developmental healthy environment has slight comparative advantages in the medical, ecological and humanistic environments. It has significant economic disadvantages and still needs to be further improved economically. Compared with the robust healthy environment, the developmental healthy environment lags behind economically. Category D is the economic and medical disadvantageous healthy environment, with an average score of −0.4430. Although it has obvious economic and medical disadvantages, it has certain ecological and humanistic advantages. Category E is a completely disadvantageous healthy environment and has the lowest mean score of −1.1964, with all four environmental factor scores negative and with absolute disadvantages in the ecological and economic environments. Meanwhile, its disadvantages in the medical and humanistic environments are also significant.

Population Health and Healthy Environments
When selecting diseases that represent population health, we select maternal mortality rate (1/100,000) and the mortality rates of Class A and Class B notifiable infectious diseases (1/100,000) rather than typical environment-related diseases such as lung cancer and bronchitis, to study the impact of healthy environment types on public health. Table 4 presents the descriptive statistics of the two diseases under different healthy environment types. Due to the unevenness of the classification results, a large variance difference exists, and the sample did not necessarily meet the homogeneity of variance. Therefore, the Games-Howell method based on variance inequality was used for the comparison between different categories. The Games-Howell multiple comparison results are shown in Table 5. At the 5% significance level, there were significant differences in maternal mortality among different healthy environment categories, mainly reflected in that maternal mortality increased significantly with the decrease in the total healthy environment score. Take the robust healthy environment (Category B) as an example; its maternal mortality rate is 2.683/100,000 higher than that in economic leading healthy environments (Category A), and 6.504/100,000, 11.691/100,000 and 107.388/100,000 lower than the other three types of healthy environment, respectively. Similarly, at the 5% significance level, the mortality rate of Class A and Class B notifiable infectious diseases was significantly different among the five healthy environment categories. Specifically, the mortality rate of Class A and Class B notifiable infectious diseases in the economic leading healthy environment (Category A) is 0.5822/100,000 and 1.3801/100,000 lower than that in Category C and Category D, respectively.
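For readers unfamiliar with the Games-Howell procedure, the pairwise statistic it relies on can be sketched as below. This is a minimal illustration using only the Python standard library; the function name and sample values are hypothetical and are not the data behind Tables 4 and 5. Judging significance additionally requires comparing the absolute statistic multiplied by the square root of 2 with a studentized range critical value for the number of groups and the computed degrees of freedom, which is not done here.

```python
import math
from statistics import mean, variance

def games_howell_pair(sample_a, sample_b):
    """Return (mean difference, t statistic, Welch degrees of freedom) for one pair."""
    n_a, n_b = len(sample_a), len(sample_b)
    se2_a = variance(sample_a) / n_a     # squared standard error of group a's mean
    se2_b = variance(sample_b) / n_b     # squared standard error of group b's mean
    diff = mean(sample_a) - mean(sample_b)
    t_stat = diff / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite degrees of freedom (no equal-variance assumption).
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return diff, t_stat, dof

# Hypothetical mortality-rate samples for two categories with unequal variances.
diff, t_stat, dof = games_howell_pair([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

The unpooled standard error and the Welch degrees of freedom are what make the comparison usable when group sizes and variances differ, as they do across the five categories.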

Discussion
The five categories of healthy environment are economic leading healthy environment, robust healthy environment, developmental healthy environment, economic and medical disadvantageous healthy environment and completely disadvantageous healthy environment. There is a significant imbalance in regional economic development; only a few regions have a large economic advantage. All regions perform well in ecological environment, but the medical and humanistic environment needs to be improved.
The economic leading healthy environments include Beijing, Shanghai and Tianjin. These regions have significant advantages in the economic environment and complete disadvantages in the medical environment, and also have certain disadvantages in the ecological and humanistic environments. Health indicators in these regions are better than in other regions. It should be noted that, due to the limitation of indicators, we only use basic medical facilities to represent the medical environment, and do not take into account indicators that reflect rich local medical resources, such as medical equipment and doctors' qualifications. We still maintain that our results are reasonable, since rich medical resources attract more patients, which means higher medical pressure and squeezes the medical care available to local residents.
Robust healthy environments include Guangdong, Zhejiang, Jiangsu and so on, which have relative economic advantages and no obvious disadvantages in the medical, ecological and humanistic environments. The maternal mortality rate in robust healthy environments was slightly higher than that in the economic leading environments, and there was no significant difference in the mortality rate of Class A and Class B notifiable infectious diseases. Developmental healthy environments are mainly concentrated in Henan, Anhui and so on, with an economic disadvantage and advantages in the medical, ecological and humanistic environments. Their maternal mortality rate was significantly higher than that in the economic leading healthy environments and robust healthy environments, and there was little difference in the mortality rate of Class A and Class B notifiable infectious diseases. The completely disadvantageous healthy environment is mainly in Xinjiang, where the health condition is poor and the mortality rate of diseases is significantly higher than that of other regions. Due to the small sample size, we do not elaborate on it, to avoid errors.
Based on the population health differences among the five healthy environment categories, we believe that the economic environment plays a major role in population health. Public health in areas with a good economic environment is significantly better than that in other areas. A better medical environment and ecological environment can also bring positive effects on population health. To further improve healthy environment levels in China, on the one hand, we should focus on enhancing the economic environment and reducing the gap between regional economies. On the other hand, it is necessary to improve the ecological and humanistic environment to enhance the social environment, so as to improve public health levels.
Additionally, as can be seen in Figure A2, the regional healthy environment categories have gradually developed in a better direction. Shanghai and Beijing have been economic leading healthy environments for a long time, while Qinghai and Tibet have shown greater weakness and basically remain in the same healthy environment category. Regional heterogeneity leads to different patterns of change in healthy environment category across regions. However, overall, economic factors play a significant role. Although economic development may bring about some health problems, our results suggest that higher economic levels still have a positive effect on population health. On the one hand, economic development brings resource flows, including human and material resources. These flows can create an environment favorable to the improvement of population health. On the other hand, residents in areas with higher economic levels generally have higher health literacy. They not only pay more attention to their health than people in economically disadvantaged areas, but also have a stronger ability and willingness to pay.

Conclusions and Policy Implications
The evaluation of the healthy environment should integrate theories of geography, ecology, climatology, social economics and other fields to guide the evaluation process. We put forward the concept of the "healthy environment" and divide 27 environmental indicators into the economic environment, medical environment, ecological environment and humanistic environment to evaluate and classify the healthy environments of 31 provinces and cities in China. Based on the rotated component matrix, seven common factors were grouped into economic, medical, ecological and humanistic environment factors, to give the factors a practical meaning and to obtain the scores of the four environmental factors in each region. We classify the healthy environment into five categories based on the four environmental factor scores. Finally, the maternal mortality rate and the mortality rate of Class A and Class B notifiable infectious diseases are taken as population health indicators to discuss the population health differences among the five healthy environments. Given that current research generally focuses on the level of environmental health risk, it is meaningful to classify the healthy environment and identify its characteristics. The overall healthy environment score decreased successively across the five healthy environments, and public health also declined gradually. Different healthy environment categories show different characteristics. Based on these characteristics, we believe that the economic environment plays a major role in population health.
A good interactive relationship can be established between environment and health, and the understanding of this relationship will help decision-makers to have an insight into the possible consequences of policy implementation. The evaluation method of the healthy environment can provide scientific support for optimizing environmental countermeasures and realizing environmental protection and sustainable development of the economy and society. There are some deficiencies in selecting environment and health indicators. On the one hand, due to the availability of data, we can only take provinces and cities as the evaluation unit. On the other hand, the lack of detailed ecological environmental data such as air quality in the evaluation index may affect the accuracy of the evaluation results. Although there are some imperfections, constructing an indicator system and classifying the healthy environment can still provide a new idea for the study of environmental health.

Declarations: Ethics approval: Not applicable; Consent to participate: Not applicable; Consent for publication: Not applicable.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A
Figure A1. Regional healthy environment categories in 2020.