According to the World Health Organization (WHO), mental health is an inseparable aspect of modern human health and includes “subjective well-being, perceived self-efficacy, autonomy, competence, inter-generational dependence, and self-actualization of one’s intellectual and emotional potential, among others” [1].
Mental health has been widely studied in many fields, and genetic factors have become a focus of research because of their causal relationship with mental illness in pathology [2,3,4,5]. However, mental health problems are not caused by mental illness alone. Multiple factors interact, and environmental components play an important role among those that lead to mental health issues.
People are constantly exposed to different types of environments, which can be roughly divided into physical and social environments. These settings can affect an individual’s mental health in a variety of ways. The most studied are the spatial environments related to the neighborhood [6,7], but other environmental factors also have a non-negligible impact. Particulate pollution in the physical environment can cause inflammation of the central nervous system, thereby increasing the chances of mental problems [8,9]; smoking not only fails to relieve stress but also increases the risk of mental illness [10,11]. Green spaces, which refer to vegetation (trees, grass, forests, parks, etc.), can also have a large impact on mental health [12,13], and noise, crowding, and community escape exits in this physical space can likewise affect mental health [14]. Social environmental factors refer to socioeconomic, racial and ethnic, and relational conditions that may influence a person’s ability to cope with stress. A newly acquired social environment is more likely to cause changes in one’s mental state, such as a decrease or increase in social participation and integration in school, the workplace, and the community [15]. These changes have been attributed to poverty [16,17], the working environment, etc. [18,19]. It is difficult to explain the impact of any single type of environment on mental health, or the actual impact of the complex interaction of factors. However, environmental data are available: sensors throughout the city provide a large amount of physical environmental data, and social statistics provide social environmental data. All of these data help us further explore the effect of environmental factors on mental health.
1.2. Urban Data Analysis
As early as 2008, IBM put forward the concept of the “Smart Earth: Next Generation Leadership Agenda” [25]. Smart cities require the use of various information technologies and innovative concepts to improve resource utilization efficiency, optimize city management and services, and achieve corresponding goals. The maturity of information and communication technology enables artificial intelligence to explore the impact of environmental factors on city centers, for example through urban computing [26]. The mature artificial intelligence algorithms used in urban computing provide us with a reference for exploring the impact of the urban environment on mental health from another perspective. Continuing to exploit the advantages of data analysis is therefore a clear trend.
This study aimed to comprehensively analyze the correlation between environmental factors and urban mental health effects. However, traditional methods are ill-suited to large amounts of dynamic data. The prediction of an urban mental state effect involves a large amount of temporal and spatial data. To obtain good predictions, we first needed to control our input. With sufficient data and time, it is fine to use all the input features, including irrelevant ones, to approximate the underlying function between the input and the output. In practice, however, irrelevant features cause two problems in the learning process: they induce greater computational cost, and they may lead to overfitting.
The feature selection problem has been studied by the statistics and machine learning communities for many years. Traditional statistical feature selection is mainly divided into filter methods and wrapper methods [27]. Filter methods are generally used as a preprocessing step. The selection of features is independent of any machine learning algorithm; instead, features are selected based on their scores in various statistical tests of their correlation with the outcome variable. Correlation is a subjective term here. In wrapper methods, a subset of features is used to train a model. Based on the inferences drawn from the previous model, features are added to or removed from the subset. The problem is essentially reduced to a search problem, and these methods are usually computationally very expensive.
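To make the filter idea concrete, the following is a minimal sketch that scores each feature by the absolute value of its Pearson correlation with the outcome and keeps the top k. The function names (`pearson`, `filter_select`), the feature names, and the toy values are hypothetical illustrations, not data or code from this study, and Pearson's coefficient is only one of many possible filter scores.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def filter_select(features, outcome, k):
    """Rank features by |Pearson r| with the outcome and return the top k names.

    No predictive model is trained here, which is what makes this a filter
    method rather than a wrapper method.
    """
    scores = {name: abs(pearson(col, outcome)) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Hypothetical toy data: one outcome and four candidate features.
outcome = [1, 2, 3, 4, 5, 6]
features = {
    "pm25": [2, 4, 6, 8, 10, 12],    # perfectly linear in the outcome
    "noise": [1, 3, 2, 5, 4, 6],     # strongly but not perfectly related
    "constant": [3, 3, 3, 3, 3, 3],  # carries no information
    "shuffled": [4, 1, 5, 2, 6, 3],  # weakly related
}
selected, scores = filter_select(features, outcome, k=2)
```

A wrapper method would instead retrain a model for each candidate subset, which is why it becomes a search problem and is much more expensive.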
In multi-feature selection, traditional feature selection methods are difficult to apply. Due to the special nature of urban mental health state analysis, we could not predict in advance whether the relationship between variables would be linear or nonlinear, nor could we determine which of the many environmental factors would contribute most. This forced us to use a machine learning algorithm with good generality and low computational complexity for feature selection.
The linear model has strong explanatory power, which helps in understanding the relationships among variables for inference, and it is mostly used in the humanities and social sciences. Although mental state was used as a psychological indicator, the mental state associated with urban areas was affected by multiple factors. It was therefore difficult to rely on linear models, which interpret interactions between variables poorly and fit nonlinear relationships weakly.
Although many methods and measures, such as local weighting or regularization [28], have been applied to linear models to improve their applicability, the impact of environmental factors on urban mental health is a complicated issue involving potentially many factors and cannot be explained by a linear model alone. Therefore, we used machine learning methods to study the factors influencing urban centers’ effect on mental health under the multiple characteristics of big data, hoping to determine urban environmental factors worth studying. We mined urban environmental factors through feature selection and then used prediction results to confirm the high correlation between the corresponding environmental factors and mental health.
1.3. Determining the Environmental Factors Affecting Urban Dwellers’ Mental Health
The relationship between variables can be statistically divided into two categories: deterministic functional relationships and statistical relationships [29]. When we explore the relationship between environmental factors and urban effects on mental states, we obviously cannot find a simple deterministic functional relationship. The combined effect of multiple factors is presented through statistical relationships, but the impact of different environmental factors on mental states is something we need to judge before making predictions. In linear models, because relationships between variables are easy to interpret, feature selection also tends to rely on measures of linear association. However, in the face of high-dimensional data, it is difficult to use the prior knowledge of the researcher to select features and then verify them with simple metrics. Therefore, we chose the maximal information coefficient (MIC) [30] as the measurement index. MIC measures the degree of association between two variables, covering both linear and nonlinear relationships, and ranges from 0 to 1. Compared with commonly used measures of correlation, such as Pearson’s correlation coefficient [31], distance correlation [32], etc., MIC has the advantage of generality [30]. “Generality” means that when the sample size is large enough to include most of the information of the sample, MIC can capture a wide variety of interesting associations and is not limited to specific function types such as linear, exponential, or periodic functions.
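The contrast between MIC and Pearson's coefficient can be illustrated with a simplified sketch. The code below is not the full MIC algorithm of [30]: it assumes equal-frequency binning on both axes rather than the optimized grid partitions used by the original method, and the grid budget exponent of 0.6 is the default commonly quoted for MIC, taken here as an assumption. The names `approx_mic`, `equal_frequency_bins`, and `mutual_information` are hypothetical.

```python
import math
from collections import Counter

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index so that bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(rank * n_bins // len(values), n_bins - 1)
    return bins

def mutual_information(bx, by):
    """Mutual information (in nats) between two sequences of bin labels."""
    n = len(bx)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def approx_mic(x, y, exponent=0.6):
    """Simplified MIC-style score: the largest normalized mutual information
    over all nx-by-ny equal-frequency grids with nx * ny <= n ** exponent."""
    budget = len(x) ** exponent
    best = 0.0
    nx = 2
    while nx * 2 <= budget:
        ny = 2
        while nx * ny <= budget:
            mi = mutual_information(equal_frequency_bins(x, nx),
                                    equal_frequency_bins(y, ny))
            best = max(best, mi / math.log(min(nx, ny)))
            ny += 1
        nx += 1
    return min(best, 1.0)  # guard against floating-point overshoot

# A symmetric parabola: strongly dependent, yet linearly uncorrelated.
x = [i / 100 for i in range(-100, 101)]
y_parabola = [v * v for v in x]
y_linear = [2 * v + 1 for v in x]
```

On this toy data, `pearson(x, y_parabola)` is essentially zero while `approx_mic(x, y_parabola)` is close to 1, and both measures are close to 1 for `y_linear`. This is the sense in which a MIC-style score is not tied to a specific function type.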
In addition to the linear model mentioned in the previous section, support vector machines are suitable for multi-factor modeling analysis because of their ability to model nonlinear relationships, their good generalization performance, and their ability to reach a unique global optimum [33]. Artificial neural networks deal with complex problems by simulating the interaction process of neurons [34]. Therefore, by building models at three different levels, we could further test whether the selected environmental features reflected changes in urban dwellers’ mental health. When the predicted values of the three models were far from the true values, we needed to adjust the selected features accordingly; when the models achieved good prediction results, this further confirmed the reliability of the environmental factors we selected.
In studies that use models to explore the relationship between the environment and mental health, the purpose is usually to analyze the factors with significant influence in order to determine causal relationships. In such research, certain factors are often classified, and cross-sectional studies or multi-level analyses are used to confirm research hypotheses in a particular scenario. Taking neighborhood environmental analysis as an example, research on the impact of the neighborhood environment on population health has existed for a long time, and the number of studies grew exponentially in the initial period [7]. Measuring the physical and social environment to explore the impact on the community’s mental health is important. Depression has been the mental health outcome most commonly studied in relation to neighborhood characteristics.
Measuring mental health has often been regarded as more difficult than measuring other types of health. This is due in part to the limited availability of objective biological tests and the variable diagnostic guidelines offered by psychiatry, alongside intercultural differences in mental health experiences and complex social and psychological confounders. However, it is both possible and desirable to measure mental health outcomes in environmental research. This is how the mental health impact of urban planning and design can be demonstrated and understood.
With the emergence of large collections of urban and personal data, there are multiple methods to measure mental health. Support vector regression is the most frequently used prediction method. Linear regression also appears in some papers. In addition, with the spread of machine learning, deep learning has become a trend, including straightforward deep neural networks and recurrent neural networks. Furthermore, research on measuring mental health through the environment is also emerging, such as measuring mental health through the built environment [35] and predicting the impact of economic factors on mental health [36]. Therefore, we believe that after the environmental factors have been screened out, predictions through the above three models can reflect their correlation.
The promise of big data has led some in the scientific community to lazily replace causality with correlation. For the urban mental state, a correlation obtained through modeling cannot directly explain the role of environmental factors, especially when various environmental factors are interwoven. However, big data modeling does allow us to establish relevance, which is equally important for discovering unknown causality. Therefore, this research focuses on establishing a starting point for future research: obtaining a series of environmental features through feature selection and determining the credibility of those features by comparing the predictive modeling results with the real situation.
Smarter London Together [37] is a plan initiated by the Greater London Authority to meet the needs and challenges of urban residents, workers, and travelers in the digital age. The London Datastore is an internationally recognized open data resource with over 700 data sets, and it provided data support for our collection of London environmental data. In addition, London’s data on the environment, traffic, safety, etc., cover the entire Greater London area, allowing us to analyze the mental health status of Greater London using the borough as the unit and to identify key factors affecting the overall environment of Greater London.