Can Urban Environmental Problems Be Accurately Identified? A Complaint Text Mining Method

With the popularization of social networks, the volume of unstructured data on environmental complaints is increasing rapidly. This study established a text mining framework for Chinese civil environmental complaints and analyzed the characteristics of such complaints, including keywords, sentiment, and semantic networks, using two-year environmental complaint records from Guangzhou city, China. The results show that the keywords of environmental complaints can be effectively extracted, providing an accurate entry point for solving environmental problems; light pollution complaints are the most negative, and electromagnetic radiation complaints have the most fluctuating emotions, which may be due to the diversity of citizens' perceptions of pollution; the nodes of the semantic network reveal that citizens pay the most attention to pollution sources but the least attention to stakeholders; and the edges of the semantic network show that the relationship between pollution sources and pollution receptors is of greatest concern, while the pollution receptors' relationships with pollution behaviors, sensory features, stakeholders, and individual health are also highlighted by citizens. Thus, environmental pollution management should not only strengthen the control of pollution sources but also pay attention to these characteristics. This study provides an efficient technical method for unstructured data analysis, which may be helpful for precise and smart environmental management.


Introduction
Environmental quality has become a critical factor for improving urban sustainability [1]. In the era of big data, smart cities provide citizens with a better living environment, which has become an emerging model of world city development. Its essence lies in the high integration of informatization and urbanization. With the rapid development of information technology and the increase in citizens' environmental awareness [2], it is more convenient to make environmental pollution complaints with the help of mobile phones and social networks. Citizens are more active in expressing their subjective feelings about environmental pollution. For example, in 2019, China's "12369" environmental protection reporting network management platform received more than 530,000 environmental complaints records from the public, of which Guangdong Province ranked second. Environmental complaint data are unstructured text data, which have different data analysis methods from traditional environmental sensor networks (such as the air also determine more precise countermeasures for the environmental management of smart cities. In this paper, civil environmental complaint records regarding six pollution topics (air, water, noise, waste, electromagnetic radiation, and light) from Guangzhou city are used, and a text mining framework for Chinese environmental pollution complaints is proposed. With this framework, we extract keywords, calculate the complainants' sentiment score, and analyze the characteristics of the semantic network from each class of pollution complaint. These results underline the positive impact of text mining on urban environmental management in both the current and future development of the smart city.

Study Area
Guangzhou city is the capital of Guangdong Province, located in the south of mainland China (Figure 1). It is a regional center city in southern China and one of the core cities of the Guangdong-Hong Kong-Macao Greater Bay Area (Greater Bay Area). Guangzhou city has 11 districts and a total area of 7434.4 km² (2019). At the end of 2019, the resident population of Guangzhou was 15.30 million, and the GDP was RMB 2362.860 billion. According to the list of key polluting firms in Guangzhou city, the number of such firms was 1147, 780, and 713 in 2018, 2019, and 2020, respectively.

Data Collection and Pre-Processing
The two-year data (from 1 March 2018 to 31 March 2020) were retrieved from the website of the Guangzhou Municipal Ecological Environment Bureau (http://sthjj.gz.gov.cn/ztlm/tsjbzx/, accessed on 31 March 2020). The complaint datasets contain the date, complaint ID, district and address, firms, topic of complaint, complaint content, government response, and response date (Table 1). After excluding records with missing geographic information or unidentifiable complaint content, we obtained 5672 valid records.
Appl. Sci. 2021, 11, x 5 of 13
Figure 2 describes the text mining process framework for Chinese environmental complaints. For the sake of content analysis and text mining, we cleaned the collected text data, including removing non-text data (punctuation marks, emoticons, and meaningless symbols), invalid characters (letters and numbers), and meaningless text (function words and pronouns). We removed the meaningless text by using some open-source Chinese stop word dictionaries (e.g., Harbin Institute of Technology (HIT) stop words and Baidu TM stop words). Then, we carried out data processing, including keyword extraction, sentiment analysis, and semantic network analysis.
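The cleaning step described above can be illustrated with a minimal Python sketch. It is not the authors' code: the stop-word set is a hypothetical stand-in for the HIT and Baidu dictionaries, the function names are illustrative, and restricting the text to the CJK range U+4E00 to U+9FFF is an assumed way of dropping punctuation, emoticons, symbols, letters, and numbers in one pass.

```python
import re
from typing import List

# Hypothetical mini stop-word list; the study used open-source dictionaries
# (e.g., HIT and Baidu stop words), which are not reproduced here.
STOP_WORDS = {"的", "了", "我", "我们", "是", "在"}

# Assumption: anything outside the basic CJK ideograph range counts as
# non-text data or an invalid character (punctuation, emoticons, letters, digits).
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")


def clean_text(raw: str) -> str:
    """Replace every run of non-Chinese characters with a single space."""
    return NON_CHINESE.sub(" ", raw).strip()


def remove_stop_words(tokens: List[str]) -> List[str]:
    """Drop stop words and empty tokens from an already segmented record."""
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]
```

Stop-word removal is applied to tokens rather than raw text, so in practice it would run after word segmentation.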

Keyword Extraction
Firstly, we used the Jieba Chinese text segmentation tool (https://github.com/fxsjy/jieba/, accessed on 25 January 2021) to segment the text records into meaningful words. At this stage, synonym substitution and part-of-speech tagging were carried out to avoid the influence of different expressions of synonyms and meaningless function words on subsequent keyword extraction. In addition to the default corpus of the word segmentation tool, a domain dictionary for environmental complaints was established to improve the accuracy of word segmentation. Secondly, the keywords of each type of complaint were extracted based on the TF-IDF method [22], one of the most widely adopted word weighting schemes in text mining. It computes how significant a term t is to a document d by combining two scores (1): term frequency (TF) (2), which is the relative frequency of term t in document d, and inverse document frequency (IDF) (3), which decreases as the number of documents in the corpus containing t increases, regardless of how often t occurs in each of them. A term t is more important for d when its TF is large and it appears in few documents, that is, when its IDF is also large. Words with a high TF-IDF value are therefore more important than other words in the document, so they are the keywords that distinguish the document from others.

$$\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t) \qquad (1)$$

$$\mathrm{TF}(t, d) = \frac{f(t, d)}{|d|} \qquad (2)$$

$$\mathrm{IDF}(t) = \log \frac{|D|}{|\{d \mid t \in d\}|} \qquad (3)$$

where f(t, d) is the number of times term t appears in document d, |d| is the total number of terms in the document, |D| is the total number of documents, and |{d | t ∈ d}| is the number of documents containing the term t.
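As a worked illustration of the TF-IDF weighting, the following Python sketch computes scores for a toy corpus. It makes several assumptions: the records are already segmented (for example by Jieba), each token list stands for one document, and the natural logarithm is used. The function name `tf_idf` and the sample tokens are illustrative and do not come from the study's data.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: TF-IDF score} dictionary per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total_terms = len(doc)
        scores.append({
            # TF is the relative frequency in this document; IDF is the log
            # of (number of documents / documents containing the term).
            term: (count / total_terms) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus: three tokenized "documents" (hypothetical examples).
toy_docs = [
    ["油烟", "气味", "居民"],
    ["污水", "气味", "居民"],
    ["噪音", "居民"],
]
toy_scores = tf_idf(toy_docs)
```

In this toy corpus, a term that occurs in every document (居民) receives a score of zero, while a term unique to one document (油烟) scores highest there, which is the behavior that lets TF-IDF separate topic-specific keywords from common vocabulary.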

Sentiment Analysis
In this study, sentiment analysis was used to identify citizens' sentiment in the six types of environmental complaints. The lack of inter-word spacing, the diversity of expressions, the complexity of grammar, and the variable length of complaint records increase the difficulty of Chinese sentiment analysis.
Firstly, a sentiment dictionary was established, comprising a domain emotion dictionary of environmental complaints and several general Chinese sentiment dictionaries, such as Li Jun's Chinese commendatory and derogatory dictionary of Tsinghua University, the National Taiwan University Sentiment Dictionary (NTUSD), and the Hownet Sentiment Dictionary. The score of positive emotion words ($S_p$) was set to 1, and the score of negative emotion words ($S_n$) was set to −1 (Table 3).

Table 3. Sentiment words and their weights. (Columns: examples of sentiment words; emotion; weight.)

Secondly, according to the Hownet Dictionary, degree adverbs are divided into six levels. Different weights are assigned to each level (Table 4) according to the gradient descent Formula (4) [23], in which $A_w = 3$ is the weight of the "most" level and n is the gradient descent rate. The emotional intensity of an emotional word modified by an adverb is multiplied by the weight of that adverb. Moreover, when inverse words such as scarcely (没有), never (从不), and seldom (很少) modify an emotional word, its value is multiplied by −1.

Table 4. Degree adverbs and their weights. (Columns: level; examples of adverbs (A) and inverse words (N); weight ($A_w$).)

Finally, one complaint record (a compound sentence) is divided into multiple clauses by punctuation, and the sentiment value of each clause ($C_i$) is calculated from the combination of sentiment words (S), adverbs (A), inverse words (N), and punctuation (!/?). Table 5 shows the nine combinations in Chinese grammar. The sentiment value of each complaint record ($S_j$) is then calculated by Formula (5):

$$S_j = \frac{1}{L_j} \sum_{i=1}^{L_j} C_i \qquad (5)$$

where $S_j$ is the sentiment value of the j-th complaint record, $L_j$ is the number of clauses in the j-th complaint record, and $C_i$ is the sentiment value of the i-th clause in the j-th complaint record. Dividing by $L_j$ eliminates the influence of the complaint record's length on the result. The sentiment value ($S_j$) is scaled to the range from −1 to 1. $S_j > 0$ means the sentiment of the complaint is positive; $S_j < 0$ means the sentiment is negative; $S_j = 0$ means the sentiment is neutral.
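The dictionary-based scoring can be sketched in Python as follows. This is a simplified illustration, not the study's implementation: the word lists are tiny hypothetical samples, only the weight of 3 for the "most" level and the three inverse words come from the text, the weight for 较 is an invented placeholder, and the nine grammar combinations of Table 5, the punctuation effects, and the final scaling to the range from −1 to 1 are not reproduced.

```python
from typing import List

# Hypothetical sample dictionaries; polarity is +1 or -1 as in the text.
SENTIMENT = {"满意": 1, "感谢": 1, "难闻": -1, "刺鼻": -1, "扰民": -1}
# Only the weight 3 for the "most" level is stated in the text;
# the weight for 较 is an illustrative placeholder.
DEGREE = {"最": 3.0, "较": 2.0}
INVERSE = {"没有", "从不", "很少"}
CLAUSE_BREAKS = set("，。！？；,.!?;")


def clause_score(tokens: List[str]) -> float:
    """Sum sentiment words, each scaled by the modifiers that precede it."""
    score, modifier = 0.0, 1.0
    for tok in tokens:
        if tok in DEGREE:
            modifier *= DEGREE[tok]      # degree adverb amplifies the next word
        elif tok in INVERSE:
            modifier *= -1.0             # inverse word flips the sign
        elif tok in SENTIMENT:
            score += SENTIMENT[tok] * modifier
            modifier = 1.0               # modifiers apply to one sentiment word
    return score


def record_score(tokens: List[str]) -> float:
    """Split a tokenized record into clauses and average the clause scores."""
    clauses, current = [], []
    for tok in tokens:
        if tok in CLAUSE_BREAKS:
            if current:
                clauses.append(current)
                current = []
        else:
            current.append(tok)
    if current:
        clauses.append(current)
    if not clauses:
        return 0.0
    return sum(clause_score(c) for c in clauses) / len(clauses)
```

Averaging over clauses keeps long records from accumulating larger absolute scores than short ones, which is the role the text assigns to the clause count.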

Semantic Network Analysis
A semantic network consists of nodes (words) and edges (the relationship between words). The node's size (degree) is proportional to the number of words related to it; a thicker edge means a higher co-occurrence frequency or a closer relationship between the words. We used two-mode networks [24], including top and bottom nodes, to analyze the semantic network of each type of complaint. In our two-mode networks, keywords (bottom nodes) were categorized into three clusters (top nodes) based on pollution characteristics, stakeholders, or complainants. Furthermore, the pollution characteristics were categorized into three sub-clusters including pollution sources, pollution behavior, and sensory features; the stakeholders were categorized into two sub-clusters, including firms and administration; and the complainants were categorized into three sub-clusters, including pollution receptor, social life, and individual health. Figure 3 shows the workflow of semantic network analysis. Firstly, keywords were extracted based on the TF-IDF method. Secondly, a word co-occurrence matrix with environmental complaint keywords was constructed, and co-occurrence analysis was performed on them. Finally, the generated semantic network was plotted by Gephi software (version 0.9.2) [25].
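The co-occurrence step of this workflow can be illustrated with a short Python sketch. It assumes that two keywords co-occur when they appear in the same complaint record, which is one common definition and is not confirmed by the text; the function names and sample records are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Set


def cooccurrence(records: Iterable[List[str]], keywords: Set[str]) -> Counter:
    """Count, for each keyword pair, the records in which both words appear."""
    edges = Counter()
    for tokens in records:
        # Sort so that each unordered pair maps to a single dictionary key.
        present = sorted(set(tokens) & keywords)
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges


def degrees(edges: Counter) -> Counter:
    """Node degree: the number of distinct keywords linked to each keyword."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


# Toy records (hypothetical), already segmented into keywords.
toy_records = [["居民", "油烟", "气味"], ["居民", "油烟"], ["居民", "废气"]]
toy_keywords = {"居民", "油烟", "气味", "废气"}
toy_edges = cooccurrence(toy_records, toy_keywords)
toy_degrees = degrees(toy_edges)
```

The edge counts correspond to edge thickness and the degrees to node size in the plotted network; the resulting pairs could be written to an edge list for a tool such as Gephi.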

Keywords of Environmental Complaints
The study used TF-IDF to extract keywords from six types of environmental complaints in order to characterize each type. The higher the TF-IDF value, the more important the word is in that type of environmental complaint. Table 6 shows the top 10 keywords of each type of environmental complaint; the lists reveal both clear differences and similarities among complaint types. The differences among complaint topics are noticeable. In air complaints, the highest TF-IDF values belong to typical words such as lampblack (油烟), exhaust gas (废气), and odor (气味). Among the keywords of water complaints, sewage (污水) ranks first, followed by stench (恶臭), sewer (下水道), and smell (气味). In noise complaints, the most important word is noise (噪音), with sound (声音) and decibel (分贝) also showing high scores. The word with the highest TF-IDF value in waste complaints is waste (垃圾), and the list also includes feature words such as waste cleaning (清理) and ashbin (垃圾桶). The most critical vocabulary in EM radiation complaints consists of converter station (换流站), signal (信号), base station (基站), and EM radiation (电磁辐射). The keywords for light complaints are community (小区) and resident (居民).
In short, these results indicate that keywords can accurately reflect the differences among environmental complaints and provide environmental managers with a scientific basis and accurate entry points for solving environmental problems. Turning to the similarities, the terms resident (居民) and community (小区) appear in all types of complaints, which indicates that residents and their living environment are of great concern in environmental complaints.

The Sentiment of Environmental Complaints
The box plot (Figure 4) shows that the mean (air: −0.11; water: −0.10; noise: −0.10; waste: −0.04; EM radiation: −0.15; light: −0.18) and median (air: −0.09; water: −0.08; noise: −0.08; waste: −0.04; EM radiation: −0.10; light: −0.19) sentiment values of all types of environmental complaints are below zero, which indicates that the complainants' overall sentiment is negative. Comparing the means and medians across complaint types, electromagnetic radiation and light have the lowest values. The sentiment value distribution of electromagnetic radiation is the most scattered (0.30), followed by light (0.23), presumably because of wide differences in individual cognition. There is little difference in the sentiment value distribution of air, water, and noise pollution complaints.

The Semantic Network of Environmental Complaints
As shown in Table 7, we identified the proportion of clusters and sub-clusters in the semantic networks. In terms of network nodes, pollution characteristics form the largest cluster in every network. Except for noise complaints, cluster 3 (complainant) has a higher proportion than cluster 2 (stakeholder). This suggests that complainants pay the most attention to pollution characteristics, especially the sub-cluster pollution source, followed by its impacts. Stakeholders account for the smallest proportion, which may indicate that complainants understand this cluster the least. Citizens' insufficient knowledge of relevant stakeholders, such as polluting firms and administrations, has also led to complaints that cannot be handled well. According to the official statistics of responses to complaints, 1225 complaints (21.60%) are not within the authority of the Ecology Environment Bureau. Moreover, the complaints involved other stakeholders, including the Water Affairs Bureau, the Urban Management Bureau, and the Education Bureau, which reflects the complexity of urban pollution management. Therefore, urban environmental management needs to strengthen coordination among multiple departments.
Figure 5 shows the relationships between the keywords of citizens' environmental complaints. The relationships between pollution receptors and pollution sources (PR-PS) are the most important in environmental complaints, for example, resident-lampblack (居民-油烟) and resident-exhaust gas (居民-废气) in air complaints; resident-sewage (居民-污水) and residential-oil bath (住宅-油池) in water complaints; noise-resident (噪声-居民) and resident-lampblack (居民-油烟) in noise complaints; waste-resident (垃圾-居民) and garbage station-resident (垃圾站-居民) in waste complaints; residential-converter station (住宅-换流站) in electromagnetic radiation complaints; and LED-resident (LED-居民) in light complaints. From the standpoint of the complainant, pollution sources are a primary concern in environmental complaints. The relationships between the above keywords indicate which pollution should be supervised and controlled first.
In addition to the relationship between pollution sources and pollution receptors, which is of greatest concern, other relationships in environmental complaints also deserve the attention of environmental managers, including those between pollution receptors and pollution behavior (PR-PB), pollution receptors and sensory features (PR-SF), and pollution receptors and individual health (PR-HL) (Table 8). As shown in Figure 5, complaints about pollution behavior (PB) mostly concern space and time. The pollution behavior in air and waste complaints emphasizes spatial issues (people-location '人民-选址' and resident-location '居民-选址'), while the pollution behavior in noise and light complaints emphasizes time, such as resident-disturbing (居民-扰民), residential-disturbing (住宅-扰民), and resident-overnight (居民-通宵). The relationship between pollution receptors and sensory features (PR-SF) is more prominent in air and waste complaints, mainly for smell-related terms, such as residential-odor (住宅-气味) and resident-stench (居民-臭味). In EM radiation complaints, the relationship between pollution receptors and individual health (PR-HL) is more prominent. Specifically, citizens are most concerned about the impact of converter stations on safety and health (converter station-physical and mental health, 换流站-身心健康). This suggests that supervisors should provide the public with EM radiation-related knowledge.
The relationship between pollution receptors and pollution behavior (PR-PB) suggests that scientific and integrated site selection is necessary to resolve environmental complaints, including more reasonable site selection for garbage dumps and for power and telecommunication equipment, together with stricter control of construction times. Actions should also be taken to address the problems reflected by sensory features (such as stench, mosquitoes, and rats) and to provide the public with environmental and scientific knowledge, especially regarding EM radiation pollution.

Conclusions
In this study, a framework for the textual analysis of Chinese environmental protection complaints was established, and two years of civil environmental complaint records from Guangzhou city were analyzed using this framework. The conclusions are as follows:
(1) Civil environmental complaint characteristics can be identified. Keywords of various types of environmental complaints can be automatically and effectively extracted by TF-IDF, such as "lampblack" and "exhaust gas" in air pollution and "LED lights" in light pollution, which provides an accurate entry point for solving urban environmental problems. It also provides technical support for smart city environmental management.
(2) The overall sentiment of environmental complaints is negative. Light pollution complaints are the most negative, and EM radiation complaints have the most fluctuating emotions, which may be caused by differences in citizen perception of EM radiation.
(3) The semantic network nodes of the six types of environmental complaints reveal that the public pays the most attention to pollution sources when complaining but the least attention to stakeholders, which may reduce the efficiency of environmental managers in handling complaints.
(4) Besides the Ecology Environment Bureau, stakeholders in environmental complaints involve multiple government departments, including water affairs departments, urban management departments, and other departments. This not only reflects the complexity of environmental pollution but also shows that the issue of environmental complaints is deemed urgent by multiple departments.
(5) The semantic network indicates that pollution sources and pollution receptors receive the most attention. In addition, among different types of complaints, the pollution receptors' relationships with pollution behaviors (site selection, overnight construction), sensory features (stench, dazzle), stakeholders, and individual health are also highlighted by citizens. These relationships suggest that the pollution behavior of pollution sources, sensory features, environmental knowledge of pollution sources, and other details may become a crucial part of pollution management, which will support more accurate management measures and benefit smart urban environmental governance.
For accurate text mining in further research, a rich corpus of environmental complaints must be established, and adaptable Chinese grammar for complaints needs to be summarized. Named-entity recognition could be considered, which will provide assistance in extracting detailed information about pollution incidents in semantic network analysis. Urban environmental management departments must establish a big data analysis system for environmental complaints based on text mining technology. Only in this way can urban environmental issues be effectively managed.
Author Contributions: Y.J. developed the framework for textual analysis and performed the experiments, derived the models, and analyzed the data. Y.L. was involved in part of the code work. Y.J. wrote the manuscript in consultation with C.L. All authors have read and agreed to the published version of the manuscript.