Evaluation of Environmental Information Disclosure of Listed Companies in China’s Heavy Pollution Industries: A Text Mining-Based Methodology

Abstract: Environmental information disclosure (EID) of listed companies is a significant and essential reference for assessing their environmental protection commitment. However, the content and form of EID are complex, and previous assessment studies mainly relied on manual scoring by experts in the field, which is subjective and slow. Therefore, this paper proposes an automatic evaluation framework for EID quality based on text mining (TM), comprising the construction of the EID index system, automatic scoring of EID quality, and EID index calculation. Case studies are then conducted on the EID of 801 listed companies in China's heavy pollution industries from 2013 to 2017. The results show that the overall quality of the EID of listed companies in China's heavily polluting industries is low and that quality differs markedly among the 16 industries. Compared with the subjective manual scoring method, TM evaluation can evaluate the quality of EID more effectively and accurately. It has great potential and can become an essential tool for the sustainable development of society and listed companies.


Introduction
Sustainable development is a balance between economic growth, environmental issues, and social conditions [1], and it is mainly promoted by enterprises and local governments [2]. With the rapid development of the global economy, environmental information disclosure (EID) has aroused widespread concern and promoted the sustainable development of the environment [3].
According to the industry-driven growth model, the continuous energy consumption of manufacturing and infrastructure investment negatively impacts the environment [4]. EID is an efficient method for promoting the standardization of corporate ecological behavior and a fundamental approach for all social sectors to understand and evaluate corporate environmental behavior [5]. To our knowledge, EID is the third environmental regulatory mode, after command-and-control and market-based environmental regulation [6]. The role of EID in environmental protection has received increased attention across several disciplines in recent years [7]. Therefore, corporate EID quality significantly affects government policymaking and public behavior [8].
Additionally, the quality of corporate EID can affect enterprises' performance in the capital market and their value by changing their social image [9]. EID from enterprises is an efficient and vital method of informing the public. It allows more people to understand the environmental behavior and sustainable development of enterprises [10]. By reviewing the environmental information from enterprises, governmental environmental protection departments can better understand the overall situation of their environmental performance and assess their environmental contribution [11]. Evaluating the quality of environmental information disclosure can also allow the public and investors to determine which enterprises to invest in [12].
There are many methods for evaluating the quality of EID. Previous studies explored content analysis, expert opinion [13], qualitative disclosure [14], and quantitative disclosure methods [15]. The index proposed by Clarkson et al. has been used as an evaluation benchmark in assessing the quality of corporate EID [16]. However, the recent increase in the number of listed companies has increased the amount of information they disclose, leading to an upsurge in the EID evaluation workload [17]. The rationality and reliability of traditional analysis methods are continuously challenged because of the working time and cost they require. Therefore, it is difficult for the results to objectively reflect the quality of corporate EID and disclosure motivation [18]. Evaluating the quality of a listed company's EID objectively and efficiently is challenging [19].
With the rapid development of artificial intelligence, text information mining has been successfully applied in text analysis, improving the measurement accuracy of existing indicators and measuring the disclosure of text content more accurately and comprehensively [20]. To the best of our knowledge, few previous studies have used text mining (TM) to assess and comprehensively evaluate enterprise EID quality based on reports from listed companies. This study proposes an automatic EID quality evaluation framework based on the TM technique, and a flow chart of the analysis procedure is provided in Figure 1.

The proposed framework involves constructing an EID index system, automatically scoring the quality of EID, and calculating the EID index. We collected EID report text from 801 listed companies in China's heavily polluting industries from 2013 to 2017 and used K-means clustering based on the LDA topic model to mine useful text information and enrich the EID quality evaluation method. The remainder of this paper is structured as follows: Section 2 provides an overview of related research. Section 3 constructs the evaluation index system for enterprise EID and describes the TM-based quality scoring of EID. In Section 4, we present an empirical case study of 801 listed companies in heavily polluting industries, whose environmental data were taken from, among others, the Shanghai and Shenzhen stock markets from 2013 to 2017. Finally, Section 5 presents the research conclusions and proposes future work.

Literature Review
Sustainability has emerged as an important issue for firms in recent years. EID from listed companies is an essential reference for assessing their environmental protection commitment. Content analysis is the most common methodology used in EID index system studies to analyze economic, social, and environmental details. In 1982, Wiseman et al. added financial indicators of the environmental accounting content to the EID index system. Many scholars have accepted this index because it used quantitative economic information to supplement the previous procedure [21]. As global environmental issues have attracted more attention, Al-Tuwaijri et al. introduced environmental regulatory factors to the EID index system [22,23]. According to the literature, scholars then shifted their focus to environmental improvement. In this stage, the EID index was mainly divided into qualitative and quantitative components. Villiers and Staden added qualitative content to the index [24], and Cho and Patten divided the EID indicators into financial and non-financial indicators for analysis [25].
With the promulgation of the "Sustainable Development Report" (GRI), the quality and quantity of corporate EID must be comprehensively measured. Thus, the EID index system has become more comprehensive and has been adopted by scholars. Halme and Huse used a 0-1 index to code corporate EID status. Neu et al. and Patten used words, sentences, and EID length to calculate the index and quantitative index scoring [26,27]. Aerts assigned different weights and scores to different information items based on their quality in the Corporate Environmental Disclosure Report [28]. Hasseldine combined the quality and quantity index scores to study corporate EID and environmental reputation [29]. Beck et al. proposed an environmental disclosure index consisting of coverage, quality, and quantity [30]. Rupley et al. used a more complex index scoring approach to evaluate the relationship between corporate governance, media, and EID [31].
However, the absence of a regulatory framework for the substance of EID has resulted in largely subjective disclosure, and standards have been inconsistent. At present, there is no formal documentation on the quality assessment of EID, and there is no standard for the establishment and selection of evaluation indicators and methods. EID has been used as a variable to analyze its influencing factors and find ways to enhance EID quality [32]. However, the reliability of the EID output significantly affects the outcomes of the studies mentioned above. To reinforce oversight, improve the efficiency of EID, and provide a base for further research, it is essential to develop a scientific and realistic evaluation framework.
However, as Grey and Milne suggested, there is no superior research approach, method, or technique [33]. Within business studies, content analysis is the most common technique used to analyze economic, social, and environmental information. Neuman stated that content analysis is "a technique for gathering and analyzing the content of the text. The content refers to words, meanings, pictures, symbols, ideas, themes or any message that can be communicated" [34]. Content is coded into various categories or concepts depending on selected criteria, and coding can be performed manually or using computer-aided technologies.
TM has therefore been widely used in the field of information analysis. Text mining is a computer-assisted technique capable of extracting information and trends from large amounts of textual data; as Feldman and Sanger describe, it gives an overview of the main issues discussed in the reports [35]. It is based on data mining techniques, machine learning, and natural language processing applied to text. Because it relies strongly on computer programs and algorithms, it is expected to overcome the reliability problems associated with content analysis [36].
Recently, TM has also been used to analyze trends and patterns in sustainability reports [37] and determine which aspects are the most public in company reports [38].
Sustainability 2021, 13, 5415
Yang and Lee proposed a TM method based on organization mapping to extract image semantics from an environmental text [39]. Modapothala and Issac used Bayesian estimation to assess the relationships between ecological and social performance indicators and listed companies [40]. Huang et al. used TM to analyze financial disclosure in analysts' earnings forecasts [41]. Riffe et al. investigated the negative externalities of corporate emergencies' widespread media coverage using text analysis [42]. Bonzanini and Marco used Python to capture data from new blogs to analyze public environmental services [43]. Lee et al. studied trends in spatial environmental information using quantitative analysis and TM techniques [44], and Maeda et al. used TM to determine which managerial regulations could serve as incentives to motivate enterprises to consider the environment [45]. Irina et al. used text analysis to quantitatively analyze research trends in the environmental field [46]. Park and Kremer studied the classification of ecological sustainability indicators based on a TM approach [47]. Rabiei used TM to elucidate knowledge gaps and priorities in Iranian environmental science [48], while Villeneuve et al. explored indoor environmental quality based on Airbnb customer reviews using TM [49]. Wang et al. assessed the environmental performance in tourist areas and used TM of online news to explore the influencing factors [50].
Although text mining technology is widely used, the current EID index system and research perspective are limited to subjective manual judgment. Machine learning is rarely used, and there is a lack of research that applies text similarity, text clustering, and semantic text analysis to evaluate EID. Therefore, it is difficult for the study results on corporate EID to fully reflect quality and disclosure motivation. Although some researchers have used TM techniques for environmental information quality assessment, there is much room for refinement. Therefore, in this study, we attempted to construct a comprehensive EID evaluation system based on TM techniques and employed machine scoring to improve the evaluation of listed companies' EID quality.

Methodology
The methodology for this study involved three stages. The first stage constructed the EID index system within the automatic EID quality evaluation framework. The second stage computed the indicator weights and automatically scored EID quality. The third stage calculated the EID index. The framework is presented in Figure 2.
Figure 2. Automatic EID quality evaluation framework.
Before text mining, we need to obtain the text data of listed companies' environmental information disclosure in the heavy pollution industry. The Chinese text data are acquired with crawler code that crawls the environmental information disclosure text.

Handling Chinese Encoding Issues
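Chinese source files are often stored in legacy encodings such as GBK, which Python does not decode correctly unless the encoding is specified. Following common Python practice for Chinese text preprocessing, we decode such sources explicitly (e.g., from GBK) and store the data in UTF-8.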
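As a minimal sketch of this encoding step, assuming a disclosure report arrives as GBK-encoded bytes, the text can be decoded and stored as UTF-8. The function name `to_utf8` and the sample string are illustrative assumptions, not part of the study's code.

```python
def to_utf8(raw: bytes, source_encoding: str = "gbk") -> bytes:
    """Decode bytes from a legacy Chinese encoding and re-encode them as UTF-8."""
    # errors="replace" keeps processing going when a byte sequence is invalid.
    text = raw.decode(source_encoding, errors="replace")
    return text.encode("utf-8")


# Hypothetical sample: "environmental information disclosure" in Chinese, stored as GBK.
gbk_bytes = "环境信息披露".encode("gbk")
utf8_bytes = to_utf8(gbk_bytes)
print(utf8_bytes.decode("utf-8"))
```

Passing the encoding explicitly avoids relying on the platform default, which differs between operating systems.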

Chinese Word Separation
Many Chinese word segmentation tools are in common use. This study uses Jieba, which can be installed for Python with "pip install jieba".

Introduction of Stop Words
Chinese text contains many uninformative words, as well as punctuation marks, that we do not want to introduce into the analysis; these are therefore treated as stop words and removed.
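The stop-word step can be sketched as follows, assuming the text has already been segmented into tokens. The token list and the tiny stop-word set are hypothetical examples; a real run would load a full Chinese stop-word list.

```python
def remove_stop_words(tokens, stop_words):
    """Drop stop words, empty tokens, and punctuation-only tokens."""
    kept = []
    for token in tokens:
        token = token.strip()
        if not token or token in stop_words:
            continue
        # Skip tokens that contain no letter, digit, or CJK character.
        if not any(ch.isalnum() for ch in token):
            continue
        kept.append(token)
    return kept


# Hypothetical segmented sentence and a very small stop-word set.
tokens = ["公司", "的", "废水", "排放", "，", "已", "达标", "。"]
stop_words = {"的", "已", "了", "和"}
print(remove_stop_words(tokens, stop_words))
```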

EID Quality System Construction
The purpose of assessing EID quality is to determine the reliability, comprehensiveness, and legal and regulatory compliance of listed companies' disclosures. This study used K-means clustering based on the LDA topic model to construct the EID index system, mining objective evaluation information through EID topic clustering analysis.
A flowchart of the construction of the EID index system is presented in Figure 3. We accessed the EID text data through crawling and keyword searching. In the text preprocessing stage, we applied manual screening, text segmentation, and stop-word removal. The co-occurrence word matrix was then obtained based on the high-frequency words. Finally, we used K-means clustering based on the LDA topic model to cluster the EID topics based on the co-occurrence word matrix and processing steps.
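One way to build such a co-occurrence word matrix is sketched below, assuming that two keywords co-occur when they appear in the same document. The counting unit, the function name, and the English stand-in tokens are assumptions made for illustration.

```python
from itertools import combinations


def cooccurrence_matrix(documents, keywords):
    """Count, for each keyword pair, the number of documents containing both."""
    index = {word: i for i, word in enumerate(keywords)}
    size = len(keywords)
    matrix = [[0] * size for _ in range(size)]
    for tokens in documents:
        present = sorted({index[t] for t in tokens if t in index})
        for i in present:
            matrix[i][i] += 1  # diagonal: document frequency of the keyword
        for i, j in combinations(present, 2):
            matrix[i][j] += 1
            matrix[j][i] += 1  # keep the matrix symmetric
    return matrix


# Hypothetical, already segmented documents (English stand-ins for Chinese tokens).
docs = [
    ["wastewater", "discharge", "energy"],
    ["wastewater", "discharge"],
    ["energy", "consumption"],
]
keywords = ["wastewater", "discharge", "energy", "consumption"]
for row in cooccurrence_matrix(docs, keywords):
    print(row)
```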
Prihatini et al. used K-means clustering through LDA based on this similarity matrix and found good clustering results for this method [51]. Alhawarat et al. clustered an Arabic collection by combining the LDA topic model and K-means clustering [52]; the results confirmed that this method could significantly improve the quality of clustering. Bui et al. found that combining the LDA topic model and K-means clustering worked well for clustering [53]. In summary, numerous studies have shown that K-means clustering based on LDA topic models is better than traditional K-means clustering methods. Therefore, this paper uses the K-means clustering method based on the LDA topic model to improve the text mining of samples and obtain a more accurate hierarchical topic classification.
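The clustering idea can be illustrated with a small sketch in which each document is represented by its topic distribution and plain K-means groups these vectors. The vectors below are hypothetical values, not LDA output from the study, and seeding the centroids with the first k points is a simplification chosen to keep the example deterministic.

```python
def kmeans(points, k, iterations=50):
    """Plain Lloyd's K-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Hypothetical document-topic distributions over three topics.
doc_topic = [
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
    [0.05, 0.05, 0.90],
    [0.85, 0.10, 0.05],
    [0.10, 0.85, 0.05],
    [0.10, 0.10, 0.80],
]
labels, centroids = kmeans(doc_topic, k=3)
print(labels)
```

In practice a library implementation with multiple random restarts would be preferable; the sketch only shows why topic distributions are convenient clustering features.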
The LDA topic model is used to obtain a complete analysis of text collections and achieves better text mining results; the larger the text collection, the larger the number of topics. LDA-based topic evolution analysis involves evaluating the model's generalization ability, that is, its predictive power for unobserved data. We use the accepted metric of perplexity to measure this ability: the smaller the perplexity, the better the model generalizes. Because the perplexity varies with the number of topics, the optimal number of topics can be determined by calculating the model's perplexity for different numbers of topics. The formula is shown in Equation (1) as follows:
$$\mathrm{perplexity}(D) = \exp\left(-\frac{\sum_{d=1}^{M} \log p(w_d)}{\sum_{d=1}^{M} N_d}\right) \quad (1)$$
where $D$ denotes the text collection of $M$ documents, $N_d$ denotes the number of feature words in document $d$, and $p(w_d)$ is the probability of generating document $d$, which is calculated in Equation (2) as follows:
$$p(w_d) = \prod_{w \in d} \sum_{i=1}^{K} P(z_i \mid d)\, P(w \mid z_i) \quad (2)$$
where $K$ is the number of topics, $P(z_i \mid d)$ denotes the probability of topic $i$ in text $d$, and $P(w \mid z_i)$ denotes the probability of word $w$ under topic $z_i$.
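A minimal sketch of the perplexity calculation behind Equations (1) and (2) is given below, assuming the standard form: the exponential of the negative total log-likelihood divided by the total number of words, with each word's probability obtained by summing over topics. The probability tables are hypothetical toy values rather than fitted LDA parameters.

```python
import math


def perplexity(docs, doc_topic, topic_word):
    """Perplexity = exp(-sum_d log p(w_d) / sum_d N_d)."""
    log_likelihood = 0.0
    total_words = 0
    for d, words in enumerate(docs):
        for w in words:
            # p(w | d) = sum_i P(z_i | d) * P(w | z_i)
            p_w = sum(p_z * topic_word[i][w] for i, p_z in enumerate(doc_topic[d]))
            log_likelihood += math.log(p_w)
        total_words += len(words)
    return math.exp(-log_likelihood / total_words)


# Hypothetical two-topic model over a three-word vocabulary.
topic_word = [
    {"energy": 0.7, "waste": 0.2, "tax": 0.1},
    {"energy": 0.1, "waste": 0.3, "tax": 0.6},
]
doc_topic = [[0.8, 0.2], [0.3, 0.7]]
docs = [["energy", "energy", "waste"], ["tax", "waste"]]
print(perplexity(docs, doc_topic, topic_word))
```

Repeating this calculation for models trained with different numbers of topics gives the curve from which the number of topics is chosen.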

EID Index Layer Weight Calculation
The EID index calculation was based on the EID index system and consisted of keyword expansion and word similarity calculation. We first used the EID index system (the essential seed word list) and the EID corpus to obtain an extended word table based on the Word2Vec model. The text similarity of symbolic words was then calculated using term frequency-inverse document frequency (TF-IDF), the most widely used weight calculation method in text processing [54]. The TF-IDF model and the computed text similarity gave the weight of each EID index layer.

Keyword Expansion
We obtained the critical seed word list for keyword expansion based on the EID evaluation index system. We combined it with the EID corpus to obtain a hybrid corpus. We then used the Word2Vec model to train extended words and obtained a comprehensive wordlist based on the hybrid corpus. The Word2Vec model was prepared as follows: First, we converted the training data into the format of the Word2Vec model. In our work, we first input large-scale text corpus to Word2Vec to produce word vectors. Using the structure of the Word2Vec, we can train and expand more exclusive vocabulary in the field of environmental information.
Second, the skip-gram algorithm expanded the key seed words based on the complete environmental information corpus. Word2Vec generates word vectors by feeding the text corpus into a learning model. It is not a single algorithm but includes two learning models: the continuous bag of words (CBOW) and skip-gram [55]. The skip-gram model's training objective is to find word representations that are useful for predicting the surrounding words in a sentence or a document; that is, skip-gram predicts the context given the word [56]. The vectors of all expanded words were obtained as the output after training. Finally, for each index layer, the 15 expanded words whose vectors were closest to the key seed word vector were selected according to the EID index system to form an expanded vocabulary.
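The final selection step, picking the words closest to a seed word, can be sketched with cosine similarity over word vectors. Cosine similarity is assumed here as the closeness measure, and the two-dimensional vectors are hypothetical stand-ins for trained Word2Vec vectors, which typically have hundreds of dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_seed(seed, vectors, top_n=15):
    """Return the top_n words whose vectors are closest to the seed word's vector."""
    scored = [
        (word, cosine(vectors[seed], vec))
        for word, vec in vectors.items()
        if word != seed
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:top_n]]


# Hypothetical vectors for illustration only.
vectors = {
    "wastewater": [1.0, 0.1],
    "sewage": [0.9, 0.2],
    "effluent": [0.8, 0.3],
    "award": [0.1, 1.0],
}
print(expand_seed("wastewater", vectors, top_n=2))
```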

Extended Word Similarity Calculation Based on the TF-IDF Model
TF-IDF is used to choose an optimum indexing vocabulary for a collection of documents, and typical evaluation results demonstrate the usefulness of the model [57]. It is a keyword extraction method based on the bag-of-words algorithm and is widely used in TM to evaluate a word's importance to a text (Figure 4) [58]. This study used the TF-IDF model to extract keywords from the studied text [59]. The similarity of the expanded words was then calculated according to the TF-IDF model, and the weight of each index layer was obtained [60].
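As a small illustration of TF-IDF weighting, the sketch below uses relative term frequency multiplied by an unsmoothed log(N / df). This is one common variant and is an assumption here; the source does not state which variant the study used, and the documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    n_docs = len(documents)
    # Document frequency: number of documents in which each word appears.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Hypothetical segmented documents.
documents = [["energy", "energy", "waste"], ["waste", "tax"]]
weights = tf_idf(documents)
print(weights)
```

A word that appears in every document, such as "waste" above, receives zero weight, which is why TF-IDF favors terms that distinguish one report from another.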

Automatic Scoring of EID Quality
In this study, Python was used to read the text. The evaluation criteria of the companies' EID index system were loaded into the system, the model was read, and the corresponding input strings were obtained. Platform plurality was also used to train the word vectors and obtain an expanded EID corpus. We used Python packages in the text mining process, e.g., Wordcount, Jieba, and Gensim, to clean and filter these reports, extract content related to environmental information, and form sample files after multiple rounds of compounding. Further, we performed text preprocessing, word segmentation, and word frequency statistics on the sample files and used the Word2Vec model in Gensim to train word vectors.
Interactive technology was used to observe the expansion of the keyword language. There have been various previous works on keyword extraction, primarily on different text domains. TF-IDF-based selection has been widely used; it is computationally efficient and performs reasonably well [61]. Keyword extraction has also been treated as a supervised learning problem, in which a classifier labels candidate words as positive or negative instances using a set of features [62]. The Word2Vec model was used to improve the corpus' semantic coverage and its recognition of context. Finally, the core process of the EID quality scoring system searched for keywords or keyword combinations in the company's environmental information disclosure report, combined the matches found in the vocabulary window with the weight values of the EID index system, and then evaluated and scored the EID quality of the listed companies, as shown in Figure 5.
The EID quality index was calculated based on the variable setting method and the EID index system. The parameter values were binary, set to 0 or 1 [63]: if a keyword with the corresponding meaning appeared in the vocabulary window of the EID text, that is, if the text contained content consistent with the sub-variable, the sub-variable parameter was "1"; otherwise, it was "0." The EID scoring was constructed as shown in Equations (3)-(6).
Firstly, we put the main variables and sub-variables into the EID index system table.
Secondly, we tabulated the sub-variables belonging to the same main variable through text mining and Equations (3) and (4).
Thirdly, we calculated the EID score of the environmental information disclosure under evaluation by Equations (5) and (6); it is equal to the sum of all main variables.
Here, i is the main variable, i = 1, 2, 3, ..., m, and j is the sub-variable, j = 1, 2, ..., n. The number of sub-variables in each main variable is unlimited.
Therefore, the value of the EID score reflects different levels of environmental information disclosure quality consistency. To give the same weight to all sub-variables, it is necessary to use a binary system. The binary system (0,1) helps to maintain a balance among all variables [64]. We supposed a keyword with a corresponding meaning appeared in the vocabulary window of the EID text and verified that the content is consistent with the sub variable in the text.
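The binary scoring logic can be sketched as follows. This is an illustrative reading of the description, not a reproduction of Equations (3)-(6): a sub-variable is set to 1 when all of its keywords appear within one token window, each main variable is the sum of its sub-variables, and the EID score is the sum of the main variables. The index system, keyword lists, window size, and tokens are all hypothetical, and index weights are omitted.

```python
def score_report(tokens, index_system, window=5):
    """Return per-main-variable scores and the total EID score for one report."""
    scores = {}
    for main_var, sub_vars in index_system.items():
        main_score = 0
        for keywords in sub_vars.values():
            # Sub-variable = 1 if every keyword occurs inside some window of tokens.
            hit = any(
                all(k in tokens[start:start + window] for k in keywords)
                for start in range(len(tokens))
            )
            main_score += 1 if hit else 0
        scores[main_var] = main_score
    return scores, sum(scores.values())


# Hypothetical two-level index system: main variable -> sub-variable -> keywords.
index_system = {
    "pollution discharge": {
        "wastewater": ["wastewater", "discharge"],
        "exhaust": ["exhaust", "emission"],
    },
    "environmental tax relief": {
        "tax relief": ["tax", "relief"],
    },
}
tokens = ["company", "wastewater", "discharge", "meets", "standard",
          "and", "received", "tax", "relief"]
scores, total = score_report(tokens, index_system)
print(scores, total)
```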

Sample and Data Source
The study focuses on listed firms in heavy pollution industries in China.

Keyword Extraction
In this study, Python and the third-party Jieba library were used for word segmentation, frequency statistics, and building an EID corpus. First, the sample report dataset was cleaned and screened, and the content related to environmental information was extracted as the sample file, which provides complete information from the environmental perspective.
Text segmentation, stop word deletion, and word frequency statistics were then conducted on the sample files, and 18,691 high-frequency words in the field of environmental information were obtained. According to the literature [65], the collected glossary was sent to five experts in relevant fields, who judged and scored whether each word was appropriate to environmental information. After three rounds of screening, the exclusive vocabulary for the field of environmental information was obtained. Finally, the words were sorted in descending order of frequency, and the top 60 keywords were selected, as shown in Table 1.
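The frequency statistics and top-60 selection can be sketched with the standard library, assuming the reports have already been segmented and filtered. The token lists below are hypothetical, and the expert screening rounds are not modeled.

```python
from collections import Counter


def top_keywords(token_lists, n=60):
    """Count word frequencies across documents and return the n most frequent."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    # most_common sorts by descending frequency.
    return counts.most_common(n)


# Hypothetical segmented sample files.
token_lists = [["energy", "waste", "energy"], ["waste", "energy", "tax"]]
print(top_keywords(token_lists, n=2))
```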

Topic Clustering Based on K-Means Clustering Based on LDA Topic Model
According to the keyword co-occurrence matrix, the keywords were divided into different categories. Whether the high-frequency keyword threshold is reasonable significantly affects the co-word analysis results. This article adopts the Pareto's law selection method to set the threshold, as shown in Table 3. Based on the 60 high-frequency words, a threshold value of 20 was established, and the keywords in the co-word matrix were clustered using K-means clustering based on the LDA topic model. The threshold was appropriately defined.
We calculated the model's perplexity under each number of topics and plotted the line graph in Figure 6. The perplexity curve shows a clear inflection point when the number of topics is nine, after which the curve flattens out, indicating that further increasing the number of topics does not significantly reduce the perplexity. Therefore, the optimal number of topics is determined to be nine.
The document-topic distribution output by the LDA topic model is used as the samples, and the number of categories is set to nine. K-means clustering is then performed. As shown in Figure 7, the samples are well divided into nine categories, and each category has a clear demarcation line. Table 4 shows the text topic clustering results based on K-means clustering with the LDA topic model.
Cluster1 contained keywords such as policies, regulations, and departments; therefore, it could be summarized as "corporate governance structure"; Cluster2 included consumption and energy; thus, this category was classified as "energy consumption environmental liability information"; Cluster3 contained wastewater and discharge, and was classified as "environmental pollution discharge information"; Cluster4 had waste, and was classified as "waste disposal"; Cluster5 included R&D and environmental protection, and was classified as "environmental governance expenditure"; Cluster6 included ecological protection and fines, and was classified as "environmental penalty expenses"; Cluster7 contained sewage discharge and greening as keywords, and was classified as "environmental greening and sewage discharge expenditure information"; Cluster8 included tax and relief, and was classified as "environmental tax relief"; finally, Cluster9 included awards and revenue as keywords, and the category was summarized as "income from environmental incentives."  Table 5 presents the K-means clustering based on the LDA topic model-based EID index system. According to the clustering results, a similar EID index system hierarchy was constructed from the EID index system with 9 subject-level structures and 24 index   Table 5 presents the K-means clustering based on the LDA topic model-based EID index system. According to the clustering results, a similar EID index system hierarchy was constructed from the EID index system with 9 subject-level structures and 24 index levels. The companies' environmental disclosure index was measured based on this system, which could be used to assess whether the enterprise EID quality was consistent with regulatory agencies' requirements. Shows the specific energy consumption of the company. The company's total energy consumption can directly reflect its actual performance in energy conservation. 
Relevant information can be used to assess the managers' expectations for environmental protection contributions
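The two steps described above, choosing the number of topics from the perplexity curve and then clustering the document-topic vectors with K-means, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the perplexity values and document-topic rows are made-up toy data, the inflection point is approximated by the largest second difference of the curve, and the K-means initialization simply takes the first k rows. It is not the implementation used in this study.

```python
import math

def elbow_topic_count(perplexity):
    """Pick the topic count where the perplexity curve bends most sharply.

    `perplexity` maps a topic count to a perplexity value. The bend is
    measured by the second difference, which assumes evenly spaced counts.
    """
    ks = sorted(perplexity)
    best_k, best_bend = None, -math.inf
    for prev, cur, nxt in zip(ks, ks[1:], ks[2:]):
        bend = perplexity[prev] - 2 * perplexity[cur] + perplexity[nxt]
        if bend > best_bend:
            best_k, best_bend = cur, bend
    return best_k

def kmeans(points, k, iterations=20):
    """Plain K-means; returns one cluster label per point."""
    centers = [list(p) for p in points[:k]]  # naive deterministic init (assumption)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy perplexity curve (illustrative values only, not results from the paper).
toy_perplexity = {5: 900, 6: 840, 7: 780, 8: 720, 9: 600, 10: 595, 11: 592}
# Toy document-topic distributions over three topics (illustrative only).
toy_doc_topic = [
    (0.90, 0.05, 0.05), (0.05, 0.90, 0.05), (0.05, 0.05, 0.90),
    (0.80, 0.10, 0.10), (0.10, 0.80, 0.10), (0.10, 0.10, 0.80),
]

print(elbow_topic_count(toy_perplexity))
print(kmeans(toy_doc_topic, 3))
```

In practice the document-topic rows would come from a fitted LDA model, and a library K-means with several random restarts would be a more robust choice than this fixed initialization.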

EID Index Layer Weight Calculation
Based on the Word2Vec model, we used the skip-gram algorithm to train the extended words and obtain a comprehensive word list. The text similarity of the extended terms was then calculated with the TF-IDF model for the EID index system, as shown in Table 6.
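As an illustration of the TF-IDF similarity step, the following minimal sketch builds TF-IDF vectors for tokenized word lists and compares them by cosine similarity. The smoothed IDF formula, the function names, and the toy word lists are assumptions made for this example; the weighting variant used in the study is not specified here.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            # Smoothed IDF (an assumed variant): log((1 + N) / (1 + df)) + 1.
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical word lists: one indicator and two candidate extended word sets.
toy_docs = [
    ["wastewater", "discharge", "emission"],
    ["wastewater", "discharge", "sewage"],
    ["tax", "relief", "subsidy"],
]
vecs = tfidf_vectors(toy_docs)
print(cosine(vecs[0], vecs[1]))  # shares two terms with the indicator
print(cosine(vecs[0], vecs[2]))  # shares no terms with the indicator
```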

Automatic Scoring of EID Quality
Python was used to read the environmental information content of the sample enterprises, and machine scoring was conducted according to the experimental process. The distribution of the EID index across the heavy pollution industries is shown in Table 7 and Figure 8.
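The machine scoring step can be pictured with a simplified sketch: each indicator is credited with its weight when any of its keywords appears in the report text. The indicator names, keyword lists, and weights below are hypothetical, and plain substring matching stands in for the keyword-combination and window logic of the actual scoring system.

```python
def score_report(text, indicators):
    """Score one report against a keyword-based indicator table.

    `indicators` maps an indicator name to (keywords, weight). An indicator
    earns its full weight if any keyword occurs in the text, else zero.
    Returns the per-indicator scores and their total.
    """
    lowered = text.lower()
    scores = {}
    for name, (keywords, weight) in indicators.items():
        # Substring matching is a simplification: it ignores word boundaries.
        hit = any(keyword.lower() in lowered for keyword in keywords)
        scores[name] = weight if hit else 0.0
    return scores, sum(scores.values())

# Hypothetical indicators and weights, for illustration only.
toy_indicators = {
    "pollution discharge": (["wastewater", "emission"], 0.5),
    "penalty expenses": (["fine", "penalty"], 0.3),
    "tax relief": (["tax relief"], 0.2),
}
toy_text = "The company treated wastewater and paid an environmental fine."

toy_scores, toy_total = score_report(toy_text, toy_indicators)
print(toy_scores)
print(toy_total)
```

A production version would tokenize the report and match whole words or keyword combinations, since substring matching can produce false hits (for example, "fine" inside "refined").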

Constructing the EID-Surface
The purpose of constructing the EID-Surface is to represent all results in the EID-Matrix graphically. The vertical axes of the EID-Surface show the strengths and weaknesses within the EID scores (in Appendix A, Table A1) in a multidimensional coordinate space. The horizontal axes represent the 24 sub-variables distributed across the 9 main variables. The construction of the EID-Surface is based on the EID-Matrix results. The EID-Matrix is a three-by-three matrix containing the individual results of all 9 main variables and 24 sub-variables.
Visualizing the EID index as an EID-Surface shows the results intuitively through a matrix transformation of the nine main variables designed in this paper. Considering the symmetry of the matrix and the balance of the EID-Surface, the nine main variables are arranged into a matrix of order three. The area of the EID-Surface is calculated using Equation (7).
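The arrangement of the nine main-variable results into a matrix of order three can be sketched as follows. The row-major ordering and the function name are assumptions made for illustration; the sketch does not reproduce Equation (7), which is not restated here.

```python
def eid_matrix(main_scores):
    """Arrange nine main-variable scores into a 3 x 3 matrix (row-major)."""
    if len(main_scores) != 9:
        raise ValueError("expected exactly nine main-variable scores")
    return [list(main_scores[i:i + 3]) for i in range(0, 9, 3)]

# Placeholder scores 1..9, used only to show the layout.
for row in eid_matrix([1, 2, 3, 4, 5, 6, 7, 8, 9]):
    print(row)
```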
This study presents the EID-Surface charts only for the industries with the two highest and the two lowest EID scores. The surface graphs make the scores on each indicator layer of an industry easier to see (in Appendix A, Table A1). Figure 9a,b shows the EID-Surfaces of C31, the ferrous metal smelting and rolling processing industry, and B07, the oil and gas extraction industry, which have the highest scores. Figure 10a,b shows the EID-Surfaces of C17, the textile industry, and C27, the pharmaceutical manufacturing industry, which have the lowest scores.
It can be seen that the environmental information disclosed by the ferrous metal smelting and rolling processing industry and the oil and gas extraction industry was very comprehensive and detailed, and the relevant environmental information was disclosed in the annual and social responsibility reports. The textile, pharmaceutical, paper, and leather, fur, and feather industries accounted for the largest share of low EID quality scores.
Under the nine leading indicators, 98% of companies disclosed the corporate governance structure item "environmental policies, guidelines, and concepts," which is the highest disclosure rate. The construction, investment, and operation costs of environmental protection facilities and the "three wastes" treatment approach ranked next, with disclosure rates that all exceeded 80%. The disclosure rate of environmental responsibility information indicators, such as "emissions and emission reductions" and "emission reduction targets," was below 30%. Overall, the sample enterprises' disclosure of environmental information needs to be improved, since many vital indicators have not been sufficiently disclosed.

The Calculation of the EID Index
To better compare the degree of environmental information disclosure across the 16 heavy pollution industries, we set the maximum possible score for each sample company's environmental information disclosure to 24 and evaluated it with Equation (8). The maximum comprises 7 points for corporate governance structure, 1 for environmental responsibility, 5 for environmental pollution discharge, 2 for waste disposal, 3 for environmental governance expenditure, 1 for acceptable environmental expenditure, 1 for environmental protection expenditure, 1 for environmental protection tax relief, and 3 for environmental rewards. The EID quality evaluation indices for China's heavily polluting industries in 2013-2017 are listed in Table 8.
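The normalization described above, a company's raw score divided by the maximum of 24 and expressed as a percentage, can be checked with a short sketch. The nine maxima are taken from the text; the dictionary layout, the function name, and the example input are assumptions made for illustration.

```python
# Maximum score per main indicator, as listed in the text (they sum to 24).
MAX_SCORES = {
    "corporate governance structure": 7,
    "environmental responsibility": 1,
    "environmental pollution discharge": 5,
    "waste disposal": 2,
    "environmental governance expenditure": 3,
    "acceptable environmental expenditure": 1,
    "environmental protection expenditure": 1,
    "environmental protection tax relief": 1,
    "environmental rewards": 3,
}

def eid_index(scores):
    """Return the EID index in percent: raw EID score / 24 * 100.

    `scores` maps indicator names to achieved points; indicators that are
    missing from `scores` count as zero.
    """
    for name, value in scores.items():
        if name not in MAX_SCORES:
            raise KeyError(f"unknown indicator: {name}")
        if not 0 <= value <= MAX_SCORES[name]:
            raise ValueError(f"score out of range for {name}: {value}")
    raw = sum(scores.get(name, 0) for name in MAX_SCORES)
    return raw / sum(MAX_SCORES.values()) * 100

# Hypothetical company that only discloses its full governance structure.
print(round(eid_index({"corporate governance structure": 7}), 2))
```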
$$\text{EID-Index} = \frac{\text{EID}}{24} \times 100\% \qquad (8)$$
As shown in Table 8, the average EID quality evaluation index of the listed companies in China's heavily polluting industries indicated that their overall EID quality is insufficient. There were also differences in the EID quality evaluation index between the 16 heavily polluting industries. Listed companies in the non-ferrous metal smelting industry and the oil and gas extraction industry reported the highest-quality corporate environmental information, significantly higher than that of other industries. As the focus of environmental protection, these enterprises face significant pressure from the government and the public. Therefore, they tend to focus on improving their EID quality.
Additionally, most of these enterprises have a large production scale and solid environmental management capabilities, and excellent EID helps strengthen their market competitiveness. Some listed companies in less-polluting industries, such as pharmaceutical manufacturing, exhibited poor EID performance. Most listed companies in highly polluting industries are also gradually improving the quality of their environmental information. The EID index is related to the enterprises' commitments, capacity building, business performance, and changes in external environmental regulation.
The disclosure score ratio of the "corporate governance structure" indicator was the highest; this information was disclosed the most. The ratio for "environmental responsibility information" was the lowest, significantly lower than those of the other indicators. The listed companies in China's heavily polluting industries were more active and comprehensive in disclosing their environmental protection requirements, environmental protection concepts, and other relevant information. They tended to respond proactively to China's rising enthusiasm for environmental protection in recent years, formulated environmental protection policies, and have been committed to establishing the company's environmental protection image. Listed companies in China's heavily polluting industries rarely disclose their environmental requirements due to a lack of environmental constraints. It should be noted that the "environmental pollution emissions" indicator score was very high because listed companies in China's heavily polluting industries have actively responded to the latest EID policy requirements announced by the China Securities Regulatory Commission and have actively disclosed data on the discharge of the various "three wastes" pollutants.
The quantity and quality of their EID and public disclosure information were analyzed from the 2015 and 2016 annual, corporate social responsibility, sustainable development, and environmental reports, and the index scores were ranked. Thus, this report was authoritative in the industry.
The results of machine evaluation and manual subjective evaluation were further compared to verify the reliability of the results. The automatic scoring approach was used to reassess the EID scores for the same sample of 172 listed companies in 2015 and 2016. Because the two index systems differ in construction and scoring, the scores cannot be compared directly.
Therefore, rank-based nonparametric tests were conducted to compare the rankings of the same sample of companies [70]. The rank-sum test does not require the values to follow a particular parametric distribution.
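For readers unfamiliar with the rank-sum test, the following minimal sketch computes a two-sided p-value using the normal approximation. It is an illustration under stated assumptions, not the procedure used to produce the reported results: ties receive average ranks, no tie correction is applied to the variance, and the function names and toy inputs are hypothetical.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def rank_sum_p_value(sample_a, sample_b):
    """Two-sided rank-sum p-value via the normal approximation."""
    n_a, n_b = len(sample_a), len(sample_b)
    ranks = average_ranks(list(sample_a) + list(sample_b))
    w = sum(ranks[:n_a])                       # rank sum of the first sample
    mean_w = n_a * (n_a + n_b + 1) / 2         # expected rank sum under H0
    var_w = n_a * n_b * (n_a + n_b + 1) / 12   # no tie correction (assumption)
    z = (w - mean_w) / math.sqrt(var_w)
    return math.erfc(abs(z) / math.sqrt(2))    # equals 2 * (1 - Phi(|z|))

# Toy inputs: identical samples give no evidence of a difference.
print(rank_sum_p_value([1, 2, 3], [1, 2, 3]))
print(rank_sum_p_value([1, 2, 3], [4, 5, 6]))
```

The normal approximation is only rough for samples this small; an exact test or a statistics library would be preferable for real data.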
The scoring results were sorted by company stock code to obtain two sets of sequences, and the correlation coefficient between the two sequences was used to measure the consistency of the two scoring methods. The correlation coefficient is
$$r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2}\,\sqrt{\sum_{i}(y_i - \bar{y})^2}},$$
where $x_i$ and $y_i$ are the manual and machine scores, respectively, and $\bar{x}$ and $\bar{y}$ are the sample means. The new method of evaluating the quality of EID based on machine scoring provided reliable results on the quality of corporate EID. The corresponding results of the rank-sum test are shown in Table 9. For a significance level of 0.01, the p-values for each year were always greater than the critical value of 0.05.
Therefore, there was no significant difference between the evaluation results of the two approaches, and the new method of EID quality evaluation based on machine scoring is accurate. Additionally, the proposed method can save more time and achieve better accuracy than the traditional subjective evaluation method.
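The consistency check between manual and machine scores relies on a correlation coefficient defined through the scores and their sample means. A minimal sketch of that computation, assuming the standard Pearson form, is shown below; the function name and the toy score lists are illustrative and are not data from this study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long score sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)

# Hypothetical manual and machine scores for the same four companies,
# already aligned by stock code.
manual_scores = [10, 14, 18, 22]
machine_scores = [11, 15, 17, 23]
print(round(pearson(manual_scores, machine_scores), 3))
```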

Conclusions
In conclusion, the key findings suggest that the proposed approach helps investigate firms' strategic sustainability intentions based on the critical issues identified in their EID quality. This paper proposes a new method, based on text mining tools, to evaluate the subjectivity and objectivity of environmental information disclosure, enrich the research content, and open up more text data sources. The research content and information sources of big text data in environmental management can be further refined and enriched.
First, the TM method was used to improve the current EID index system. The content of the EID index system was adjusted dynamically, and complete corpora and scoring systems for EID quality were constructed. The EID quality scoring system could generate indices at all levels, together with the corresponding indicator-layer weights, under various classification dimensions and according to the scoring standards required. The analysis no longer requires simple manual counting and judging or a simplified mathematical conversion method. As a result, extracting EID index scores was convenient and quick, and the accuracy of the index scoring standard was significantly improved.
Second, the calculation of the comprehensive evaluation index of enterprise EID was enriched by K-means clustering based on the LDA topic model and by the Word2Vec model. Compared with the current manual subjective index calculation model, the proposed method explores and applies the EID index calculation more comprehensively, which ensures comprehensive, objective, and valuable information disclosure and overcomes the limitations of the previous manual calculation method. The combination of multiangle measurements based on TM could reflect the content and extent of environmental information disclosed in enterprise reports more accurately and objectively. The resulting EID index was more representative and convincing, and the proposed method could comprehensively represent an enterprise's EID status.
Third, machine scoring resolves the inefficiency and limitation of human subjective judgment regarding corporate EID quality. According to the machine scoring results, China's listed companies in heavily polluting industries exhibited moderate EID performance. Although most companies are striving to improve, only some high energy-consuming industries achieved a relatively high EID score. The environmental protection requirements of listed companies in China's heavily polluting industries are rarely disclosed, and the Chinese government lacks sufficient constraints regarding corporate environmental liability requirements.
The main limitation of our work is that the quantity and quality of companies' environmental information disclosure are mined from text content alone. Future research should try to apply more advanced text mining instruments, such as LSA, LDA, and other latent topic models, that come closer to human language and lead to results more similar to those obtained from content analysis.
This study offers several possibilities for further work. Machine learning technology can be improved to better mine and assess EID performance and to explore its internal and external mechanisms. Additionally, the environmental information disclosure corpus can be further improved and better applied to evaluation methods. This study conducted empirical research only on enterprises in heavy pollution industries. Therefore, to analyze the accuracy and reliability of the method, the scope of the empirical research samples should be further expanded. Finally, the quality of enterprises' environmental information disclosure will significantly affect the government's formulation of relevant policies and public behavior. Therefore, it is also necessary to study the economic impacts of EID quality measurement on enterprises and policy choices.
In 2020, the China Securities Regulatory Commission revised the Guidelines on Investor Relations Management for Listed Companies to include communication on "information on environmental protection, social responsibility, and corporate governance of companies." Our team will continue to follow the development of this ESG practice in China, and ESG-related research will become a future research direction of our team.