Understanding Corporate Sustainability Disclosures from the Securities Exchange Commission Filings

Abstract: As sustainability becomes fundamental to companies, voluntary and mandatory disclosures of corporate sustainability practices have become a key source of information for various stakeholders, including regulatory bodies, environmental watchdogs, nonprofits and NGOs, investors, shareholders, and the public at large. Understanding sustainability practices by analyzing a large volume of disclosures poses major challenges, given that the information is mostly in the form of text. Applying machine learning and text analytic methods, we analyzed 25,428 disclosure reports for the period of 2011 to 2020, extracted from the Securities and Exchange Commission (SEC) filings and made available at the Ceres website via application programming interfaces (APIs). Our study identified six industry clusters from the K-means method and six main topics from the latent Dirichlet allocation (LDA) method that related to the disclosure of climate-change-related environmental concerns. Both methods produced overlapping results that further reinforce and enhance our understanding of climate-change-related disclosure at various levels, such as sector, industry, and topic. Our analysis shows that companies are concerned primarily with the topics of gas emission, carbon risk, climate change, loss and damage, renewable energy, and financial impact when disclosing climate-change-related issues to the government. The study has implications for corporate sustainability practices, the communication and dissemination of such practices to stakeholders at large, and furthering our understanding of sustainability in general.


Introduction
Sustainability is a universal concern [1,2]. Environmental pollution, inequity, injustice, and poverty continue to be faced by millions of people across the world [3]. While sustainability is gaining attention at the macro level (e.g., at the United Nations SDG, regional, and country levels), large companies are still focused on maximizing wealth based on shareholder theory, rather than taking the negative impacts of climate change into consideration [4]. Heavy industrial development is still a threat to sustainable development in many developing countries [5]. However, on the brighter side, corporate sustainability is gaining increased attention due to pressure from various stakeholders [6]. While corporate sustainability has been variously defined, the idea generally includes the pledge of a business to contribute to sustainable execution of business operations, working with multiple stakeholders, such as customers, suppliers, employees and their families, the government, nonprofits, NGOs, watchdogs, the community in which it does business, and the greater society, to enhance the quality of living [6]. An outgrowth of corporate engagement in sustainability is that companies are facing increasing tension on two issues: the demand for more disclosure and transparency and the need to adopt good corporate governance practices [7]. This stakeholder- and activist-fueled ultimatum for business answerability and clarity [8,9] has
Sustainability 2023, 15, 4134 4 of 32
mission on Environment and Development, 1987). Though this definition emphasizes the environment, the report is also aware that "development cannot be said to be sustainable if it is not equitable, or if it does not meet the pressing needs of the majority of the inhabitants of the globe." Not only must sustainability engage the current generation, but it must also be mindful of the needs of future generations. This view of sustainability has been accepted universally [26,49].
Though the Brundtland report provides a working definition, the expression "corporate sustainability" generally refers to companies [49]. While some researchers have focused on the environmental components of sustainability, others have highlighted the social dimensions. Yet others have adopted a more holistic approach integrating environmental, social, and economic issues, without any one issue dominating [49,50]. Corporate sustainability is generally perceived to be the epitome of economic function, environmental performance, and social responsibility [15,51]. Focusing on environmental performance, the related practices address the consequences of the expending of natural resources and the production of gaseous emissions. Generally, it would be expected that the two be at the minimum levels possible to safeguard the overall ecosystem [15]. It follows then that positive environmental practices are associated with diminishing environmental destruction by preserving natural resources, including energy [52], and efficient waste disposal, for example, via recycling [53]. The typical metrics of measuring the environment are air emissions, biodiversity, energy usage, natural resource loss, solid waste, transportation, and water use and discharge [51,54]. The social practice has the goal of improving the quality of life of local communities [15,52]. Economic sustainability practices support long-term liquidity and healthy returns on investments to the stakeholders [15]. However, the latter two are not the targets of this study. Companies are also motivated to immerse themselves in sustainability due to regulation [51]. In addition, they are susceptible to encouragement from customers, employees, and suppliers [55], as well as stress from management, since long-term investments in sustainability may indeed promote financial performance [56]. All in all, it is settled that companies the world over are engaged in CSR/ESG as indicators of sustainability. 
The key challenge then is, how do these companies' practices become more visible? The answer lies in the timely and effective disclosure of their sustainability practices, namely filings and reports [49].

Sustainability Disclosures and Reporting
Sustainability and its disclosure go together. To wit, companies share their sustainability-related information to enable transparency and thereby create a positive image of themselves. Numerous theories attempt to explain the dynamics of sustainability disclosures. Under legitimacy and stakeholder theories, the disclosure of financial, environmental, and social information (i.e., corporate sustainability disclosure, CSD) is part of the ongoing conversation between a company and its stakeholders. A disclosure facilitates the flow of relevant information about a company's sustainability-related initiatives, which helps the company validate its attitude, educate the public, and transform expectations [57,58]. The prospect of environmental reporting is fueling increased awareness of the promise of sustainability [59]. Researchers have therefore emphasized the imperative to inquire further into the various dimensions and nuances that can leverage disclosure practices [59,60]. As indicated before, the literature has confirmed the companies' use of sustainability disclosure to manage their reputation risk [59,61]. Corporate sustainability disclosure has also been used as a channel to connect with the various stakeholders. It implies the company may communicate with different stakeholders, including regulatory agencies, nonprofits and NGOs, activist shareholders, institutional investors, and others [62][63][64][65][66]. Researchers, realizing the value of these disclosures, have sought to extract insight from the companies' dissemination via various publication outlets. While the disclosures may provide biased views, namely those of management, one nevertheless gets at least an inkling of the company's practices [26,67].
Numerous researchers have studied various aspects of corporate sustainability reports (disclosures) by adopting different [3,68]. Sobhani et al. (2011), for example, identified the performance indicators disclosed in corporate sustainability reports [3,65]. Managers can use sustainability reports to obtain accurate knowledge of a company's sustainability efforts, rather than relying on third parties. Similarly, researchers can analyze sustainability reports to test hypotheses entailing sustainability reporting and performance. Studies that have conducted content analysis of sustainability reports have investigated the evolution of report content quality [25,65] or trends in sustainability reporting in industries [69,70] or countries. Tate et al. (2010) analyzed sustainability reports using content analysis with automated software to uncover supply chain sustainability themes, which were compared against companies' geographic location and revenues [71]. Montabon et al. (2007) also conducted content analysis of corporate sustainability reports based on environmental management practices identified in the literature [72]. However, there were limitations as to how the analysis was carried out [61]. Several previous studies have measured companies' level of disclosure simply through the number of relevant words, sentences, or pages in sustainability reports on different themes [69,70]. The drawback of counting space allocations for certain words or themes is that such an approach fails to capture the information in the reports [65]. In a similar vein, some researchers have used computer-aided text analysis to uncover supply chain sustainability (e.g., [71]). However, many sustainability reports present information graphically, which limits the use of computer-aided text analysis [65]. In addition, sustainability reporting in Australia, for example, is largely voluntary and most reports/disclosures are not subject to the process of assurance.
Overall, as researchers continue to use several qualitative techniques, due to a paucity of quantitative data that would be more amenable to analysis, machine learning and text analytics have emerged as the leading methods with immense potential in this domain [18,.

Machine-Learning-Based Text Analytics
The increasing number of sustainability reports and growing interest among the diverse stakeholder groups have created new challenges for policymakers and researchers in measuring the effectiveness and exploring the information of these reports, given the diversity of the reports' content and structures [30]. Machine-learning-based text analytics has emerged as a viable technique to analyze large corpora of unstructured data, such as textual data. Multiple tasks, such as text categorization, text clustering, entity extraction, document summarization, and entity relation modeling, can be implemented via text mining [75]. Recently, machine-learning-based text analytics has garnered enthusiasm in its application to the analysis of sustainability and CSR/ESG-related textual disclosures and reports [18,26,27]. Considering the wide prevalence of unstructured data, extracting knowledge from these corporate disclosures is valuable for both application and research. For instance, Bayesian-combined text mining has been applied to the study of economic, environmental, and social factors in a large sample of sustainability reports, revealing that the content of the reports is fragmented even under the Global Reporting Initiative [30]. Text mining using multivariate discriminant analysis has been used to investigate sustainability reports, indicating that firms' disclosures differ across industrial sectors. Text mining has also been implemented in the comparative analysis of sustainability reports before and after an industrial crisis, demonstrating a decline in the quantity of disclosed information during the year of the crisis [41]. Computer-based content analysis has been used to explore sustainable supply chain management trends based on news articles and sustainability reports [76]. Green IT practices across industries have also been researched using dictionary-based text mining [77].
The trend and pattern of firms' sustainability reports in the Indonesian construction industry have been studied via text mining, enabling stakeholders to interpret the firms' sustainability strategies [78]. Moreover, semi-supervised text mining using association rules has been adopted to study the maritime industry's sustainability reports, providing managerial implications for the maritime industry to satisfy SDGs [79]. Hence, applying a text-mining method to explore sustainability disclosure information is acknowledged as appropriate. Due to the prevalence of unstructured data, extraction of knowledge from these corporate disclosures is highly significant for research as well as for application.
The power of natural language processing (NLP), machine learning (ML), and text analytics together in automatically discovering patterns and trends in textual documents should not be underestimated. By using these methods together, the researcher can convert large volumes of unstructured data ("big data") into an ordered format that is appropriate for the application of ML techniques, and then conduct further analysis [80]. Using text analytics, researchers can extract various facets and the relationships among them. In the current research context, SEC filings about sustainability offer a justified arena of study. They provide a substantial body of ill-structured textual data. They also reflect the thinking of top management and therefore present a valuable repository to investigate. Since businesses disseminate their allegiance to various societal missions, it is only natural that the disclosures be probed in detail.
The methodology of machine-learning-based text analytics has been applied in numerous streams of research to analyze unstructured textual data, such as tweets on vaccination [81], shareholder resolutions about sustainability [18], sustainability disclosure on the web [17,26,27], or legal analysis [82]. To reiterate, machine learning and text analytics together offer powerful analytical tools to interpret big text data [26,27,32,83]. The understanding gained from the study would guide top executives, shareholders, institutional and individual investors, and watchdogs to make informed decisions.
Compared with other text-mining methods, LDA is the most straightforward probabilistic approach to model topics from textual documents [26,27]. LDA generates a list of topics and assigns each document a weight for every topic, which makes it suitable for handling large datasets and producing insightful results. TF-IDF, cluster analysis, and LDA are three of the more appropriate and widely used ML-based text analytic methods and are therefore used in this study [26,27].

Methods, Results, and Analysis
This text-analytics-based research study delves into the sustainability disclosures of companies to understand the expansive sustainability issues, practices, and trends. The origin of the disclosures' data is the website of Ceres, a sustainability-focused nonprofit organization that has made available several searchable databases (https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/ (accessed on 1 September 2021)) that offer a rich source for analysis. According to Ceres, its SEC Sustainability Disclosure Search tool can assist in gaining insight into how companies handle the perils as well as the benefits that may arise from issues such as climate change, carbon asset risk, water availability and quality, and hydraulic fracturing (ceres.com). The tool searches the text of SEC annual filings (10-K, 20-F, and 40-F) and readily identifies pertinent disclosures. According to Ceres, it then distinguishes the information, identifies the location of the disclosure in the filing, and makes it all available in the searchable database (ceres.com) [84,85]. Thus, our data ultimately come from a legitimate source, namely the Securities and Exchange Commission, a government entity. The database makes it easy for researchers to access the disclosures via an API and to analyze the large amount of text data to gain insight into corporate sustainability practices (https://www.ceres.org/resources/tools/sec-sustainabilitydisclosure-search-tool (accessed on 1 September 2021)). Such a tool is also useful to investors and other stakeholders interested in knowing more about the opportunities and risks presented by climate change and other ESG issues in companies. According to Ceres, these disclosures in the SEC filings can be compared with the voluntary disclosures that are made via other channels, such as those made to non-profits.
To reiterate, machine learning and text analytics together offer powerful analytical tools to interpret and study a large corpus of textual data in a productive way [26,32,83]. The understanding accrued from the study would guide top executives, shareholders, institutional and individual investors, and watchdogs to make informed decisions. The text data were retrieved using the Ceres API, and they were cleaned using advanced data-cleansing methods. Preprocessing steps were then applied. However, we focused only on the disclosures relating to climate change risks filed between 2011 and 2020. The disclosure information available through the API, such as the company name, division, industry, year, and ticker, and the disclosure content were all in JSON format. Therefore, Python was used to extract all file name numbers and climate change risk disclosure content, where available. Data were retrieved as of 30 September 2020. There were a total of 53,418 files (disclosures) provided by 5347 companies. However, only 48% of the files mentioned climate change risks. So, machine-learning-based text analysis techniques were applied only to those 25,428 files. First, we conducted descriptive data analysis to obtain a panoramic view of the disclosures. Next, we conducted topic modeling and other text analytic methods. For example, we constructed the term frequency-inverse document frequency (TF-IDF) model and K-means to group industries with similar foci.
Figure 1 outlines our overall research methodology. Table 1 summarizes the disclosure rate for the period 2011-2020 that we extracted and synthesized from the files. It shows the percentage of companies that disclosed climate-change-related information in their filings for each of the years. As seen, the number of companies providing disclosures related to climate change was relatively stable during the 10 years of this study, and most of the disclosures were provided by publicly traded companies.
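The extraction and filtering step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's actual code: the JSON field names (`company`, `year`, `climate_text`) and the sample records are hypothetical and do not reflect the real Ceres API schema. The sketch computes the share of filings per year that contain climate text, which matches a company-level rate only when each company files once per year.

```python
import json
from collections import defaultdict

# Hypothetical records; field names and values are assumptions for illustration.
RAW_JSON = """
[
  {"company": "Alpha Corp", "year": 2017, "climate_text": "Severe weather may disrupt operations."},
  {"company": "Beta Inc", "year": 2017, "climate_text": ""},
  {"company": "Alpha Corp", "year": 2018, "climate_text": "Carbon regulation may raise costs."},
  {"company": "Beta Inc", "year": 2018, "climate_text": "Emission limits may apply."}
]
"""


def climate_filings(records):
    """Keep only the filings whose climate disclosure text is non-empty."""
    return [r for r in records if r.get("climate_text", "").strip()]


def disclosure_rate_by_year(records):
    """Return {year: share of filings that contain climate disclosure text}."""
    total = defaultdict(int)
    with_text = defaultdict(int)
    for r in records:
        total[r["year"]] += 1
        if r.get("climate_text", "").strip():
            with_text[r["year"]] += 1
    return {year: with_text[year] / total[year] for year in sorted(total)}


records = json.loads(RAW_JSON)
kept = climate_filings(records)
rates = disclosure_rate_by_year(records)
print(len(kept), rates)
```

The list of dictionaries in `kept` could be passed to a data frame constructor for the later analysis steps.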
Likewise, it was observed that all disclosures follow the same structure in the Ceres API. For example, Figure 2 shows the disclosure information for the company 1ST SOURCE CORP in 2017, which is under the "Banks and Financial Service" industry group and the "Finance, Insurance, and Real Estate" division. It also shows all disclosure contents related to climate change for the company in the 2017 annual filing. As mentioned previously, Python was used to retrieve all the data from Ceres, which were then transformed into a structured DataFrame. Table 2 shows the variables extracted from each company's disclosure. Besides the basic information about a company, such as the name, industry, and ticker, Ceres also provided four risk ratios. These ratios were extracted as well. They were calculated by dividing the frequency of relevant keywords/phrases in each risk category by the total frequency of keywords/phrases appearing in the excerpt. The four ratios therefore add up to 1 (100%).
The four risk ratios were pre-defined by Ceres. Ceres highlighted the relevant keywords/phrases in the excerpt and classified them into four risk categories. Ceres then calculated the distribution of terms across the risk categories, which provided the values of the four risk ratios for each report (disclosure). The sample terms and phrases for each risk category are shown in Table 3. Appendix A provides a snapshot view of the industry (division) of companies with disclosures in the year 2020.
Table 3. Sample terms and phrases for risk categories (source: ceres.org (accessed on 1 September 2021)).

Risk Categories | Sample Terms and Phrases
Non-specific climate disclosure | Climate change, fossil fuels, consumes significant energy
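The risk-ratio calculation described above (keyword frequency per category divided by the total keyword frequency in the excerpt) can be sketched as follows. This is only an illustration of the arithmetic, not Ceres's implementation. Apart from "Non-specific climate disclosure" and its sample terms, the category names and term lists below are hypothetical placeholders.

```python
def risk_ratios(excerpt, category_terms):
    """Share of matched keyword occurrences that fall in each risk category."""
    text = excerpt.lower()
    counts = {
        category: sum(text.count(term.lower()) for term in terms)
        for category, terms in category_terms.items()
    }
    total = sum(counts.values())
    if total == 0:
        # No keyword matched: report zeros rather than dividing by zero.
        return {category: 0.0 for category in counts}
    return {category: n / total for category, n in counts.items()}


# Only the first category name comes from Table 3; the others are placeholders.
CATEGORY_TERMS = {
    "Non-specific climate disclosure": ["climate change", "fossil fuels"],
    "hypothetical category B": ["flood"],
    "hypothetical category C": ["carbon tax"],
    "hypothetical category D": ["renewable"],
}

excerpt = (
    "Climate change and reliance on fossil fuels expose us to risk. "
    "A flood could damage plants, and climate change may bring a carbon tax."
)
ratios = risk_ratios(excerpt, CATEGORY_TERMS)
print(ratios)
```

Because every matched occurrence is assigned to exactly one category, the ratios sum to 1 whenever at least one keyword is found.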

Text Analytics
Next, machine-learning-based text analytics was applied to examine the disclosures. Applying ML and text analytics required considerable data engineering and data-preprocessing work. The retrieved data for each disclosure were saved as a text file prior to vectorization. A small number of disclosures were removed because of low relevance/materiality or broken links. A caveat is that text data complicate predictive analytic modeling in two ways. First, textual data cannot be used directly as input to many numerical or quantitative models. Therefore, an NLP program was run to transform the textual content into distinct component elements for further analysis. Second, text-based datasets are larger in volume than quantitative datasets. Therefore, a robust model requires extracting only the most relevant portions of the data. In the data engineering step, the granular information in the summary section was converted into simple text output (files) through the removal of typical punctuation marks, spaces, numbers, and standard stop words using the Natural Language Toolkit (NLTK). Next, the text was changed to lowercase using the NLTK and TextBlob (https://textblob.readthedocs.io/en/dev/ (accessed on 1 September 2021)) for uniformity. Tokenization was applied to this engineered data to break up the sequence of strings into words or word groups [18,81]. Next, lemmatization was carried out to reduce words to their base form (e.g., "bought" and "buying" were replaced with "buy"). Lemmatization groups together the various inflected forms of each word so that they can be analyzed as a single item [18,81]. The benefit lies in finding meaning within and between the grouped words.
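The preprocessing steps above can be sketched in a few lines. The sketch below is a dependency-free stand-in, not the study's pipeline: the small `STOP_WORDS` set and the `LEMMA_LOOKUP` table are hypothetical substitutes for the NLTK stop-word list and a dictionary-based lemmatizer, and the text is lowercased first so that stop-word matching is case-insensitive.

```python
import re

# Hypothetical stand-ins for a full stop-word list and a real lemmatizer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "we", "our"}
LEMMA_LOOKUP = {
    "bought": "buy",
    "buying": "buy",
    "emissions": "emission",
    "regulations": "regulation",
}


def preprocess(text):
    """Lowercase, strip punctuation and digits, tokenize, drop stop words, lemmatize."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # replace punctuation and numbers with spaces
    tokens = text.split()                  # whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMA_LOOKUP.get(t, t) for t in tokens]


sample = "In 2019, we bought credits; buying more is planned to offset the emissions."
print(preprocess(sample))
```

With this toy lookup, both "bought" and "buying" map to "buy", mirroring the example in the text.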
The pandas package (https://pandas.pydata.org/ (accessed on 1 September 2021)) was used to screen and arrange the text files into data frames for ready analysis. The sklearn package (https://scikit-learn.org/stable/index.html (accessed on 1 September 2021)) was then used to iterate the output. When planning to conduct text analytics, a researcher's first thought is typically to use one of the most rudimentary NLP models, namely the Bag of Words. However, such models fail to capture the syntactic relationship between words. For example, suppose TF-IDF (term frequency-inverse document frequency) was calculated based on only the Bag of Words. This model will not be able to capture the difference between "I live in the United States," where "states" is a noun, and "He states his intention," where "states" is a verb. Part-of-speech (POS) tagging is the method by which words in a corpus are marked up with the corresponding part-of-speech tag based on their definition and context. For example, in the sentence "Give me your answer," answer is a noun, but in the sentence "Answer the question," answer is a verb. POS tagging can differentiate words based on their functions in sentences. There are 36 different types of POS tags (as in the Penn Treebank tag set), and the common tags are adjective, verb, noun, proper noun, adverb, and others. In this study, POS tagging techniques were applied to preprocessed tokens and only tokens with noun and proper noun tags were retained. The rationale is that nouns and proper nouns carry more meaningful information than other tags. Figure 3 shows the top 55 nouns and proper nouns and their frequencies in the overall disclosure reports during 2011 to 2020. For a detailed count of each term, please refer to Appendix B.
As shown in Figure 3, the darker the color and the larger the size, the more frequently the word appears in the overall disclosure reports. Energy-related words, such as gas, emission, power, and oil, formed a substantial portion of the content. Climate-change-related words, such as greenhouse, ghg, water, air, and carbon dioxide, were also quite significant. Furthermore, regulation-related terms, such as legislation, compliance, regulation, and agreement, were prominent as well.
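The noun-filtering and frequency-counting step behind Figure 3 can be illustrated as follows. This is a hedged sketch: it assumes the tokens have already been tagged with Penn Treebank tags (for example by a POS tagger such as the one in NLTK), and the tagged sample below is hypothetical.

```python
from collections import Counter

# Penn Treebank tags for common and proper nouns, singular and plural.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def top_nouns(tagged_tokens, k):
    """Keep noun and proper-noun tokens and return the k most frequent with counts."""
    nouns = [word.lower() for word, tag in tagged_tokens if tag in NOUN_TAGS]
    return Counter(nouns).most_common(k)


# Hypothetical (word, tag) pairs; a real pipeline would obtain them from a tagger.
tagged = [
    ("emission", "NN"), ("rules", "NNS"), ("affect", "VBP"),
    ("gas", "NN"), ("prices", "NNS"), ("and", "CC"),
    ("emission", "NN"), ("limits", "NNS"),
    ("states", "VBZ"),   # verb use of "states": dropped
    ("States", "NNPS"),  # proper-noun use of "States": kept
]
print(top_nouns(tagged, 3))
```

In the study the cutoff was the top 55 terms; `k` plays that role here.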
The term frequency-inverse document frequency (TF-IDF) technique mentioned before was next applied to calculate the weight of each term to denote its relative importance in a document. This information retrieval approach assigns each term two scores, its term frequency (TF) and its inverse document frequency (IDF), and computes the weight of the term as the product of these scores. TF-IDF is a statistical measure used to assess the importance of a word to a document within a corpus. The weight of a word increases with the frequency of its appearance in a document [29,44]. Simultaneously, it decreases with the frequency of its presence across the corpus. The primary thrust of TF-IDF is that when a word or phrase appears frequently in one document and seldom in other documents, it discriminates well between documents and is suitable for classification problems. In this study, the companies' disclosures were divided into 31 industry groups to calculate the TF-IDF value for the 55 extracted keywords. The resulting matrix is a data frame with a size of 31 × 55. Figure 4 displays partial examples of the final matrix. There were a total of 31 rows representing the 31 industry groups. Each row contained 55 values, and a high value for a keyword implied a high term frequency weighted by the rarity of the keyword's occurrence across groups. Subsequently, the K-means clustering model was developed to surface the key sustainability concepts.
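To make the TF × IDF product concrete, the sketch below builds a small group-by-keyword matrix using the textbook form, with TF as the relative frequency within a group and IDF as the natural log of the number of groups divided by the number of groups containing the keyword. This is an assumption for illustration: the study does not state which TF-IDF variant was used, and libraries such as scikit-learn apply smoothed and normalized variants by default. The group names and token lists are hypothetical.

```python
import math


def tfidf_matrix(group_tokens, keywords):
    """Return {group: [tf-idf weight per keyword]} using tf * ln(N / df)."""
    n_groups = len(group_tokens)
    # Document frequency: number of groups in which each keyword occurs.
    df = {kw: sum(1 for toks in group_tokens.values() if kw in toks) for kw in keywords}
    matrix = {}
    for group, toks in group_tokens.items():
        row = []
        for kw in keywords:
            tf = toks.count(kw) / len(toks) if toks else 0.0
            idf = math.log(n_groups / df[kw]) if df[kw] else 0.0
            row.append(tf * idf)
        matrix[group] = row
    return matrix


# Hypothetical industry groups, each treated as one pooled document of tokens.
groups = {
    "Oil and Gas": ["gas", "gas", "emission", "oil"],
    "Banks": ["regulation", "emission", "loan", "loan"],
    "Utilities": ["power", "emission", "gas", "power"],
}
keywords = ["gas", "emission", "oil", "power"]
matrix = tfidf_matrix(groups, keywords)
for group, row in matrix.items():
    print(group, [round(v, 3) for v in row])
```

Here "emission" appears in every group, so its weight is zero everywhere, while "oil" and "power" receive high weights in the single group that uses them. In the study the analogous matrix has 31 rows and 55 columns.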
K-means, a widely used ML algorithm for clustering, groups cases based on similarity measures (i.e., the distance between the cases). It is generally applied to segmentation and pattern recognition problems. For this study, six high-level clusters were identified with a word cloud utility. Classification with the KNN classifier (a supervised machine learning method) next ensued. In this application, the data were partitioned into two sets (training and testing datasets) to assess the success of the classifying algorithm. The rationale of clustering is to recognize the varied subclasses (i.e., the clusters) in the data such that the observations (cases) in the same subclass are alike, while the observations in different clusters are distinct from each other. The TF-IDF features of the top 55 keywords for each industry were extracted as our model input variables. K-means clustering was implemented to obtain the clusters, which were then labeled numerically. Since the goal was to find the patterns of keywords that describe different industries, the K-means algorithm was used to build a segmentation of six clusters from among the 31 industry groups, which contained 4461 companies. The model was evaluated using the silhouette score. The silhouette score is an average measure of how close each point is to the other points in its own cluster relative to the points in the neighboring clusters; a positive value indicates that a point is, on average, closer to its own cluster than to the nearest neighboring cluster. The six-cluster model had an acceptable average silhouette score of 0.4, as shown in Figure 5.
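To make the evaluation step concrete, the following sketch computes the average silhouette score for a given cluster assignment. It assumes Euclidean distance and the standard definition s = (b − a) / max(a, b), where a is the mean distance from a point to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The four two-dimensional points are toy values, not the study's 31 × 55 matrix.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so it does not affect the sum)
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (assumption): two tight, well-separated groups of points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # natural grouping
bad = silhouette_score(points, [0, 1, 0, 1])   # deliberately mixed grouping
```

The natural grouping yields a score close to 1, while the deliberately mixed grouping yields a negative score, which matches the interpretation of the score's sign given above.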
The silhouette score is used to appraise the quality of the clusters identified by standard clustering algorithms, such as K-means. The issue addressed is the degree to which observations are clustered with other observations that are like them. The value of the silhouette score varies from −1 to 1. When the score is 1, the clusters are distinct and sharply delineated. A value close to 0 indicates overlapping clusters, with observations in proximity to the decision boundary of the nearby clusters. In this study, the derived silhouette score of 0.4 indicated the clusters were quite distinct.
In Figure 6, the taxonomy of clusters and keywords is displayed. The six clusters are shown in different colors and shapes, and for each cluster, the most important keywords are identified. One can describe each cluster based on its keywords. The number of shapes in each cluster represents the number of industries that have been grouped into that cluster.
As observed in Figure 6, the blue circle represents the top five key features in cluster 1. These are different from the other five clusters. Table 4 gives detailed information about each cluster, the key features, the number of industries, and the specific industries.
Besides common key features, such as gas, ghg, and climate, industries in different clusters have their own specific focus. For example, oil/gas manufacturing and consumption industries, such as oil and gas, transportation, and aerospace and defense are in cluster 1 with the unique key feature "oil." An eclectic group of industries, representing "climate" change, are in cluster 2. One could extrapolate that all industries are concerned about climate change. Indeed, several industries fall into this cluster. Cluster 3 is labeled as "energy," and the main industries in this cluster are electric power and gas utilities. Cluster 4 has the greatest number of industries and is associated with "impact," which has a couple of interpretations. One, companies are concerned about the impact of climate change and environmental degradation on their business and on the planet. Two, companies have been concerned in recent years with the effect of climate change and the environment on financial performance and shareholder value. In cluster 5, we see some association with "cost," likely related to the loss and damage incurred due to climate change and an adverse environment (e.g., the negative effect of extreme weather).
Heavy emission industries, such as automobile and coal mining, are in cluster 6 with the unique key feature "ghg." When compared to the "issues" available at ceres.com, climate change, carbon, and water are common topics, while this study surfaced "impact" (of sustainability) as a key topic.
As discussed previously, a text-based dataset contains rich yet complicated information compared to numerical datasets, so a data-preprocessing step was needed to tease the relevant information out of the lengthy text. Besides TF-IDF, several other textual analysis methods, such as sentiment analysis, POS tagging, and doc2vec, were also applied.
To gain insight into the regulatory aspects that companies mention in their sustainability disclosures, we identified the statutes typically cited in terms of compliance by extracting law-related phrases using spaCy (https://spacy.io/ (accessed on 1 September 2021)), a natural language processing library. This allows one to extract chunks of text by using POS tagging annotation or a phrase pattern. For example, if one wants to extract all the adjectives followed by a noun, one can specify {POS:ADJ},{POS:NOUN} and the API will return all the matched results. Specifically, we used the matcher engine called Rule Based Matcher to perform token-based matching to find descriptive words for "laws," "Act," and "regulations." We chose to extract words connected with "Act" because they were domain specific to law. For example, this gives "Clean Water Act" or "Federal Power Act" instead of vague law phrases, such as "related requirements" or "state law." A caveat is that, for phrases such as "state law," we did not know which state was meant. At first, POS tags were used to set the matching patterns.
However, we noticed that some specific statutes or act names were abbreviated or contained values, such as "10-K SEC Act," which could not be captured by the POS tagging patterns. Given that the descriptive words before "Act" always modify it in this case, another approach was taken: wildcard tokens were set as the matching pattern, which tells the matcher to search for and extract any two or three tokens that precede the word "Act." As a result of this method, one can locate all the sentences that contain "Act" and find out which act is of most concern and most frequently mentioned during the past 10 years.
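The wildcard idea can be sketched without any NLP library. The snippet below is a plain-Python stand-in, not spaCy's Rule Based Matcher: it tokenizes with a simple regular expression (an assumption; spaCy's tokenizer behaves differently) and returns every occurrence of "Act" together with a fixed number of preceding tokens. The function name `act_phrases` and the sample text are hypothetical.

```python
import re
from collections import Counter

# Words (optionally hyphenated, e.g. "10-K") or single punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9\-]*|[^\sA-Za-z0-9]")


def act_phrases(text, window=2):
    """Return each occurrence of 'Act' with the `window` tokens before it."""
    tokens = TOKEN_RE.findall(text)
    phrases = []
    for i, tok in enumerate(tokens):
        # The preceding tokens play the role of wildcards: any token matches.
        if tok == "Act" and i >= window:
            phrases.append(" ".join(tokens[i - window : i + 1]))
    return phrases


# Sample text (assumption): not taken from any actual disclosure.
text = (
    "We are subject to the Clean Water Act and the Clean Air Act. "
    "Our hydro assets are regulated under the Federal Power Act, "
    "and discharges must comply with the Clean Water Act."
)

# Frequency of each extracted statute name, as used for a bar chart.
counts = Counter(act_phrases(text, window=2))
```

Counting the extracted phrases gives the frequency ranking that a bar chart of most cited statutes would display. Calling the function with `window=3` would capture longer names at the cost of more noise, since a fixed-width wildcard window cannot tell where a statute name actually begins.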
In Figure 7, a bar chart shows the most cited statutes (acts) in the disclosure reports during the past 10 years. The top 30 federal and state statutes by frequency cited in the disclosures are displayed. The implication is that the higher the frequency of mention, the more relevant the statute is to sustainability-related activities. The extraction of the statutes is a novel contribution of this study.
Further, we were also interested in the types of greenhouse gases mentioned most frequently in the companies' disclosure reports during the past 10 years. A donut chart was developed by counting the number of occurrences of various types of greenhouse gases mentioned in the companies' disclosures and the proportional relationship between them and the total number of counts. This is shown in Figure 8.
The chart shows that carbon dioxide is the most frequently mentioned greenhouse gas, followed by methane. Companies appear to pay close attention to the emission of carbon dioxide and methane, followed by other gases. The fact that greenhouse gases are even mentioned in the disclosures indicates companies are becoming interested in addressing climate change caused by greenhouse gases.
To get a sense of how much a company's disclosure changes over time, the content variance for each company was calculated. First, the NLTK POS tagging package was used to extract the top 20 most frequent noun keywords from the companies' most recent and oldest disclosure reports, and then, cosine similarity was applied between the two vectorized lists. For example, the company Advance Auto Parts Inc. (ticker: AAP) has been providing climate-change-related disclosures consistently from 2011 to 2020. For the 10 years of disclosure content, the top 20 most frequent noun keywords of its disclosure report for 2011 and those in its report for 2020 were compared. The resulting content variance score is a similarity measure: a value of 1 means the two disclosures' top 20 keywords are the same, whereas a value of 0 means the two disclosures' top 20 keywords share nothing in common. The content variance for each company was calculated and the results visualized using a box-and-whisker plot for all sectors. Figure 9, a box-and-whisker graph, exhibits the content variance distribution across the different sectors of all companies that provided climate-change-related disclosures.
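The content variance calculation can be sketched as follows. How the two keyword lists were vectorized is not stated, so this sketch assumes binary presence vectors over the union of both lists; under that assumption, the cosine similarity reduces to the number of shared keywords divided by the geometric mean of the two list sizes. The function name `keyword_cosine` and the keyword lists are hypothetical, and the noun extraction with POS tagging is not shown.

```python
import math


def keyword_cosine(list_a, list_b):
    """Cosine similarity of two keyword lists encoded as binary vectors."""
    set_a, set_b = set(list_a), set(list_b)
    if not set_a or not set_b:
        return 0.0
    # Dot product of binary vectors = number of shared keywords;
    # the norm of each binary vector = sqrt(number of its keywords).
    overlap = len(set_a & set_b)
    return overlap / math.sqrt(len(set_a) * len(set_b))


# Toy keyword lists (assumption): four keywords each instead of 20.
oldest = ["gas", "emission", "risk", "cost"]
latest = ["gas", "emission", "energy", "water"]

score = keyword_cosine(oldest, latest)    # half of the keywords are shared
same = keyword_cosine(oldest, oldest)     # identical lists
disjoint = keyword_cosine(["gas"], ["water"])  # nothing in common
```

With equally sized lists, as in the top-20 comparison, the score is simply the fraction of keywords that the two reports share, which is consistent with the interpretation of the values 0 and 1 given above.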
Figure 9 shows that the median content variance scores for eight sectors were all around 0.45. Since its box is skewed further downward, the transportation and communications division tends to have a similarity score closer to 0 compared to other sectors. We concluded that the disclosure contents for companies in the transportation and communications sector vary more over time when compared to other sectors.
Doc2Vec similarity captures how similar a company's disclosure is to other companies' disclosure reports within the same industry. Doc2Vec (https://medium.com/wisio/a-gentle-introduction-to-doc2vec-db3e8c0cce5e (accessed on 1 September 2021)), which builds on Word2Vec (https://en.wikipedia.org/wiki/Word2vec (accessed on 1 September 2021)), is a tool for unsupervised learning of continuous representations of larger chunks of text, such as sentences, paragraphs, or even whole documents. This text analysis technique was applied to create a numeric representation of a document regardless of its length. The most recent disclosure report for each company was used as the input content. The Doc2Vec model was trained with all the disclosure reports and additional data, such as id and industry type, as additional vectors. The Doc2Vec score was calculated for each disclosure. If two disclosure reports within the same industry carry similar Doc2Vec scores, one can deduce that the content of these two reports is similar in terms of the expressions and words/phrases used. Examining the Doc2Vec score itself does not provide any insight.
However, if the variance of these scores is calculated, one can gain a sense of how varied the content is within a certain sector; Figure 10 displays this variance. The darker and larger the box, the larger the variance, and the more varied the content within that specific sector. As shown in Figure 10, the variance of Doc2Vec similarity scores was the highest in the wholesale trade sector, meaning the disclosure content of companies in the wholesale trade sector varies more from one company to another than it does in other sectors.

Disclosure Sentiment
The well-known and commonly used technique of sentiment analysis computationally identifies and categorizes opinions expressed in text. This helps especially to determine the author's stance toward a particular topic (https://en.wikipedia.org/wiki/Sentiment_analysis (accessed on 1 September 2021)). Given that one of the goals of this study was to explore the relationship between a company's risk propensity and climate change, it is useful to obtain insight into a company's position on regulatory, climate change, and renewable energy policies. VADER (Valence Aware Dictionary and Sentiment Reasoner) is a commonly used lexicon and rule-based sentiment analysis tool that is particularly geared toward the extraction of sentiment expressed in social media platforms and functions well on texts in other spheres (https://pypi.org/ (accessed on 1 September 2021)). The advantage of VADER is that it can be used to analyze large amounts of text data with a relatively high accuracy, sometimes even outperforming human raters (http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf (accessed on 1 September 2021)). It is sensitive to both the polarity (positive/negative) and the intensity of emotions. Besides this, VADER also has a basic understanding of the context of the text it reads. For example, "love" is a word that conveys positive meaning. VADER can recognize "did not love" as a negative statement because of its context awareness. It generates four scores for a given text: positive, negative, neutral, and compound. The negative sentiment scores were used in this study since they better reflect the challenge of climate-change-related risk to a company's business. The challenge faced in this analysis was to extract pieces of clean text containing the topics that we desired to evaluate, while discarding irrelevant data.
An approach that used NLTK sentence segmentation (a sentence tokenizer that splits a large disclosure corpus into sentences) was adopted. Each sentence was searched for the topics of interest, and sentiment analysis was then applied to the matching sentences. The overarching goal was to obtain the sentiment scores for environmental regulation, climate change, and renewable technology to gain insight into the companies' views on these three risk categories (excluding physical risk). Three sets of risk-category-related keywords were created (shown in Table 5), covering regulation, climate issue, and technology sentiments. The keywords related to regulation were scoped to include words such as "EPA" and "GHG." Three sets of sentiment analysis were conducted to capture companies' perspectives on the three risk categories. The keywords were selected from the top 30 frequent keywords that were mentioned in the 25,428 disclosures.
Figure 11 is a line chart showing the compound sentiment score for the regulatory-related aspects of company disclosures. As shown in the chart, the increasing trend in the wholesale trade industry implies companies in this sector have complied with laws and regulation more effectively during the recent year. The compound sentiment in the mining sector decreased from 2019 to 2020.
Figure 12 shows the compound sentiment score for the climate-change-related aspects of company disclosures. As shown in the chart, the compound sentiment score for agriculture disclosures dramatically decreased during recent years. This implies that the agricultural business has been severely affected by the recent escalation in climate change and that companies perceive climate change as a risk factor to their business.
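The sentence-level filtering that precedes the sentiment scoring can be sketched as follows. This is a simplified stand-in under stated assumptions: a naive regular-expression splitter replaces the NLTK sentence tokenizer, the keyword sets are illustrative (only "EPA" and "GHG" are named in the text; the other keywords are hypothetical and are not the contents of Table 5), and the sentiment scoring of the selected sentences is not shown.

```python
import re

# Illustrative keyword sets (assumption): NOT the actual contents of Table 5.
RISK_KEYWORDS = {
    "regulation": {"epa", "ghg", "regulation", "compliance"},
    "climate": {"climate", "weather", "warming"},
    "technology": {"renewable", "solar", "wind"},
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentences_by_category(text, keyword_sets=RISK_KEYWORDS):
    """Group sentences by the risk categories whose keywords they contain."""
    result = {name: [] for name in keyword_sets}
    for sentence in split_sentences(text):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        for name, kws in keyword_sets.items():
            if words & kws:  # at least one keyword of this category occurs
                result[name].append(sentence)
    return result


# Sample text (assumption): not taken from any actual disclosure.
sample = (
    "New EPA rules may raise our compliance costs. "
    "Extreme weather could disrupt operations. "
    "We opened two stores."
)
selected = sentences_by_category(sample)
```

Each list in `selected` would then be passed to the sentiment analyzer, so that every risk category receives its own score and sentences that match no category are discarded.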
Figure 13 shows the compound sentiment score for the renewable-energy-related aspects of company disclosures. As shown in the chart, the compound sentiment score in agriculture disclosures dropped during the year 2020. This is likely because the number of disclosures in the agricultural sector is low, so each disclosure carries a large weight in the average, which results in extreme values. We next turn our attention to topic modeling.

Topic Modeling (LDA)
The aim of latent Dirichlet allocation (LDA) is to assist in the classification of the content of disclosure reports and find the mix of topics that companies focused on. LDA was applied to the disclosure content to elicit the key topics addressed by the disclosures [26,27]. By assigning each disclosure to a cluster, the model can glean the hidden characteristics in the text and generate labels for the supervised learning model. Since unsupervised learning models have a high degree of uncertainty, two implementations were used to conduct the cross-validation: Gensim's LDA implementation (https://radimrehurek.com/gensim_3.8.3/models/ldamodel.html (accessed on 1 September 2021)), which uses the variational Bayes method, as model 1, and Mallet LDA (http://mallet.cs.umass.edu/ (accessed on 1 September 2021)), which is typically more precise than Gensim, as model 2. Perplexity and coherence scores were used as evaluation standards to compare the topic models. Each disclosure was assigned a dominant topic as its content label. This way, one would know which aspect the specific company's disclosure report focuses on. Based on this method, six key topics were identified and are described below. Additionally, Appendix A contains an overview of all 70 topics. It includes the most conspicuous terms in each topic as well as the probability of occurrence of these terms (i.e., how likely it is that the specific term appears in the topic).
To tune the models, the text was lemmatized, and bigrams and trigrams were formed to group related phrases into one token for the model. Moreover, to extract the most sustainability-related topics, English stop words and certain common word lists were removed from the disclosure texts. In tokenization, the sentences were split into words, all words were lowercased, and all punctuation was removed. A stop-word list was developed, and those words were removed as well.
Further, lemmatization and stemming were applied again to transform the words into their base form (e.g., "bought" and "buying" would be replaced with "buy") so they could be analyzed as single objects. Figure 14 provides a general understanding of the focus of topics before constructing the topic model. It shows the term frequencies of the most used words in all disclosure reports. One can observe that "emission," "gas," and "energy" or possibly "climate change" and other environment-related words are frequently mentioned in the disclosures.
To build the initial LDA model, the number of topics was set to 20, alpha to 0.01, and beta to 0.1; this model served as the baseline for performance comparison. The perplexity level and coherence scores were calculated for this purpose. The perplexity score measures how credibly the model represents or reproduces the statistics of the hold-out data; a lower value indicates that the model is less "surprised" by unseen text. The coherence score is a measure of the semantic similarity between the high-frequency words in a topic. It assesses the quality of the learned topics; therefore, the higher the score, the better the quality of the topics extracted. From a performance perspective, one would want to build a model with a low perplexity level and a high coherence score by tuning the number of topics and the other hyperparameters.
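To make the two evaluation measures more concrete, the following sketch shows one way each can be computed. Both parts rest on assumptions. First, it assumes that perplexity is derived from a per-word likelihood bound on a log2 scale; Gensim's `LdaModel.log_perplexity`, for instance, returns such a bound rather than the perplexity itself, which is one possible reason a reported "perplexity" can be negative. Second, the text does not state which coherence measure was used, so the UMass measure is shown purely for illustration because it needs only document counts. The toy documents are invented.

```python
import math
from itertools import combinations


def perplexity_from_bound(per_word_bound):
    """Convert a per-word log2 likelihood bound into a perplexity value."""
    return 2.0 ** (-per_word_bound)


def umass_coherence(top_words, documents):
    """UMass coherence of one topic; top_words must be ordered by rank."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every one of the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    # combinations() yields (higher-ranked, lower-ranked) pairs in list order.
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_freq(earlier, later) + 1) / doc_freq(earlier))
    return score


# Invented toy corpus of already-tokenized documents.
toy_docs = [
    ["emission", "gas", "carbon"],
    ["emission", "gas"],
    ["emission", "energy"],
    ["wind", "energy"],
]

coherent = umass_coherence(["emission", "gas"], toy_docs)
incoherent = umass_coherence(["emission", "wind"], toy_docs)
print(coherent, incoherent)            # words that co-occur score higher
print(perplexity_from_bound(-3.0))     # a bound of -3 corresponds to perplexity 8
```

Under this convention, a bound closer to zero corresponds to a lower perplexity, so the two numbers move in opposite directions when comparing models.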
The overall performance of the initial model is shown below:

Perplexity: −6.39885911951267
Coherence score: 0.4222068061701333

The initial model demonstrated reasonable performance, as indicated by the perplexity level, while the coherence score indicated that the model needed additional refinement. Tuning is a way to maximize the performance of a model without overfitting or producing a large variance. A series of sensitivity tests was conducted to determine the number of topics (K), the Dirichlet hyperparameter alpha (document-topic density), and the Dirichlet hyperparameter beta (word-topic density). The tests were performed sequentially: one parameter was examined at a time while the others were held constant. Figure 15 shows the coherence score for the number of topics across one epoch of the validation set with a fixed alpha = 0.01 and beta = 0.1. It plots the coherence score against the number of topics as the number of topics ranged from 2 to 15. We observed that while the coherence score was volatile as the number of topics changed, it had the best value when the number of topics was 6. As the coherence score appeared to decrease after 6 topics, K = 6 was chosen.
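The sequential sensitivity testing described above, in which one parameter is varied while the others are held constant, can be sketched as follows. This is a schematic under stated assumptions, not the code used in this study: `toy_coherence` is a hypothetical stand-in for "train an LDA model and return its coherence score", constructed only so that the example has a clear maximum, and the alpha and beta grids are illustrative values that do not come from the text.

```python
def sweep_one_at_a_time(score_fn, baseline, grids):
    """Tune parameters sequentially, holding all others at their current best."""
    best = dict(baseline)
    history = {}
    for name, candidates in grids.items():
        scores = {}
        for value in candidates:
            # Change only the parameter under test; keep the rest fixed.
            trial = dict(best, **{name: value})
            scores[value] = score_fn(**trial)
        best[name] = max(scores, key=scores.get)
        history[name] = scores
    return best, history


def toy_coherence(num_topics, alpha, beta):
    """Hypothetical score function (not real results); peaks at 6 topics."""
    return (0.55 - 0.01 * abs(num_topics - 6)
            - abs(alpha - 0.01) - 0.1 * abs(beta - 0.1))


baseline = {"num_topics": 20, "alpha": 0.01, "beta": 0.1}
grids = {
    "num_topics": range(2, 16),           # 2 to 15 topics, as in the text
    "alpha": [0.01, 0.31, 0.61, 0.91],    # illustrative grid (assumption)
    "beta": [0.01, 0.1, 0.31, 0.61, 0.91],  # illustrative grid (assumption)
}
best, history = sweep_one_at_a_time(toy_coherence, baseline, grids)
print(best)
```

A sequential sweep is cheaper than a full grid search because the number of trained models grows with the sum rather than the product of the grid sizes, at the cost of possibly missing interactions between parameters.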
The evaluation metrics of the optimal model are shown below.

Perplexity: −7.196056573358106
Coherence score: 0.5500174631782340

In further comparing model 1 (Gensim's LDA) to model 2 (mallet LDA), the perplexity level and coherence scores showed nearly identical performance with similar keywords in each topic. Therefore, to scope the analysis, only the results from model 1 (Gensim) were examined further. The results in Table 6 were generated with this LDA model. Six topics were extracted along with their corresponding terms. On examination of the terms, each topic was assigned a label that most reflected the terms within. These topics and terms are indicative of the sustainability issues that companies are typically concerned with. Label categories to some extent reflect the companies' emphasis on a certain topic. The six key topics are gas emission, carbon risk, climate change, loss and damage, renewable energy, and financial impact. When compared to the "issues" at ceres.org, the relatively common topics were climate change and carbon risk. The other topics identified in this study included gas emission, loss and damage, renewable energy, and financial impact. Financial impact was a major topic identified in this study. This is consistent with the extant literature that has emerged on the association between sustainability and company/financial performance [86,87].

Figure 16 displays the topic distance map; the area within the circle characterizes the importance of each topic over the entire corpus, and the distance between the centers of the circles describes the similarity between the topics. The topic distance map was drawn using the built-in function of the pyLDAvis package to visualize the volume of topics and the keywords contained in the topic [86,87]. To the left, the different-size bubbles represent the topics; the larger the bubble, the more company disclosures fall within that topic.
The distance between the topics approximates the extent of the semantic relationship between the topics, and if the topics share common keywords, the bubbles will overlap (in proximity). The right part of Figure 16 is a bar chart showing the relative importance of each term in topic 1. The red-shaded area describes the frequency of each term in each topic, while the blue bar shows the frequency distribution of the terms in all disclosures. Topic 1 is the largest topic in all the disclosures put together; the top 30 keywords in the right bar chart represent the most frequent keywords in this topic group. Studying the keywords in Table 6 and Figure 16, topic 1 seems to have disclosures related to greenhouse gas emissions and corresponding legislative action. Topic 1 also overlaps with topic 4 by sharing a few keywords. Topic 6 has a large distance from the other topics. This makes sense as topic 6 describes the financial impact aspect of sustainability. As mentioned previously, the related keywords describe issues different from the keywords of the other traditional sustainability-related topics. Topics 1, 2, 3, and 4 are in proximity and collectively describe environment-related topics, since they share related keywords. Among the environmental topics, topic 1 focuses more on greenhouse gas emissions, and topic 2 pays more attention to carbon-related emissions affecting sustainability. Topic 3, however, shows a specific focus on climate change correlated with weather and risks. Topic 4 contains more natural-disaster-related keywords, while topic 5 deals with renewable energy technologies. Topic 6, of course, highlights the financial impact [88,89]. This topic is an interesting finding of this study.

Figure 16. LDA model visualization [86,87].

Figure 17 is a highlight table showing the count of companies within different sectors that focus predominantly on specific topics. As seen, each sector has its own focus. For example, manufacturing emphasized topic 2 (carbon risk) and topic 3 (climate change), while finance, insurance, and real estate focused on topic 4 (loss and damage). However, these are not mutually exclusive. Every sector is associated with each of the six topics. Attention must be paid to the total number of companies in each sector when reading the table, as the raw frequency counts are not directly comparable across sectors of different sizes.
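A sector-by-topic count of this kind can be derived from per-disclosure topic weights, as in the minimal sketch below. The records are hypothetical and only mimic the pattern described in the text; they are not data from this study. The topic order follows the labels given above (topic 1 to topic 6). The sketch also computes within-sector shares, one simple way to account for differing sector sizes.

```python
from collections import Counter, defaultdict

TOPIC_LABELS = ["gas emission", "carbon risk", "climate change",
                "loss and damage", "renewable energy", "financial impact"]

# Hypothetical records: (sector, topic weights for topics 1-6).
disclosures = [
    ("Manufacturing", [0.10, 0.50, 0.20, 0.05, 0.10, 0.05]),
    ("Manufacturing", [0.05, 0.15, 0.60, 0.05, 0.10, 0.05]),
    ("Finance, Insurance, and Real Estate", [0.05, 0.05, 0.10, 0.60, 0.05, 0.15]),
    ("Finance, Insurance, and Real Estate", [0.05, 0.05, 0.10, 0.55, 0.05, 0.20]),
]


def dominant_topic(weights):
    """Return the index of the topic with the largest weight."""
    return max(range(len(weights)), key=weights.__getitem__)


# Count how often each topic is dominant within each sector.
counts = defaultdict(Counter)
for sector, weights in disclosures:
    counts[sector][TOPIC_LABELS[dominant_topic(weights)]] += 1

# Convert raw counts to within-sector shares so sectors of different
# sizes can be compared on the same scale.
shares = {
    sector: {topic: n / sum(topic_counts.values())
             for topic, n in topic_counts.items()}
    for sector, topic_counts in counts.items()
}
print(dict(counts))
print(shares)
```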

Discussion
Our research applied topic modeling and other text analytic methods to approximately 25,428 disclosure reports filed with the SEC in the years 2011-2020 to identify key climate change and environmental issues companies are concerned with. We identify the six key topics as gas emission, carbon risk, climate change, loss and damage, renewable energy, and financial impact. These topics and terms are indicative of the sustainability issues that companies are typically concerned with. Financial impact is a major topic identified in this study. This is consistent with the recent literature that has emerged on the association between ESG/sustainability and company financial performance. Specifically, the analysis indicates carbon dioxide is the main source of greenhouse gases, followed by methane. Companies appear to pay close attention to the emission of carbon dioxide and methane, followed by other gases. The fact that greenhouse gases are even mentioned in the disclosures indicates that companies are becoming interested in addressing greenhouse-gas-caused climate change. The study also finds that besides common key features, such as "gas," "ghg," and "climate," industries in different clusters have their own specific focus. For example, oil/gas manufacturing and consumption industries, such as oil and gas, transportation, and aerospace and defense, are described by a unique key feature "oil." An eclectic group of industries is characterized by "climate change." One could extrapolate that all industries are concerned about climate change. "Energy" describes the primary industries of electric power and gas utilities. Interestingly, a large number of industries are associated with "impact." Overall, companies are concerned about the impact of climate change and environmental degradation on their business and on the planet. In addition, companies have been concerned in recent years with the effect of climate change and the environment on financial performance and shareholder value.
Some industries are associated with "cost," likely related to the loss and damage incurred due to climate change and an adverse environment (e.g., the negative effect of extreme weather). Heavy emission industries, such as automobile and coal mining, are characterized by "ghg." Generally, our study indicates that the number of companies providing disclosures related to climate change has been relatively stable during the 10 years of this study and that most of the disclosures were typically provided by publicly traded companies.

Scope and Limitations
From a research method perspective, the reproducibility and replication of results are some of the major challenges related to applying machine learning and text analytics to unstructured data. Additionally, despite algorithmic modeling, the intuitive naming of clusters (topics) limits our conclusions somewhat. Still, based on comparisons with other studies, we are confident the findings here are consistent with those from prior research. Thus, the methods developed and applied here lend themselves to future studies in this area. Additional limitations extend to the reliability of disclosure data and/or the validity of data engineering methods. In addition, this is a data-driven study wherein extracted disclosures were subject to machine-learning-based text analytics. Therefore, our research background that summarizes contemporary studies in sustainability-related topics is not subject to a formal method to identify the literature. Rather, several searches were conducted, including in Scopus, to ensure recency of cited studies.
There are limits to the generalizability of the clusters (topics), particularly when examining purely voluntary disclosures and not those of all registered companies. The disclosures may not sufficiently reflect overall corporate initiatives or efforts relating to sustainability. Further, ML algorithm models can only extract limited insight, being constrained by their assumptions. Additionally, this study used the disclosure tool at ceres.org to extract the information based on the assumption that ceres.org accurately gleans sustainability information from various corporate filings with the SEC.

Conclusions and Future Research
This exploratory study attempted to shed light on climate-change- and environment-related sustainability issues of companies by studying the sustainability-related disclosures submitted to the SEC in various filings. Our study provided a comprehensive overview of the key climate change and environment topics and issues that companies are concerned with in terms of sustainability. In addition to the dimension-level analysis, the study provided an overview of corporate sustainability disclosure at the sector and industry levels. The research objective was met by the identification of six key clusters using the K-means method and six main topics using the LDA method that related to the disclosure of climate-change-related environmental concerns. As discussed before, both methods produced results that are generally overlapping. The overlapping results further reinforce and enhance our understanding of climate-change-related disclosure at various levels, such as sector, industry, and topic. Our analysis shows that companies are concerned primarily with the topics of gas emission, carbon risk, climate change, loss and damage, renewable energy, and financial impact. While a wider range of topics, such as CSR and ESG, may be included in sustainability, the sustainability disclosures examined in this study predominantly focused on climate-change- and environment-related issues only. However, many other issues necessitate further research. For example, this study found hardly any information about social, governance, ethics, or human rights topics [90-92]. However, recent research is developing new models of sustainability applications and practices and alternative cost-benefit and performance models. For example, Novoselov et al. (2022) discuss social investment approaches versus traditional compensation models in the sustainable development of the Arctic [93,94]. They argue for a "social orientation" that considers the aspirations of the native population.
In the case of the Arctic, this implies the development of contemporary social infrastructure indigenous to the polar region; energy-conscious and environmentally appropriate dwellings; the promotion of local business; the nurturing of local tourism; and the protection of cultural traditions and the conservation of the ethnic groups of the Arctic [93,94]. Rather than focusing on firm-level financial performance, the authors [93,94] suggest that the industrialization of the Arctic should focus on quality of life by enabling the creation of jobs, so that people do not have to migrate, can live in their own communities, and experience less poverty and disease [91,92]. Future research should investigate the linkages between corporate CSR, ESG, and sustainability in the context of the SDGs.
One may also investigate how textual documents such as SEC filings can be used to conduct predictive analytics to get a handle on direction and action. Sentiment analysis can be deployed to explore stakeholders' attitudes and shifts in the long run. The effect of disclosures on the community in general and stakeholders (e.g., climate change activists) can be analyzed from the perspective of social media (public sentiment analysis). Further, as recently emerging studies indicate, the effect of good sustainability practices on corporate performance (e.g., financial, reputation) is another area of potential study. Notwithstanding the limitations, our study contributes to sustainability policy and research in several ways. First, we used data on disclosures published as recently as 2020 and looked at a time frame that includes multiple years. Second, we analyzed a significantly large number of disclosures. By extending the time frame and number of disclosures, we extended the coverage to many divisions, industries, and sectors. This also enabled us to explore inter-industry/inter-sector comparisons. Third, we contributed to the advancement of the ever-increasing application of machine learning and text analytics to understand large corpuses of text data, namely the disclosures. Text analysis offers the novel ability to deep-dive into text data without assuming a pre-defined list of terms. Fourth, our study analyzed corporate sustainability from the company's perspective (i.e., management).
Companies themselves, as well as watchdogs, nonprofits and NGOs, regulatory bodies, and others, will be able to use the findings of this study to make informed decisions, leading to improved sustainability practices. Future studies can continue to analyze disclosure data using alternative ML techniques, such as deep learning, to obtain more robust and richer analysis. For example, prescriptive and discovery analytics can be explored for not just predicting likely outcomes of practices but also developing creative solutions to climate change and other problems.
Other future research directions encompass exploring the phenomenon of sustainability disclosure across industries or across the globe and analyzing the association between sustainability and company performance. Research into sustainability is still at a nascent stage, but rapid advances in analytical platforms and tools can accelerate the maturing process.
Author Contributions: All the authors (W.R., S.J.W. and V.R.) contributed equally to conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing, visualization, supervision, and project administration. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: The data will be made available upon request to the corresponding author.

Conflicts of Interest: The authors declare no conflict of interest.

Table A1. Industry distribution of companies with disclosures in 2020.

Industry-Division
Count of Industry # of Companies