Article

Directed Topic Extraction with Side Information for Sustainability Analysis

School of Business and Economics, Hochschule für Wirtschaft und Recht Berlin, Badensche Strasse 52, 10825 Berlin, Germany
Analytics 2024, 3(3), 389-405; https://doi.org/10.3390/analytics3030021
Submission received: 6 June 2024 / Revised: 8 July 2024 / Accepted: 3 September 2024 / Published: 11 September 2024
(This article belongs to the Special Issue Business Analytics and Applications)

Abstract

Topic analysis represents each document in a text corpus in a low-dimensional latent topic space. In some cases, the desired topic representation is subject to specific requirements or guidelines constituting side information. For instance, sustainability-aware investors might be interested in automatically assessing aspects of firm sustainability based on the textual content of its corporate reports, focusing on the established 17 UN sustainability goals. The main corpus consists of the corporate report texts, while the texts containing the definitions of the 17 UN sustainability goals represent the side information. Under the assumption that both text corpora share a common low-dimensional subspace, we propose representing them in such a space via directed topic extraction using matrix co-factorization. Both the main and the side text corpora are first represented as term–context matrices, which are then jointly decomposed into word–topic and topic–context matrices. The word–topic matrix is common to both text corpora, whereas the topic–context matrices contain specific representations in the shared topic space. A nuisance parameter, which allows us to shift the focus between the error minimization of individual factorization terms, controls the extent to which the side information is taken into account. With our approach, documents from the main and the side corpora can be related to each other in the resulting latent topic space. That is, the corporate reports are represented in the same latent topic space as the descriptions of the 17 UN sustainability goals, enabling a structured automatic sustainability assessment of the textual report’s content. We provide an algorithm for such directed topic extraction and propose techniques for visualizing and interpreting the results.

1. Introduction

Sustainable investments have gained significant attention in recent years as both investors and corporations recognize the long-term benefits of integrating environmental, social, and governance (ESG) factors into financial decision-making.
Friede et al. [1] and Eccles et al. [2] provide empirical evidence that companies with strong ESG practices are better positioned to manage risks, adapt to regulatory changes, and capitalize on emerging opportunities, leading to their potentially superior long-term financial performance. These findings highlight the immense importance of transparent ESG assessments.
As the demand for sustainable investments grows, the need for uniform and yet flexible standards, enabling investors to make more informed and reliable comparisons of firms’ sustainability involvements, becomes increasingly critical. The varying definitions and metrics of sustainability can lead to confusion and misinterpretation. Currently, the landscape of sustainability assessments is fragmented, with various frameworks and rating agencies using different criteria and methodologies to evaluate ESG performance. Berg et al. [3] and Chatterji et al. [4] point out the disagreements among these ratings across different rating agencies and their low validity. Such inconsistent sustainability assessments can have profound implications for global sustainability and policies, such as misallocation of investor capital, greenwashing, increasing market volatility, and ineffectiveness of regulatory incentives due to mixed signals (see Chatterji et al. [4] and Berg et al. [3]). In this situation, it is difficult to generate an overview of the ESG development of potential investment firms and make an informed investment decision.
Yet, for about a decade, annual textual sources with ESG-relevant information, where large companies communicate their ESG-related strategies and actions, have been freely available to all types of investors in the form of
  • Corporate responsibility reports;
  • Sustainability reports;
  • Environmental action reports.
Similar sustainability-related publications appear under various other names. The intention of such reporting is to increase the transparency and accountability of ESG-related company actions for stakeholders, as noted by Soh [5]. We refer to Gillan et al. [6], who provide historical context and summarize key developments and trends in the field of ESG-related corporate reporting.
Aureli et al. [7] show that sustainability-related disclosures have a significant impact on a company’s value. The contained information influences investor reactions and subsequent pricing, highlighting the increasing importance of incorporating this textual information into sustainability analyses. In this regard, sustainability reporting provides a valuable publicly available textual information source, which can be useful for obtaining ESG ratings. Consequently, the analysis of sustainability-related texts has garnered significant attention from researchers.
To address the specifics of these texts, authors primarily depend on hand-crafted concepts and keywords. Liew et al. [8], for example, use word and phrase frequencies to extract common trends and their importance from sustainability reporting based on sustainability content trees. Using their five content categories and the associated keywords, Landrum and Ohsowski [9] performed a content analysis of sustainability-related corporate reports based on the proportion of contained keywords. Tsalis et al. [10] utilized a scoring system based on disclosure topics from the Global Reporting Initiative to assess the alignment of sustainability-related reports with the 17 UN Sustainable Development Goals (SDGs) (https://sdgs.un.org/goals, accessed on 8 August 2023). Adopted in 2015 by the UN General Assembly, the SDGs represent an intergovernmental set of goals addressing major environmental and social challenges. The aim is to structure information from textual sustainability reports in a way that allows for a meaningful comparison of companies’ contributions to addressing these challenges. The authors developed a catalog of disclosure topics related to the SDGs, leveraging their expertise and previous research. They then used this catalog to manually assign scores to each report, aggregating the scores to provide a comprehensive evaluation.
A common drawback of these works is the extensive use of human expertise in the analysis, which reduces the objectivity of the results on one hand and is time-consuming on the other.
To overcome this limitation, Kang and Kim [11] proposed a fully automated approach to assess textual information in sustainability reports using the SDGs as a reference. They employed a sentence similarity method to evaluate the reports’ relatedness to the goals. However, their approach was computationally intensive because each sentence in a report had to be compared to each sentence of an SDG text. Additionally, it does not offer a transparent, low-complexity representation, such as one in a low-dimensional topic space. The authors explained that they rejected classical word frequency-based topic analysis because it cannot incorporate any “predefined theme structure”.
In this paper, we address these limitations by proposing a topic analysis method using co-matrix factorization, which integrates any predefined structure into the analysis. Our method automatically extracts topics from textual sources on sustainability while considering the value system established by the 17 SDGs. This provides a low-dimensional topic representation that is convenient for assessing the association between the SDGs and sustainability reports, ensuring both objectivity and flexibility in ESG assessment.
Topic analysis, a commonly used technique for structuring text data, represents each document in a low-dimensional latent topic space (Churchill and Singh [12]). Popular classical methods include Latent Semantic Analysis (Deerwester et al. [13], Hofmann [14]), Latent Dirichlet Allocation (LDA, Blei et al. [15]), and general-purpose dimension reduction methods like Non-negative Matrix Factorization (NMF, Lee and Seung [16], Vangara et al. [17]), along with their extensions (e.g., Yang and Li [18], Suleman and Korkontzelos [19], Figuera and García Bringas [20]). Recently, deep neural network-based models have also been proposed (Zhao et al. [21]).
Topic extraction for structuring text data has been extensively used in the financial literature. For instance, Li et al. [22] employ LDA to structure financial stability reports, while Chen et al. [23] compare Principal Component Analysis, NMF, LDA, and deep learning models for text analytics in banking. Amini et al. [24] perform automatic topic extraction using common methods specifically for sustainability-related reports. Chen et al. [25] use LDA and neural network-based models to analyze the impact of news on financial markets. For a comprehensive review of text mining and topic analysis in the finance literature, refer to Loughran and McDonald [26] and Gupta et al. [27]. Despite the popularity of LDA, Chen et al. [28] and Egger and Yu [29] argue that NMF can outperform LDA by extracting interpretable topics, especially for short texts. Since our approach involves segmenting the reports into small context pieces, NMF is a promising technique. Additionally, Nugumanova et al. [30] highlight the advantages of NMF-based methods for efficiently extracting domain-specific terms, which is relevant for our sustainability-focused task.
Recently, several LDA-based topic extraction methods that explicitly embed known structures or side information have been proposed. For instance, Harandizadeh et al. [31] combined word2vec embeddings with LDA and vocabulary priors to obtain interpretable word embeddings. Similarly, Eshima et al. [32] embedded prespecified keywords in LDA, and Watanabe and Zhou [33] used seeded LDA with a carefully chosen seeded vocabulary to classify documents into specific categories. These approaches incorporate additional information into topic extraction. However, they require manual intervention for specifying keywords or vocabulary, which can be a drawback. Additionally, Harandizadeh et al. [31] rely on word vectors from a pretrained general-purpose word2vec model, making it unclear whether their model is effective for specific domains like sustainability reports.
On the other hand, some matrix factorization-based approaches integrate side information into dimension reduction. Rao et al. [34] and later Zhang et al. [35] propose integrating side information using graphs, creating a graph-regularized version of matrix factorization with an associated alternating algorithm. However, their side information is not high-dimensional and incorporates only a few individual characteristics forming the basis for the graph links. High-dimensional additional information can be considered through matrix co-factorization techniques, which factorize two or three matrices with common cofactors simultaneously. For example, Fang and Si [36] consider user community information, and Luo et al. [37] incorporate tagging and timestamps of ratings in their personalized recommendations via matrix co-factorization. This approach is transparent, easily adjustable, and ensures flexibility by introducing a nuisance parameter that balances error minimization of individual factorization terms.
In this paper, we propose a topic model based on non-negative matrix co-factorization (NMCF) to extract sustainability-related topics from textual sources using the 17 UN goals as side information. Our approach offers fully automated topic extraction without manual keyword searches, interpretability, adaptability through the nuisance parameter $\lambda$, and a simple, scalable implementation.
This paper is structured as follows. In the next section, we explain the methods used and derive the non-negative matrix co-factorization algorithm for topic extraction with side information. We also introduce the data in the form of sustainability-related reporting and the 17 UN goals and describe our preprocessing steps. The results of the application of our algorithm to the data follow. Finally, we conclude and discuss future research directions.

2. Data and Methods

In this section, we introduce our data basis and describe the preprocessing steps. Building upon non-negative matrix factorization and matrix co-factorization, we present our method of non-negative matrix co-factorization as a combination of these techniques and derive the corresponding algorithm for topic extraction with side information.

2.1. Data and Preprocessing

Large listed companies regularly disclose their sustainability-related actions through corporate responsibility reports, sustainability reports, or similar publications. These reports aim to enhance transparency and raise sustainability awareness within the companies. Typically, they contain numerous pages of concise messages about sustainability actions, accompanied by related images.
Billio et al. [38] found that the divergence of ESG ratings is largest in the Communication Services and Information Technology sectors. (For instance, Sustainalytics’ ESG-rating scores for AAPL and DELL are in the upper segment, whereas MSCI assesses AAPL as BBB (medium segment) and DELL as A (upper segment). Consequently, MSCI’s comparative ESG rating prefers DELL, while Sustainalytics rates both firms approximately equally. Sustainalytics: https://www.sustainalytics.com/esg-rating/, accessed on 4 July 2024; MSCI: https://www.msci.com/our-solutions/esg-investing/esg-ratings-climate-search-tool/, accessed on 4 July 2024.) We therefore selected the top eight listed tech companies—AAPL, AMZN, DELL, GOOG, IBM, INTC, MSFT, and SSU—as the basis for our analysis. We downloaded the reports as PDF files from the companies’ websites. The associated time period spans from 2011 (or later) to 2022, depending on the availability of the reports, with 2011 chosen as the starting point because it is the earliest year for which a report is available. The following reports were missing: 2011–2014 for MSFT and SSU, 2011–2015 for GOOG, and 2011–2017 for AMZN, IBM, and DELL. If multiple reports covered different ESG aspects for the same firm, they were merged into a single report. All available reports were preprocessed to form the main text corpus of the sustainability reports, with entities or contexts corresponding to individual pages to facilitate further analysis.
Our side information consists of the texts of the 17 UN SDGs, which we obtained from the UN website (https://sdgs.un.org/goals accessed on 8 August 2023). The entities in the side information text corpus are the descriptions of the individual goals.
We first structured our text corpora using a bag-of-words approach (with two-grams as terms) and constructed term–context representations with a pooled vocabulary (all calculations were performed in R (R Core Team [39])). For word-level preprocessing, we used the R package Quanteda (Benoit et al. [40]) to set up a corpus, tokenize it, and compute the relative frequencies. We omitted terms with relative context frequencies lower than 0.5% or higher than 50%, as such terms are either too rare or too frequent to be informative for the analysis.
In the next step, we merged the terms into a unified dictionary, ensuring our bag-of-words representation encompassed all relevant term frequencies. Following preprocessing, we generated two term-context matrices: one representing the corpus of sustainability reports and another representing the structured texts of the UN SDGs. The structured data dimensions are as follows: 5031 report contexts, 17 SDG contexts, and a shared dictionary of 2841 terms.
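The preprocessing pipeline above can be sketched in Python (the paper's own computations were done in R with Quanteda; all function names here are ours and purely illustrative). The sketch builds a pooled two-gram vocabulary with the document-frequency filter and the resulting term–context matrices of relative frequencies:

```python
from collections import Counter

def bigrams(text):
    """Tokenize on whitespace and return the list of two-grams."""
    toks = text.lower().split()
    return [" ".join(pair) for pair in zip(toks, toks[1:])]

def pooled_vocab(contexts, lo=0.005, hi=0.5):
    """Pooled vocabulary: keep two-grams whose context frequency
    (share of contexts containing the term) lies in [0.5%, 50%]."""
    df = Counter()
    for ctx in contexts:
        df.update(set(bigrams(ctx)))
    n = len(contexts)
    return sorted(t for t, d in df.items() if lo <= d / n <= hi)

def term_context_matrix(contexts, vocab):
    """Rows = terms of the pooled vocabulary, columns = contexts;
    entries are relative two-gram frequencies within each context."""
    index = {t: i for i, t in enumerate(vocab)}
    mat = [[0.0] * len(contexts) for _ in vocab]
    for j, ctx in enumerate(contexts):
        counts = Counter(bigrams(ctx))
        total = sum(counts.values()) or 1
        for term, c in counts.items():
            if term in index:
                mat[index[term]][j] = c / total
    return mat
```

Applied to both corpora with a single pooled vocabulary, this yields the two matrices (here 2841 terms, 5031 report pages, and 17 SDG descriptions) that enter the co-factorization below.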
In the upcoming section, we present the main methods and detail our novel approach for targeted topic extraction using these structured representations of the two text corpora.

2.2. Non-Negative Matrix Co-Factorization for Sustainability Analysis

In this section, we begin by defining the topic decomposition problem and introducing the underlying techniques: non-negative matrix factorization and matrix co-factorization. Subsequently, we develop a new algorithm for non-negative matrix co-factorization tailored for topic extraction.
Our goal is to represent each text source in an interpretable low-dimensional latent topic space guided by side information. In the preprocessing step described earlier, we represented each text corpus as a term–document matrix M, organizing a collection of n documents or contexts over a dictionary of p terms. The element of M in the row of term j and the column of context i is the (weighted) frequency of the jth term in the ith context. This matrix captures the relationship between terms and contexts in a high-dimensional space.
In topic analysis, the resulting decomposition into context–topic and topic–term representations should yield non-negative entries. This characteristic allows for the interpretation of the weights as specific to terms and topics in their respective representations. Consequently, we can analyze topic proportions within documents and interpret topics by examining the highest-weighted words. To maintain the desirable non-negativity of the decomposition, it is essential to employ a matrix factorization method that ensures such a property. This requirement is fulfilled by non-negative matrix factorization.

2.2.1. Non-Negative Matrix Factorization

Non-negative matrix factorization (NMF), introduced in Lee and Seung [41], is a powerful technique used in various fields such as data mining, machine learning, and image processing, including topic extraction. NMF aims to decompose a given matrix M into two non-negative low-rank matrices, U and V, such that:
$$M \approx U V,$$
where M is a $p \times n$ matrix, U is a $p \times K$ matrix, V is a $K \times n$ matrix, and the elements of M, U, and V are non-negative. Typically, K is much smaller than p and n.
The core problem of NMF is to find matrices U and V that minimize the difference between M and $UV$, usually measured using the squared Frobenius norm $\|M - UV\|^2$. NMF’s resulting decomposition allows us to reveal hidden structures and patterns within the data, making it an invaluable tool for tasks like topic extraction in text mining. However, the inherent non-negativity constraints on U and V, coupled with the large size of data matrices, contribute to its high computational complexity.
In the topic extraction setting:
  • U is the term–topic matrix, where each column represents a topic, with the values indicating the contribution of each term to that topic. Terms with the highest values in a column are the most representative terms of that topic.
  • V is the topic–document matrix. The rank K of the factorization is chosen to represent the number of topics. The dominant values in a column show the main topics covered by the corresponding document.
By revealing the hidden thematic structure in text corpora, NMF provides a valuable tool for topic extraction, enabling deeper insights and more effective organization of textual data. NMF has been successfully applied in various text mining tasks, including news topic extraction (Xu et al. [42]), uncovering research topics in academic papers (Greene and Cunningham [43]), analyzing trends and topics in social media posts (Ma et al. [44]), and extracting common themes from customer reviews and feedback (O’Callaghan et al. [45]). It offers the following advantages:
  • Interpretability: Since NMF ensures that all elements in the matrices U and V are non-negative, the resulting topics and their representations are more interpretable. Each topic can be understood as an additive combination of terms.
  • Sparsity: NMF often produces sparse matrices, where many elements are zero or close to zero. This sparsity can lead to more distinct and easily interpretable topics.
Among the limitations are:
  • The need to select the number of topics: The NMF algorithm demands K (the number of topics) as input. Choosing the appropriate number of topics K is challenging and often requires domain knowledge or a data-driven procedure. Too few topics may result in overly broad and indistinct themes, while too many topics can lead to redundant or spurious topics.
  • Dependency on data representation: The effectiveness of NMF is highly dependent on the quality and nature of the term–document matrix. If the representation of the text data is not suitable (e.g., inadequate tokenization, poor choice of weighting scheme), the resulting topics may be less meaningful. Proper preprocessing and representation of the data are critical, but this dependency adds another layer of complexity to the process.
  • Scalability: NMF can be computationally intensive, particularly for large-scale datasets. The exact solution of NMF is known to be NP-hard, making it computationally infeasible for large datasets (Vavasis [46]). Consequently, practical approaches focus on approximate solutions using iterative algorithms. The corresponding algorithms, such as multiplicative update rules and various algorithms based on Alternating Least Squares (ALS), differ in their computational complexity and convergence properties (Cichocki and Phan [47]). For instance, the multiplicative update rules, proposed by Lee and Seung [41], involve iterative element-wise operations and matrix multiplications, with a per-iteration complexity of $O(Knp)$. In contrast, non-negative ALS, which alternates between solving non-negative least squares problems for U and V, has a higher per-iteration complexity due to the need to solve linear systems (see Cichocki and Phan [47]). Despite the higher complexity, ALS often converges faster to a local minimum. A modification of the latter, hierarchical ALS (HALS, introduced in Cichocki et al. [48]), combines a comparably low per-iteration complexity of $O(Knp)$ (as noted by Hautecoeur et al. [49]) with even better convergence properties (see, e.g., Gillis and Glineur [50]). Thus, the HALS algorithm offers a computationally efficient and easily implemented basis for our topic extraction algorithm proposed below.
We address the first two limitations of NMF by introducing a data-driven procedure for determining the number of topics and the weighting scheme, and enhance scalability by selecting the HALS algorithm, known for its manageable computational complexity and rapid convergence.
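For concreteness, the multiplicative update rules of Lee and Seung mentioned above can be sketched in a few lines of NumPy (an illustrative sketch, not the HALS variant ultimately used in this paper, and not the paper's R implementation):

```python
import numpy as np

def nmf_multiplicative(M, K, iters=200, seed=0, eps=1e-9):
    """Lee-Seung multiplicative updates for M ~ U V, with M (p x n),
    U (p x K), V (K x n); all entries stay non-negative because each
    update multiplies by a non-negative ratio."""
    rng = np.random.default_rng(seed)
    p, n = M.shape
    U = rng.random((p, K))
    V = rng.random((K, n))
    for _ in range(iters):
        # element-wise updates; each iteration costs O(Knp)
        V *= (U.T @ M) / (U.T @ U @ V + eps)
        U *= (M @ V.T) / (U @ V @ V.T + eps)
    return U, V
```

On a term–context matrix, the columns of U with the largest entries give each topic's most representative terms, and the columns of V give each document's topic mix.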
Using NMF with the HALS algorithm provides us with a topic representation of the report text corpus without any predefined structure from the SDGs side information. However, to establish a topic structure aligned with the information found in the descriptions of the SDGs, we need an algorithm capable of incorporating and extracting this common topic structure. To achieve this objective, we propose integrating NMF with matrix co-factorization.

2.2.2. Matrix Co-Factorization

Matrix co-factorization (Koren et al. [51], Fang and Si [36]) is a useful technique in data analysis, typically used in scenarios where multiple data matrices share some common structure. This approach extends the concept of matrix factorization, which aims to decompose a single matrix into a product of two lower-dimensional matrices, to the simultaneous decomposition of multiple matrices. The goal of matrix co-factorization is to find latent factors that can effectively capture the underlying relationships across different data sources, enabling more robust and accurate data analysis.
In practical terms, matrix co-factorization involves decomposing two matrices M and C into a common factor matrix U and individual factor matrices V and Q such that $M \approx UV$ and $C \approx UQ$. The shared latent space idea allows for the integration of heterogeneous data, leveraging the complementary information contained in each matrix. The co-factorization methodology proceeds by minimizing the joint loss function:
$$\min\big( \|M - UV\|^2 + \lambda\, \|C - UQ\|^2 \big),$$
balanced by the nuisance parameter λ . By minimizing the joint loss, matrix co-factorization extends matrix factorization to simultaneously decompose multiple related matrices, revealing shared latent structures across different data sources.
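Read literally, this objective is easy to evaluate; the following small NumPy helper (illustrative, not from the paper) computes the joint loss for given factors:

```python
import numpy as np

def joint_loss(M, C, U, V, Q, lam):
    """Co-factorization objective: ||M - U V||_F^2 + lam * ||C - U Q||_F^2."""
    return (np.linalg.norm(M - U @ V, "fro") ** 2
            + lam * np.linalg.norm(C - U @ Q, "fro") ** 2)
```

Any co-factorization algorithm can be monitored by checking that this quantity decreases across iterations.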
In a co-factorization framework designed for topic extraction, we expect both text corpora to share a common low-dimensional topic representation U. This representation allows us to compare them using a similarity measure, thereby achieving a topic structure aligned with the information in the SDGs.
Despite its strengths, matrix co-factorization faces several limitations that can affect its performance and practicality. These include higher computational complexity from simultaneous factorization of multiple matrices, scalability challenges, interpretability issues, lack of sparsity, and issues with data imbalance (Fang and Si [36]). To address data imbalance, we adjust the nuisance parameter λ in a data-driven manner to balance the combined loss function.
Furthermore, integrating NMF into a co-factorization framework, as proposed in the following subsection, harnesses the strengths of both methods, leading to a promising algorithm.

2.2.3. Non-Negative Matrix Co-Factorization

We now combine the aforementioned methods and introduce Non-negative Matrix Co-factorization (NMCF) for extracting directed topic representations. In contrast to the unrestricted matrix co-factorization, NMCF constrains the elements of U, V, and Q to be non-negative. We develop a HALS-based algorithm tailored for this non-negative co-factorization procedure within our topic extraction framework. This approach enhances our ability to harness the full potential of matrix co-factorization in revealing insights from complex, multi-source data, while ensuring the topic representations remain interpretable through NMF.
Specifically, we define the following model for the term–document matrices derived from the reports and sustainability goals texts:
M = U V + E
and
C = U Q + F
where
  • M is the (weighted) term–context matrix for the corporate reports with dimensions $( p \times n )$, where p is the size of the joint vocabulary (words and phrases of two co-occurring words) obtained from both the reports and the sustainability goals texts, and n is the number of corporate report contexts, each corresponding to one page of a corporate report.
  • C is the (weighted) term–context matrix for the sustainability goals with dimensions $( p \times m )$, where p is again the size of the joint vocabulary and m is the number of sustainability goal contexts, one for each of the 17 goals.
  • U is the term–topic representation matrix with dimensions $( p \times K )$, where K is the number of common topics and $K \le \min(\operatorname{rank}(M), \operatorname{rank}(C))$.
  • V is the context–topic representation matrix for the reports with dimensions ( K × n ) .
  • Q is the context–topic representation matrix for the sustainability goals with dimensions $( K \times m )$.
  • E and F are matrices of error terms with dimensions ( p × n ) and ( p × m ) , respectively.
Also, the elements of U, V, and Q are non-negative.
The associated topic extraction problem is then:
$$\min_{U,\, V,\, Q \,\ge\, 0 \text{ elementwise}} \big( \|M - UV\|^2 + \lambda\, \|C - UQ\|^2 \big), \tag{1}$$
where λ adapts the importance of the loss on the second factorization term (see Figure 1 for a schematic representation of the approach).
The value of λ balances the combined loss function, adjusting the impact of the loss components related to the reports and the SDGs. Given that the second dimension of C is much smaller than that of M, the first part of the loss will typically dominate during co-factorization. To prioritize the second part, λ can be adjusted accordingly.
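One simple illustrative heuristic (ours, not the paper's prescription; the paper tunes λ jointly with K via topic coherence, as described later) is to start from a λ that equalizes the a priori scales of the two loss terms:

```python
import numpy as np

def balance_lambda(M, C):
    """Illustrative starting heuristic: choose lambda so that
    lambda * ||C||_F^2 matches ||M||_F^2, giving both loss terms
    comparable scale before factorization."""
    return np.linalg.norm(M, "fro") ** 2 / np.linalg.norm(C, "fro") ** 2
```

Since C has only 17 columns against thousands of report pages in M, such a λ is typically much larger than one, which matches the intuition that the SDG term would otherwise be drowned out.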
Enforcing the non-negativity constraint on U, V, and Q enhances the interpretability of the resulting topics (Kuang et al. [52], Albalawi et al. [53]). Therefore, the minimization is subjected to:
$$U, V, Q \ge 0 \quad \text{elementwise}. \tag{2}$$
The algorithm designed to minimize (1) under the constraint (2) utilizes an alternating minimization/alternating projection approach, specifically the Hierarchical Alternating Least Squares (HALS) method by Cichocki et al. [48] (see also Degleris et al. [54]), adapted for our co-factorization framework. The objective function J ( U , V , Q ) for the loss function is given by:
$$
\begin{aligned}
J(U,V,Q) &= \|M - UV\|^2 + \lambda\, \|C - UQ\|^2 \\
&= \Big\| M - \sum_{k=1}^{K} u_k v_k^{\top} \Big\|^2 + \lambda\, \Big\| C - \sum_{k=1}^{K} u_k q_k^{\top} \Big\|^2 \\
&= \Big\| M - \sum_{k \ne p} u_k v_k^{\top} - u_p v_p^{\top} \Big\|^2 + \lambda\, \Big\| C - \sum_{k \ne p} u_k q_k^{\top} - u_p q_p^{\top} \Big\|^2 \\
&= \operatorname{Tr}\Big( \big(M - \textstyle\sum_{k \ne p} u_k v_k^{\top}\big)^{\top} \big(M - \textstyle\sum_{k \ne p} u_k v_k^{\top}\big) - 2\, \big(M - \textstyle\sum_{k \ne p} u_k v_k^{\top}\big)^{\top} u_p v_p^{\top} + u_p v_p^{\top} v_p u_p^{\top} \Big) \\
&\quad + \lambda\, \operatorname{Tr}\Big( \big(C - \textstyle\sum_{k \ne p} u_k q_k^{\top}\big)^{\top} \big(C - \textstyle\sum_{k \ne p} u_k q_k^{\top}\big) - 2\, \big(C - \textstyle\sum_{k \ne p} u_k q_k^{\top}\big)^{\top} u_p q_p^{\top} + u_p q_p^{\top} q_p u_p^{\top} \Big),
\end{aligned}
$$
where $u_k$, $v_k$, and $q_k$ denote the kth column of U and the kth rows of V and Q (written as column vectors), respectively.
The derivative with respect to u p is:
$$\frac{\partial J(U,V,Q)}{\partial u_p} = -2 \Big( M - \sum_{k \ne p} u_k v_k^{\top} \Big) v_p + 2\, u_p v_p^{\top} v_p - 2\lambda \Big( C - \sum_{k \ne p} u_k q_k^{\top} \Big) q_p + 2\lambda\, u_p q_p^{\top} q_p.$$
Thus, according to the Karush–Kuhn–Tucker conditions for optimality:
$$u_p = \max\left( 0,\; \frac{\big( M - \sum_{k \ne p} u_k v_k^{\top} \big) v_p + \lambda \big( C - \sum_{k \ne p} u_k q_k^{\top} \big) q_p}{v_p^{\top} v_p + \lambda\, q_p^{\top} q_p} \right).$$
The update rules for v p and q p remain consistent with the HALS NMF algorithm proposed by Cichocki et al. [48], namely:
$$v_p = \max\left( 0,\; \frac{\big( M - \sum_{k \ne p} u_k v_k^{\top} \big)^{\top} u_p}{u_p^{\top} u_p} \right), \qquad q_p = \max\left( 0,\; \frac{\big( C - \sum_{k \ne p} u_k q_k^{\top} \big)^{\top} u_p}{u_p^{\top} u_p} \right).$$
The resulting Algorithm 1 is presented below.
Algorithm 1 HALS algorithm for NMCF
Require: $K$, $\lambda$; non-negative initial $U$, $V$, $Q$
while not converged do
   for $k = 1$ to $K$ do
     update $V_k \leftarrow \max\!\big( U_k^{\top} (M - U_{-k} V_{-k}) \,/\, (U_k^{\top} U_k),\; 0 \big)$
     update $Q_k \leftarrow \max\!\big( U_k^{\top} (C - U_{-k} Q_{-k}) \,/\, (U_k^{\top} U_k),\; 0 \big)$
     update $U_k \leftarrow \max\!\big( \big( (M - U_{-k} V_{-k}) V_k^{\top} + \lambda (C - U_{-k} Q_{-k}) Q_k^{\top} \big) \,/\, \big( V_k V_k^{\top} + \lambda\, Q_k Q_k^{\top} \big),\; 0 \big)$
   end for
end while
Here, $U_k$ denotes the kth column of U, $V_k$ and $Q_k$ denote the kth rows of V and Q, and $U_{-k}$, $V_{-k}$, $Q_{-k}$ denote the respective matrices with that column or row removed.
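The HALS update loop for NMCF can be sketched in NumPy as follows (the paper's computations were done in R; the initialization, fixed iteration count, and eps guard here are our simplifications):

```python
import numpy as np

def nmcf_hals(M, C, K, lam, iters=100, seed=0, eps=1e-9):
    """HALS-style alternating updates for
    min ||M - U V||^2 + lam * ||C - U Q||^2
    with U (p x K), V (K x n), Q (K x m) elementwise non-negative."""
    rng = np.random.default_rng(seed)
    p, n = M.shape
    _, m = C.shape
    U = rng.random((p, K))
    V = rng.random((K, n))
    Q = rng.random((K, m))
    for _ in range(iters):
        RM = M - U @ V          # residual of the reports factorization
        RC = C - U @ Q          # residual of the SDG factorization
        for k in range(K):
            u = U[:, k]
            # residuals with topic k's contribution added back in
            Rm_k = RM + np.outer(u, V[k])
            Rc_k = RC + np.outer(u, Q[k])
            # closed-form non-negative updates for row k of V and Q
            V[k] = np.maximum(0.0, u @ Rm_k / (u @ u + eps))
            Q[k] = np.maximum(0.0, u @ Rc_k / (u @ u + eps))
            # joint update for column k of U, weighted by lam
            denom = V[k] @ V[k] + lam * (Q[k] @ Q[k]) + eps
            U[:, k] = np.maximum(0.0, (Rm_k @ V[k] + lam * (Rc_k @ Q[k])) / denom)
            RM = Rm_k - np.outer(U[:, k], V[k])
            RC = Rc_k - np.outer(U[:, k], Q[k])
    return U, V, Q
```

In the shared topic space, the cosine similarity between a column of V (a report page) and a column of Q (an SDG description) then quantifies how strongly that page relates to that goal.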
In summary, for a given K and λ , the algorithm yields a unified low-dimensional representation for M and C optimal in the sense of minimizing J ( U , V , Q ) under the non-negativity constraint. U represents a shared latent topic space, while V and Q serve as low-dimensional embeddings for the respective contexts within this topic space. This compact representation of corporate reports alongside SDGs provides a foundation for selecting, evaluating, and monitoring investments with respect to their impact on society and the environment.
The algorithm presented introduces an innovative approach to non-negative matrix co-factorization using the HALS algorithm as its core. It boasts computational efficiency, scalability, and ease of implementation. To our knowledge, such an algorithm has not been explored in the existing literature. Furthermore, the fusion of matrix co-factorization with NMF harnesses the strengths of both methods: comprehensive analysis of multiple data sources and the sparse, interpretable nature inherent to NMF.
When applying this algorithm, it is essential to assume that the information in both text corpora can be effectively represented by their respective term–context matrices, and that these matrices share a coherent, low-dimensional topic structure.

3. Application of NMCF

In this section, we apply the proposed algorithm to the bag-of-words representations of the reports and SDG texts. The NMCF algorithm requires two input parameters. The first, λ, governs the importance of the side information in the co-factorization process. The second, K, determines the number of latent topics and consequently the dimensionality of the latent topic space. We introduce a data-driven approach to select K and λ based on maximizing average topic coherence. We then present and visualize the resulting topic representations, demonstrating their utility in sustainability assessment through cosine similarity.

3.1. Tuning the Model

To implement NMCF using Algorithm 1, we need to specify the number of topics K (which defines the dimension of the latent topic space) and the nuisance parameter λ for the loss function. Additionally, there are several weighting schemes available for the term-context matrix. In this section, we explore various weighting schemes and propose a data-driven approach to simultaneously select K and λ within their plausible ranges, aiming to maximize topic coherence.
Topic coherence is a widely used metric for assessing the semantic quality of topics based on word co-occurrence (Thompson and Mimno [55], Selivanov et al. [56]). According to Gurdiel et al. [57], coherence-driven selection of topic numbers results in topics that are more interpretable for humans. Specifically, we utilize the average mean–logratio topic coherence based on the internal text corpora of the reports and the SDGs.
The logarithmic coherence $\mathrm{coh}_k$ for a topic $k$ with $m$ top words $w_{k,1}, \ldots, w_{k,m}$ is defined as:
$$\mathrm{coh}_k = \sum_{i=1}^{m} \sum_{j<i} \log \left( \frac{\#(w_{k,i}, w_{k,j})}{\#(w_{k,i})} + \varepsilon \right). \qquad (3)$$
Here, # ( · ) counts the contexts containing the input (a word or a word pair) and ε is a smoothing parameter. This metric quantifies how frequently the top m words in a topic k co-occur within the reference text corpus. It is grounded in the observation that words with similar meanings tend to appear together in the same contexts. Thus, topic coherence correlates positively with interpretability.
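A direct implementation of this coherence, together with the topic average used below, can be sketched as follows. The helper names are placeholders; the sketch assumes contexts are given as word lists and that every top word occurs in at least one context of the reference corpus.

```python
import numpy as np

def topic_log_coherence(top_words, docs, eps=1e-12):
    """Logarithmic (UMass-style) coherence of one topic: #(.) counts the
    contexts (word sets) containing a word or a word pair."""
    doc_sets = [set(d) for d in docs]
    def count(*words):
        return sum(all(w in s for w in words) for s in doc_sets)
    coh = 0.0
    for i in range(len(top_words)):
        for j in range(i):
            pair = count(top_words[i], top_words[j])
            coh += np.log(pair / count(top_words[i]) + eps)
    return coh

def average_coherence(topics, docs, eps=1e-12):
    # average over the K topics, as in the averaging step below
    return float(np.mean([topic_log_coherence(t, docs, eps) for t in topics]))
```

Top words that always co-occur contribute terms near log(1) = 0, while pairs that never co-occur contribute the heavily negative log(ε), so less coherent topics score lower.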
To determine the optimal values for the nuisance parameters that maximize topic coherence, we explore meaningful combinations of K and λ. We consider $K = 5, \ldots, 15$ and $\lambda \in [0, 700]$. Applying the NMCF algorithm to our data, we compute the logarithmic coherence for each topic as defined in (3). Subsequently, we calculate the average coherence across all topics:
$$\overline{\mathrm{coh}} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{coh}_k. \qquad (4)$$
As weighting alternatives, we consider:
  • Raw counts of term $i$ in context $j$, $tf_{ij}$ (labelled “none”);
  • Counts weighted by total term frequency, $tf_{ij} / \sum_j tf_{ij}$ (labelled “tf”);
  • Counts weighted by inverse document frequency (labelled “tf-idf”);
  • Logarithms of the counts, computed as $1 + \log_{10}(tf_{ij})$ for $tf_{ij} > 0$ and zero otherwise (labelled “logcount”);
  • Logarithms of the counts standardized by the average log counts, computed as $(1 + \log_{10}(tf_{ij})) / (1 + \overline{\log_{10}(tf_{ij})})$ (labelled “logave”).
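The count-based schemes above can be sketched as follows. This is an illustrative helper, not the paper's R code; the tf-idf variant uses the common log-inverse-document-frequency form and the "logave" denominator averages over the terms present in each context, both assumptions where the text leaves the exact formula open.

```python
import numpy as np

def weight_counts(tf, scheme="logcount"):
    """Apply one of the listed weighting schemes to a raw count matrix
    tf (terms x contexts)."""
    tf = np.asarray(tf, dtype=float)
    if scheme == "none":
        return tf
    if scheme == "tf":
        # divide each term's counts by its total frequency over all contexts
        return tf / tf.sum(axis=1, keepdims=True)
    if scheme == "tf-idf":
        df = (tf > 0).sum(axis=1, keepdims=True)   # contexts containing the term
        return tf * np.log(tf.shape[1] / df)
    safe_log = np.zeros_like(tf)
    np.log10(tf, out=safe_log, where=tf > 0)       # log10 of positive counts only
    logs = np.where(tf > 0, 1.0 + safe_log, 0.0)
    if scheme == "logcount":
        return logs
    if scheme == "logave":
        # standardize by the average log-count of the present terms per context
        nz = np.maximum((tf > 0).sum(axis=0, keepdims=True), 1)
        return logs / (1.0 + safe_log.sum(axis=0, keepdims=True) / nz)
    raise ValueError(f"unknown scheme: {scheme}")
```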
For each weighting scheme above, we select the combination of K and λ that yields the highest average logarithmic coherence computed using (4). The optimal parameter values of K and λ for each weighting scheme are presented in Table 1. Subsequently, we determine the weighting scheme along with the specific K and λ combination that achieves the highest average log coherence.
Based on the findings from Table 1, the logarithmic weighting schemes consistently demonstrate superior results in terms of average coherence. The highest average coherence score is achieved using logarithmic counts in the term–context matrices, with K = 6 topics and λ = 390 . Therefore, we adopt this parameter combination for our subsequent analysis.

3.2. Comparing the Optimized Model with a Competing Technique: Keyword Seeded LDA

In this section, we evaluate the performance of our model against a competing technique, the keyword-seeded topic model (keyATM) introduced in Eshima et al. [32]. This Bayesian method extracts topics with a specific focus by incorporating user-specified topic keywords into the topic prior distributions. However, since our approach uses the SDG texts as side information without predefined topic keywords, we adopt a two-stage procedure to create an analogous comparison:
  • First, we employ the classical LDA model on the SDG texts to extract topic keywords. These keywords comprise the top words for each topic extracted from the SDGs.
  • Next, we input these extracted keywords into the keyATM to generate keyword-assisted topics.
Thus, in the initial stage, we fit the classical LDA model using only the SDG texts. Determining the number of topics K for this model is crucial. Based on the average topic coherence criterion described in Equation (4) from the previous section, we find the optimal number of topics for this LDA model to be K = 6 . The top 10 words for each extracted topic constitute our keyword set for the subsequent stage. The resulting keywords are summarized in Table 2.
In addition to keyword topics, keyATM incorporates a user-specified number of topics without predefined keywords, which must also be provided. Using the top words for each LDA topic from the first stage, we fit a keyword topic model that includes six keyword topics and a variable number (ranging from zero to four) of additional topics without keywords. Subsequently, we compute the average coherence for the resulting models. The results are presented in Table 3. The best achieved average topic coherence is much lower than that reported by our proposed model, demonstrating its advantage.
It is important to note that we did not optimize any parameters of the keyATM priors in this evaluation; optimizing these priors could potentially enhance the results. Nevertheless, this competing approach requires calibrating both the initial LDA model and the subsequent keyATM model to achieve effective topic extraction with integrated side information from the SDGs, and one must consider the additional uncertainty introduced by such a “plug-in” estimator. In contrast, our proposed model requires calibrating the nuisance parameters only once and performs the decomposition in a single stage.

3.3. Interpreting the Best NMCF Model

The output of Algorithm 1 consists of the decomposition matrices V, U, and Q. Matrix U contains the term–topic representations. By examining the largest entries of U and the corresponding terms (top words), we can interpret the resulting latent topics. The entries of V and Q, along with their relative magnitudes, reveal the proportions and the importance of the topics in the respective text corpora.
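Reading the top-weighted terms off U can be sketched as follows; a minimal illustration assuming U stores one topic per column, with the helper name being a placeholder.

```python
import numpy as np

def top_words(U, vocab, n_top=5):
    """For each column (topic) of the word-topic matrix U, list the
    terms with the largest weights."""
    vocab = np.asarray(vocab)
    return [vocab[np.argsort(U[:, k])[::-1][:n_top]].tolist()
            for k in range(U.shape[1])]
```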
Figure 2 shows the topic proportions and the top words per topic for each discovered topic in both the reports and the SDG texts. The top five words shown in Figure 2 provide sufficient insight for topic interpretation. There is a noticeable difference in topic distribution between the reports and the SDG texts. For instance, the topic “industri complet found local innov” appears prominently in both distributions, while “communic qualiti continu corpor right” and “water biodiv affect protect prevent” dominate the reports and SDGs, respectively. This topic proportion representation facilitates the discovery of new action areas for companies.
In essence, the entries of V and Q furnish the k-dimensional context–topic representations, facilitating comparison of underlying contexts in a reduced-topic space. Using the derived representations of corporate reports and SDGs, the subsequent subsection examines strategies for selecting, evaluating, and monitoring investments based on their societal and environmental impacts.

3.4. Associating the Reports with the SDGs

We apply a cosine similarity measure, widely used in text analysis, to associate the reports with the SDGs. While other dissimilarity measures are possible with our focused topic embeddings, cosine similarity typically outperforms them in text comparison tasks (see, for example, Alobed et al. [58]). Therefore, using cosine similarity aligns our association analysis with mainstream text mining practices.
Given that each report combines multiple contexts, each represented in a six-dimensional topic space, we aggregate context-based similarity measures to the report level. Specifically, we use the maximum similarity score across all contexts within each report.
To visualize the resulting similarities (Figure 3), we link the contents of each report to the SDGs based on maximum cosine similarity. This involves computing the cosine similarity between each context of a report and each SDG, then selecting the maximum similarity score across all contexts of the report as the overall similarity measure.
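This max-over-contexts aggregation can be sketched as follows, assuming V_report holds the report-context embeddings column-wise, Q holds the SDG embeddings column-wise, and report_ids labels each context column with its report; the names are illustrative.

```python
import numpy as np

def report_sdg_similarity(V_report, Q, report_ids):
    """For each report, take the maximum cosine similarity between any
    of its context embeddings and each SDG embedding."""
    Vn = V_report / np.linalg.norm(V_report, axis=0, keepdims=True)
    Qn = Q / np.linalg.norm(Q, axis=0, keepdims=True)
    cos = Vn.T @ Qn                        # contexts x SDGs cosine matrix
    out = {}
    for rid in set(report_ids):
        mask = np.array([r == rid for r in report_ids])
        out[rid] = cos[mask].max(axis=0)   # max over the report's contexts
    return out
```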
Note that the similarity measures in Figure 3 are computed for each company–year, enabling dynamic analysis of the underlying SDG-related content and the evolution of sustainability actions taken by companies over time.
For a static analysis, we can compute the average of the cosine similarities over all available report years. This allows us to establish a similarity-based rating for the considered firms with respect to each SDG. The resulting ratings are presented in Table 4. To illustrate the results, consider an investor who wants to allocate an investment either in DELL or AAPL. According to Table 4, an investor prioritizing Goals 2, 8, 11–12, and 14 would prefer to invest in AAPL over DELL. An investor prioritizing any other SDG would prefer to invest in DELL.
Our framework is not limited to associating reports with individual SDGs alone. We can also consider linear combinations of goals based on personal preferences, allowing for tailored sustainability assessments. For instance, we define a linear combination of goals using weights $\beta = (\beta_1, \ldots, \beta_{17})$. Then, $C\beta \approx UQ\beta$ defines a “personalized” goal based on term occurrences, approximated through the co-factorization. We provide examples of four different SDG portfolios to illustrate this tailored approach to sustainability assessment.
Our example portfolios are:
  • “all_equal” (all goals equally weighted);
  • “basic_needs” (goals addressing basic human needs (SDGs 1-6) equally weighted, with zero weights for all other goals);
  • “fair_society” (goals concerning society and infrastructure development (SDGs 7-12 and 16-17) equally weighted, with zero weights for all other goals);
  • “climate_life” (the goals addressing climate, plant, and animal life (SDGs 13-15) equally weighted, with zero weights for all other goals).
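Under these definitions, ranking firms against such a portfolio can be sketched as follows; the helper name and the toy dimensions are illustrative, not the paper's implementation.

```python
import numpy as np

def portfolio_rating(report_emb, Q, beta):
    """Rank firms against a 'personalized' goal Q @ beta, where beta
    weights the SDG embeddings (columns of Q) and report_emb maps firm
    name -> aggregated topic embedding. Returns firms from most to
    least similar by cosine similarity."""
    goal = Q @ beta
    goal = goal / np.linalg.norm(goal)
    def cos(v):
        return float(v @ goal / np.linalg.norm(v))
    return sorted(report_emb, key=lambda f: cos(report_emb[f]), reverse=True)
```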
In Table 5, we present the firm ratings based on different SDG portfolios. Consider again an investor deciding between AAPL and DELL based on individual SDG preferences. If the investor weights all SDGs (“all_equal”), according to the ratings in Table 5, the preference would lean towards AAPL over DELL. However, if their preferences prioritize “basic_needs”, they would prefer to invest in DELL over AAPL.
As illustrated, the ratings can vary significantly depending on specific preferences. Generally, any linear combination of goals can serve as a basis for comparison, highlighting the flexibility of our approach. Furthermore, any user-defined (dis)similarity metric can be applied to the resulting embeddings in the topic space, adding further flexibility to our method.
In summary, as demonstrated in the above analysis, our proposed matrix co-factorization for sustainability assessment, structured around the 17 SDGs, offers a transparent and adaptable approach. It provides a low-dimensional topic representation that facilitates the dynamic association of sustainability-related reports with the SDGs.

4. Conclusions and Discussion

In this paper, we propose a transparent approach for representing the textual content of sustainability reports within a topic space defined by the 17 SDGs. Our methodology leverages non-negative matrix co-factorization for topic extraction with side information, yielding a low-dimensional representation aligned with a predefined topic structure. This method is scalable, straightforward to implement, computationally efficient, and operates without the need for manual intervention, distinguishing it from comparable methods. It yields interpretable results suitable for various applications.
Our approach involves jointly factorizing two term–context matrices: one containing term–context counts from sustainability-related reports and the other from SDG texts, representing predefined structural information. The associated algorithm, based on hierarchical NMF, requires two nuisance parameters. The first parameter, λ , controls the importance of side information in the co-factorization process. The second parameter, K, determines the number of latent topics and the resulting dimensionality of the topic space. We evaluate multiple weighting schemes for term–context representations and propose a data-driven procedure for selecting optimal parameter values based on maximizing average topic coherence, a common metric for unsupervised topic extraction.
Using average topic coherence as our criterion, we compare our method to a comparable competitor, the keyword-seeded topic model by Eshima et al. [32]. Our results demonstrate superior performance in terms of average topic coherence, highlighting our method’s computational simplicity and efficacy in directed topic extraction.
Our methodological approach offers significant advantages in terms of simplicity, scalability, transparency, and interpretability. While parameter calibration and weighting scheme selection are necessary, we advocate a cross-validation approach based on optimizing topic coherence to guide these decisions. This approach, while computationally intensive, ensures robust parameter choices for comprehensive analysis.
The resulting SDG-directed contextual topic embeddings enable dynamic comparisons of sustainability-related reports across eight tech firms. By associating these reports with SDGs using the maximum cosine similarity of their embeddings, we illustrate how our methodology can effectively support financial decisions aligned with tailored SDG-based investor preferences. Ultimately, investors and stakeholders can leverage our ESG assessment methodology to gain confidence in ESG ratings that adhere to a consistent value system aligned with the SDGs.
However, a critical assumption of our analysis is the objectivity of the information within sustainability reports, which may not always hold true. Laskin and Nesova [59] discuss the issue of credibility and optimism bias in sustainability reporting. Moreover, we acknowledge the omission of sentiment analysis (positive or negative tone) in our sustainability assessment, which is crucial for a comprehensive evaluation (Mućko [60]). Addressing these aspects presents promising directions for future research.

Funding

This research received no external funding.

Data Availability Statement

Data are available at https://github.com/omanya/sustainability_dimensions (accessed on 4 April 2024).

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Friede, G.; Busch, T.; Bassen, A. ESG and financial performance: Aggregated evidence from more than 2000 empirical studies. J. Sustain. Financ. Investig. 2015, 5, 210–233. [Google Scholar] [CrossRef]
  2. Eccles, R.G.; Ioannou, I.; Serafeim, G. The Impact of Corporate Sustainability on Organizational Processes and Performance. Manag. Sci. 2014, 60, 2835–2857. [Google Scholar] [CrossRef]
  3. Berg, F.; Kölbel, J.F.; Rigobon, R. Aggregate Confusion: The Divergence of ESG Ratings. Rev. Financ. 2022, 26, 1315–1344. [Google Scholar] [CrossRef]
  4. Chatterji, A.K.; Durand, R.; Levine, D.I.; Touboul, S. Do ratings of firms converge? Implications for managers, investors and strategy researchers. Strateg. Manag. J. 2016, 37, 1597–1614. [Google Scholar] [CrossRef]
  5. Soh, D.S.B. Sustainability Reporting and Assurance: A Historical Analysis on a World-Wide Phenomenon. Soc. Environ. Account. J. 2014, 34, 125. [Google Scholar] [CrossRef]
  6. Gillan, S.L.; Koch, A.; Starks, L.T. Firms and social responsibility: A review of ESG and CSR research in corporate finance. J. Corp. Financ. 2021, 66, 101889. [Google Scholar] [CrossRef]
  7. Aureli, S.; Gigli, S.; Medei, R.; Supino, E. The value relevance of environmental, social, and governance disclosure: Evidence from Dow Jones Sustainability World Index listed companies. Corp. Soc. Responsib. Environ. Manag. 2020, 27, 43–52. [Google Scholar] [CrossRef]
  8. Liew, W.T.; Adhitya, A.; Srinivasan, R. Sustainability trends in the process industries: A text mining-based analysis. Comput. Ind. 2014, 65, 393–400. [Google Scholar] [CrossRef]
  9. Landrum, N.; Ohsowski, B. Identifying Worldviews on Corporate Sustainability: A Content Analysis of Corporate Sustainability Reports. Bus. Strategy Environ. 2017, 27, 128–151. [Google Scholar] [CrossRef]
  10. Tsalis, T.A.; Malamateniou, K.E.; Koulouriotis, D.; Nikolaou, I.E. New challenges for corporate sustainability reporting: United Nations’ 2030 Agenda for sustainable development and the sustainable development goals. Corp. Soc. Responsib. Environ. Manag. 2020, 27, 1617–1629. [Google Scholar] [CrossRef]
  11. Kang, H.; Kim, J. Analyzing and Visualizing Text Information in Corporate Sustainability Reports Using Natural Language Processing Methods. Appl. Sci. 2022, 12, 5614. [Google Scholar] [CrossRef]
  12. Churchill, R.; Singh, L. The Evolution of Topic Modeling. ACM Comput. Surv. 2022, 54, 215. [Google Scholar] [CrossRef]
  13. Deerwester, S.; Dumais, S.T.; Furnas, G.W.; Landauer, T.K.; Harshman, R. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 1990, 41, 391–407. [Google Scholar] [CrossRef]
  14. Hofmann, T. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, CA, USA, 15–19 August 1999; ACM: New York, NY, USA, 1999; pp. 50–57. [Google Scholar] [CrossRef]
  15. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
  16. Lee, D.; Seung, H.S. Algorithms for Non-negative Matrix Factorization. In Advances in Neural Information Processing Systems; Leen, T., Dietterich, T., Tresp, V., Eds.; MIT Press: Cambridge, MA, USA, 2000; Volume 13. [Google Scholar]
  17. Vangara, R.; Skau, E.; Chennupati, G.; Djidjev, H.; Tierney, T.; Smith, J.P.; Bhattarai, M.; Stanev, V.G.; Alexandrov, B.S. Semantic Nonnegative Matrix Factorization with Automatic Model Determination for Topic Modeling. In Proceedings of the 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), Miami, FL, USA, 14–17 December 2020; pp. 328–335. [Google Scholar] [CrossRef]
  18. Yang, Q.; Li, W. The LDA Topic Model Extension Study. In Proceedings of the International Conference on Logistics, Engineering, Management and Computer Science, Shenyang, China, 29–31 July 2015; Atlantis Press: Dordrecht, The Netherlands, 2015; pp. 857–860. [Google Scholar] [CrossRef]
  19. Suleman, R.M.; Korkontzelos, I. Extending latent semantic analysis to manage its syntactic blindness. Expert Syst. Appl. 2021, 165, 114130. [Google Scholar] [CrossRef]
  20. Figuera, P.; García Bringas, P. Revisiting Probabilistic Latent Semantic Analysis: Extensions, Challenges and Insights. Technologies 2024, 12, 5. [Google Scholar] [CrossRef]
  21. Zhao, H.; Phung, D.Q.; Huynh, V.; Jin, Y.; Du, L.; Buntine, W.L. Topic Modelling Meets Deep Neural Networks: A Survey. In Proceedings of the International Joint Conference on Artificial Intelligence, Montreal, QC, Canada, 19–26 August 2021. [Google Scholar]
  22. Li, G.; Zhu, X.; Wang, J.; Wu, D.; Li, J. Using LDA Model to Quantify and Visualize Textual Financial Stability Report. Procedia Comput. Sci. 2017, 122, 370–376. [Google Scholar] [CrossRef]
  23. Chen, Y.; Rabbani, R.M.; Gupta, A.; Zaki, M.J. Comparative text analytics via topic modeling in banking. In Proceedings of the 2017 IEEE Symposium Series on Computational Intelligence (SSCI), Honolulu, HI, USA, 27 November–1 December 2017; pp. 1–8. [Google Scholar] [CrossRef]
  24. Amini, M.; Bienstock, C.C.; Narcum, J.A. Status of corporate sustainability: A content analysis of Fortune 500 companies. Bus. Strategy Environ. 2018, 27, 1450–1461. [Google Scholar] [CrossRef]
  25. Chen, W.; Rabhi, F.; Liao, W.; Al-Qudah, I. Leveraging State-of-the-Art Topic Modeling for News Impact Analysis on Financial Markets: A Comparative Study. Electronics 2023, 12, 2605. [Google Scholar] [CrossRef]
  26. Loughran, T.; McDonald, B. Textual Analysis in Accounting and Finance: A Survey. J. Account. Res. 2016, 54, 1187–1230. [Google Scholar] [CrossRef]
  27. Gupta, A.; Dengre, V.; Kheruwala, H.A.; Shah, M. Comprehensive review of text-mining applications in finance. Financ. Innov. 2020, 6, 1–25. [Google Scholar] [CrossRef]
  28. Chen, Y.; Zhang, H.; Liu, R.; Ye, Z.; Lin, J. Experimental explorations on short text topic mining between LDA and NMF based Schemes. Knowl.-Based Syst. 2019, 163, 1–13. [Google Scholar] [CrossRef]
  29. Egger, R.; Yu, J. A Topic Modeling Comparison Between LDA, NMF, Top2Vec, and BERTopic to Demystify Twitter Posts. Front. Soc. 2022, 7, 886498. [Google Scholar] [CrossRef] [PubMed]
  30. Nugumanova, A.; Akhmed-Zaki, D.; Mansurova, M.; Baiburin, Y.; Maulit, A. NMF-based approach to automatic term extraction. Expert Syst. Appl. 2022, 199, 117179. [Google Scholar] [CrossRef]
  31. Harandizadeh, B.; Priniski, J.H.; Morstatter, F. Keyword Assisted Embedded Topic Model. In WSDM′22: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, Virtual Event, AZ, USA, 21–25 February 2022; Association for Computing Machinery: New York, NY, USA, 2022. [Google Scholar] [CrossRef]
  32. Eshima, S.; Imai, K.; Sasaki, T. Keyword-Assisted Topic Models. Am. J. Political Sci. 2024, 68, 730–750. [Google Scholar] [CrossRef]
  33. Watanabe, K.; Zhou, Y. Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches. Soc. Sci. Comput. Rev. 2022, 40, 346–366. [Google Scholar] [CrossRef]
  34. Rao, N.; Yu, H.F.; Ravikumar, P.K.; Dhillon, I.S. Collaborative Filtering with Graph Information: Consistency and Scalable Methods. In Advances in Neural Information Processing Systems, Proceedings of the Annual Conference on Neural Information Processing Systems 2015, Montreal, QC, Canada, 7–12 December 2015; Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2015; Volume 28. [Google Scholar]
  35. Zhang, Y.; Yun, Y.; Dai, H.; Cui, J.; Shang, X. Graphs Regularized Robust Matrix Factorization and Its Application on Student Grade Prediction. Appl. Sci. 2020, 10, 1755. [Google Scholar] [CrossRef]
  36. Fang, Y.; Si, L. Matrix co-factorization for recommendation with rich side information and implicit feedback. In Proceedings of the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems, Chicago, IL, USA, 27 October 2011; ACM: New York, NY, USA, 2011; pp. 1165–1169. [Google Scholar] [CrossRef]
  37. Luo, L.; Xie, H.; Rao, Y.; Wang, F.L. Personalized recommendation by matrix co-factorization with tags and time information. Expert Syst. Appl. 2019, 119, 311–321. [Google Scholar] [CrossRef]
  38. Billio, M.; Costola, M.; Hristova, I.; Latino, C.; Pelizzon, L. Inside the ESG ratings: (Dis)agreement and performance. Corp. Soc. Responsib. Environ. Manag. 2021, 28, 1426–1445. [Google Scholar] [CrossRef]
  39. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2023. [Google Scholar]
  40. Benoit, K.; Watanabe, K.; Wang, H.; Nulty, P.; Obeng, A.; Müller, S.; Matsuo, A. quanteda: An R package for the quantitative analysis of textual data. J. Open Source Softw. 2018, 3, 774. [Google Scholar] [CrossRef]
  41. Lee, D.; Seung, H. Learning the Parts of Objects by Non-Negative Matrix Factorization. Nature 1999, 401, 788–791. [Google Scholar] [CrossRef] [PubMed]
  42. Xu, W.; Liu, X.; Gong, Y. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval, Toronto, ON, Canada, 28 July–1 August 2003; ACM: New York, NY, USA, 2003; pp. 267–273. [Google Scholar] [CrossRef]
  43. Greene, D.; Cunningham, P. Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; ACM: New York, NY, USA, 2006; pp. 377–384. [Google Scholar] [CrossRef]
  44. Ma, H.; King, I.; Lyu, M.R. Mining Web Graphs for Recommendations. IEEE Trans. Knowl. Data Eng. 2012, 24, 1051–1064. [Google Scholar] [CrossRef]
  45. O’Callaghan, D.; Greene, D.; Carthy, J.; Cunningham, P. An analysis of the coherence of descriptors in topic modeling. Expert Syst. Appl. 2015, 42, 5645–5657. [Google Scholar] [CrossRef]
  46. Vavasis, S.A. On the Complexity of Nonnegative Matrix Factorization. SIAM J. Optim. 2009, 20, 1364–1377. [Google Scholar] [CrossRef]
  47. Cichocki, A.; Phan, A.H. Fast Local Algorithms for Large Scale Nonnegative Matrix and Tensor Factorizations. IEICE Trans. 2009, 92-A, 708–721. [Google Scholar] [CrossRef]
  48. Cichocki, A.; Zdunek, R.; Amari, S.i. Hierarchical ALS Algorithms for Nonnegative Matrix and 3D Tensor Factorization. In Independent Component Analysis and Signal Separation, Proceedings of the 7th International Conference, ICA 2007, London, UK, 9–12 September 2007; Davies, M.E., James, C.J., Abdallah, S.A., Plumbley, M.D., Eds.; Springer: Berlin/Heidelberg, Germany, 2007; pp. 169–176. [Google Scholar]
  49. Hautecoeur, C.; De Lathauwer, L.; Gillis, N.; Glineur, F. Least-Squares Methods for Nonnegative Matrix Factorization Over Rational Functions. IEEE Trans. Signal Process. 2023, 71, 1712–1724. [Google Scholar] [CrossRef]
  50. Gillis, N.; Glineur, F. Accelerated Multiplicative Updates and Hierarchical ALS Algorithms for Nonnegative Matrix Factorization. Neural Comput. 2011, 24, 1085–1105. [Google Scholar] [CrossRef]
  51. Koren, Y.; Bell, R.; Volinsky, C. Matrix Factorization Techniques for Recommender Systems. Computer 2009, 42, 30–37. [Google Scholar] [CrossRef]
  52. Kuang, D.; Choo, J.; Park, H. Nonnegative Matrix Factorization for Interactive Topic Modeling and Document Clustering. In Partitional Clustering Algorithms; Celebi, M.E., Ed.; Springer International Publishing: Cham, Switzerland, 2015; pp. 215–243. [Google Scholar] [CrossRef]
  53. Albalawi, R.; Yeap, T.H.; Benyoucef, M. Using Topic Modeling Methods for Short-Text Data: A Comparative Analysis. Front. Artif. Intell. 2020, 3, 42. [Google Scholar] [CrossRef]
  54. Degleris, A.; Antin, B.; Ganguli, S.; Williams, A.H. Fast Convolutive Nonnegative Matrix Factorization through Coordinate and Block Coordinate Updates. arXiv 2019, arXiv:1907.00139. [Google Scholar]
  55. Thompson, L.; Mimno, D. Authorless Topic Models: Biasing Models Away from Known Structure. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; Bender, E.M., Derczynski, L., Isabelle, P., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 3903–3914. [Google Scholar]
  56. Selivanov, D.; Bickel, M.; Wang, Q. text2vec: Modern Text Mining Framework for R, R package version 0.6.3; R Foundation for Statistical Computing: Vienna, Austria, 2022. [Google Scholar]
  57. Gurdiel, L.; Morales Mediano, J.; Cifuentes Quintero, J. A comparison study between coherence and perplexity for determining the number of topics in practitioners interviews analysis. In Proceedings of the IV Iberoamerican Conference of Young Researchers in Economy and Management, Madrid, Spain, 16–17 December 2021. [Google Scholar]
  58. Alobed, M.; Altrad, A.M.M.; Bakar, Z.B.A. A Comparative Analysis of Euclidean, Jaccard and Cosine Similarity Measure and Arabic Wordnet for Automated Arabic Essay Scoring. In Proceedings of the 2021 Fifth International Conference on Information Retrieval and Knowledge Management (CAMP), Kuala Lumpur, Malaysia, 15–16 June 2021; pp. 70–74. [Google Scholar] [CrossRef]
  59. Laskin, A.V.; Nesova, N.M. The Language of Optimism in Corporate Sustainability Reports: A Computerized Content Analysis. Bus. Prof. Commun. Q. 2022, 85, 80–98. [Google Scholar] [CrossRef]
  60. Mućko, P. Sentiment analysis of CSR disclosures in annual reports of EU companies. Procedia Comput. Sci. 2021, 192, 3351–3359. [Google Scholar] [CrossRef]
Figure 1. Schematic representation of the proposed matrix co-factorization method.
Figure 2. Topic proportions and the top-weighted words per topic for each discovered topic in the reports (top) and in the SDG texts (bottom).
Figure 3. Similarity measures between the reports across available company–years (rows, with earlier years starting at the bottom) and the SDGs (columns), computed using the resulting topic embeddings with maximum cosine similarity. Darker colors indicate higher similarity values.
Table 1. The optimal parameter values for K and λ across different weighting schemes, determined using a grid-search algorithm to maximize the average coherence.
Weighting | λ | K | coh (Reports) | coh (SDGs) | coh (All)
none | 334 | 8 | −2.62348 | −0.94501 | −1.78425
tf | 660 | 8 | −2.25165 | −1.58800 | −1.91982
tf-idf | 346 | 15 | −6.09706 | −2.04164 | −4.06935
logcount | 390 | 6 | −2.40807 | −0.64715 | −1.52761
logave | 432 | 6 | −2.42982 | −0.64560 | −1.53771
Table 2. The keywords (top 10 topic words) associated with the six topics extracted by LDA.
Category | Keywords
topic 1 | food, ecosystem, sourc, agricultur, land, protect, effici, natur, suppli, system
topic 2 | water, employ, innov, guidelin, work, overview, institut, labor, local, growth
topic 3 | sector, complet, benefit, inform, disclosur, consumpt, base, wast, least, solut
topic 4 | poverti, infrastructur, inclus, public, financ, measur, industri, overview, may, world
topic 5 | health, women, right, opportun, qualiti, compani, medicin, found, men, care
topic 6 | build, climat, resili, marin, afford, integr, ocean, plan, transport, solut
Table 3. The average coherence results for the six keyword topics achieved by keyATM.
Total Number of Topics | coh (All) | coh (Reports) | coh (SDGs)
6 | −4.35951 | −1.20522 | −7.51379
7 | −4.17859 | −1.12814 | −6.86721
8 | −4.26095 | −1.10142 | −7.74990
9 | −4.24789 | −1.19756 | −7.21985
10 | −4.28675 | −1.14048 | −7.74389
Table 4. Company ratings based on similarity (from most to least similar) to individual SDGs using the aggregated topic embeddings across all available report years.
Goal | Rating
G1 | SSU, AMZN, DELL, IBM, AAPL, INTC, MSFT, GOOG
G2 | AAPL, AMZN, GOOG, IBM, SSU, INTC, DELL, MSFT
G3 | AMZN, AAPL, SSU, IBM, MSFT, DELL, INTC, GOOG
G4 | AMZN, INTC, IBM, SSU, MSFT, DELL, AAPL, GOOG
G5 | SSU, INTC, IBM, AMZN, MSFT, DELL, AAPL, GOOG
G6 | IBM, SSU, AMZN, INTC, DELL, AAPL, MSFT, GOOG
G7 | SSU, AMZN, IBM, DELL, GOOG, AAPL, MSFT, INTC
G8 | SSU, AMZN, IBM, AAPL, DELL, INTC, MSFT, GOOG
G9 | SSU, AMZN, IBM, INTC, DELL, AAPL, MSFT, GOOG
G10 | SSU, DELL, AMZN, AAPL, MSFT, INTC, IBM, GOOG
G11 | SSU, AMZN, IBM, INTC, AAPL, MSFT, DELL, GOOG
G12 | AAPL, IBM, SSU, GOOG, AMZN, INTC, DELL, MSFT
G13 | AMZN, SSU, IBM, DELL, GOOG, AAPL, INTC, MSFT
G14 | SSU, IBM, INTC, AAPL, MSFT, AMZN, DELL, GOOG
G15 | IBM, INTC, AMZN, DELL, SSU, GOOG, AAPL, MSFT
G16 | SSU, IBM, AMZN, INTC, MSFT, DELL, AAPL, GOOG
G17 | DELL, AMZN, SSU, INTC, AAPL, IBM, MSFT, GOOG
Table 5. Company similarity-based ratings (from most to least similar) with respect to the SDG portfolios, using the topic embeddings obtained for the reports in the year 2020.
Portfolio | Rating
all_equal | SSU, INTC, MSFT, AMZN, IBM, AAPL, DELL, GOOG
basic_needs | AMZN, INTC, SSU, MSFT, IBM, DELL, AAPL, GOOG
fair_society | INTC, MSFT, SSU, IBM, AAPL, AMZN, DELL, GOOG
climate_life | SSU, INTC, AMZN, AAPL, MSFT, IBM, GOOG, DELL
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Osipenko, M. Directed Topic Extraction with Side Information for Sustainability Analysis. Analytics 2024, 3, 389-405. https://doi.org/10.3390/analytics3030021

