Knowledge Graph Extraction of Business Interactions from News Text for Business Networking Analysis

Network representation of data is key to a variety of fields and their applications including trading and business. A major source of data that can be used to build insightful networks is the abundant amount of unstructured text data available through the web. The efforts to turn unstructured text data into a network have spawned different research endeavors, including the simplification of the process. This study presents the design and implementation of TraCER, a pipeline that turns unstructured text data into a graph, targeting the business networking domain. It describes the application of natural language processing techniques used to process the text, as well as the heuristics and learning algorithms that categorize the nodes and the links. The study also presents some simple yet efficient methods for the entity-linking and relation classification steps of the pipeline.


Introduction
The field of machine learning has seen significant advancements in algorithms for graph analysis, which have proven effective in tasks such as node classification, link prediction, and graph classification [1]. This has sparked a renewed interest in graph theory applications across diverse fields, notably network science and knowledge graphs (KGs) [2], which are pivotal in areas ranging from molecular biology to social network analysis. The ubiquity of graph data structures and the topological advantage they present, as emphasized by Bronstein et al. [3], make them suitable for deep-learning models that consider the geometry of data with irregular topologies.
Molecular biology, social networks, recommender systems, and question-answering are a few examples of the domains that benefit from deep-learning models trained on graph data structures. For instance, molecular biology leverages deep-learning models trained on graph data structures to enable analysis of protein-protein interaction networks [4]. In social networks, graph-based datasets facilitate community and anomaly detection [5,6]. Recommender systems harness graph-based models to improve personalized recommendations [7]. Additionally, question-answering benefits from knowledge graph representations of data by capturing semantic relationships between entities [8]. Another domain that can greatly benefit from deep learning on graph-structured datasets is the study of business networking. Graph-structured datasets can improve business networking in many ways, including relationship analysis of connections between businesses to uncover potential opportunities for partnerships or collaborations, and community detection to identify entities that share common interests. Business networking analyses are usually conducted from the perspective of an organization to understand its relationships with customers and suppliers. Leveraging graph-structured data will enable market-level business network analysis, encompassing cross-market business networking. This will provide business decision-makers with unprecedented insights.
Yet, a notable gap persists in the availability of graph datasets tailored for business networking analysis, particularly in benchmark collections such as the Open Graph Benchmark [9]. Business news articles, press releases, and industry publications can be valuable for tracking business events, partnerships, mergers, and other networking activities. They provide timely and current information, which is crucial for understanding the latest developments in the business world and how they impact networking relationships. Most news articles are publicly accessible through the web, cover a diverse range of industries, and are expected to be unbiased. It is therefore worthwhile to have tools that seamlessly generate graph-structured datasets from business news articles, so the business networking domain can take advantage of advances in the graph-learning field.
This study introduces TraCER (Trading Content Extraction and Representation), a streamlined and efficient pipeline designed to extract and represent content from unstructured text data. The primary objective of TraCER is to generate a graph representation dataset tailored for the analysis of business networking. Constructing networks from unstructured textual information presents unique challenges, the foremost being the identification of entity references and their interconnected relationships. While the literature contains various studies and tools dedicated to distinct aspects of automatic knowledge graph construction from text, such as named entity recognition [10] and relation extraction [11], there is a scarcity of integrated approaches that seamlessly transition from textual data to graph creation. Moreover, to the best of our knowledge, no such automated pipeline has been specifically designed for the domain of business networking. This study's contribution, therefore, seeks to fill this gap by addressing the absence of domain-specific graph datasets for business networking. We achieve this by providing an integrated toolchain that enables the generation of business networking graph datasets from business news sources. The tools comprising TraCER are openly accessible on the web (https://github.com/semlab/tracer and https://github.com/semlab/triplex, accessed on 27 December 2023).

Building a Knowledge Graph from Text
Transforming textual data into network representations is commonly a challenging endeavor. Typically, networks are constructed using data sourced from knowledge bases. The creation of these knowledge bases often involves either an extensive and meticulous process of data collection and curation, carried out by experts in the domain, or an automated compilation of diverse and external structured data sources. Consequently, the development of a network dataset tends to be either a labor- and time-intensive task or one that requires a sophisticated combination of tools to extract and combine pre-existing structured data.
The representation of data as a graph was initially carried out manually. As such, the Zachary karate club [12] was put together as a network representation. It is a small dataset of 34 nodes, which is frequently used as a benchmark dataset, including in contemporary studies [13,14]. Considerable human labor went into building knowledge graphs at scale. Some examples include WordNet, a lexical database of semantic relations between words [15], and Cyc, which attempts to model basic concepts and rules of the world to be used as common-sense knowledge by semantic reasoners [16]. The manual approach quickly turned out to be impractical at scale. This led researchers to introduce automation in building knowledge graphs using structured knowledge bases. An example is CiteSeer [17], an index for academic literature that is autonomously built from citation sections of academic documents. As information retrieval methodologies improved and catalogs of human knowledge became available on the web, large-scale KGs were built from information extracted from semi-structured information catalogs, notably Wikipedia. Those KGs include DBpedia and YAGO. In addition, as natural language processing methods came to be used for information extraction from natural text, the automation of KG construction increased as well. The automatic construction of knowledge graphs involves many steps that are automated in isolation. Zhong et al. [18] surveyed procedures used to automate different steps of knowledge graph construction, in which we can notice that the automation of each step is an active domain of research. Fewer works offer an integrated approach that describes how to go from text to a KG. Kertkeidkachorn et al. propose T2KG, an automatic knowledge graph creation framework from natural language text [19]. T2KG extracts triples from texts, then maps entities and predicates to an existing KG such as DBpedia. T2KG thus provides a mechanism to build an open-domain KG from texts.
As an alternative to open-domain KGs, domain-specific KGs offer many advantages, including noise reduction, improved relevance, better precision and accuracy, and seamless integration with domain-specific applications when the area of interest is targeted. Domain-specific KG construction methods have been developed for e-commerce and applied to build a question-answering task for a product compatibility recommender system [20]. This was achieved using question-and-answer text data related to products from an e-commerce website to generate triples. A KG is constructed from this set of triples according to a defined ontology. The KG is stored as RDF triples and used to automate responses for an e-commerce question-answering system, without the direct assistance of human attendants. Another account of a methodology that turns text into a domain-specific KG has been demonstrated over a variety of unstructured text to build a question-answering mechanism for movies [21]. The study uses an open information extraction implementation [22] to generate a list of entity-relation triples. Entities and relations from the extracted triples are encoded using the BERT language model [23] before being used to build a knowledge graph. The obtained KG is used for question-answering, where the authors' experiment demonstrates the efficiency of their multi-hop KG traversal and retrieval mechanism. The study also emphasizes the ability of the presented mechanism to build a knowledge graph without alignment with an external knowledge base, unlike some other prevalent knowledge graph construction approaches [24,25].

Business Network Knowledge Graph Construction and Analysis
Current literature on business networking analysis predominantly focuses on the perspective of individual companies understanding their relationships with their counterparts. One such study [26] describes the construction and exploration of an enterprise knowledge graph from a private structured database containing information on 40,000,000 companies. Their results allow the visualization of companies' interconnections, finding the real stakeholders in control of a company, discovering innovative companies that securities establishments would like to invest in, and understanding various types of relationships between companies, including competition, patent transfer, investment, and acquisition. Another study [27] demonstrated how a graph-learning model can elicit cooperation and competition links between companies by embedding a graph likewise built from structured data. They demonstrate the business value of their findings with a competition and cooperation analysis, showing how their results can be useful to a company's potential partners and competitors, and how they empower analysts with insights to partner with the competitors of their competitors, citing the "enemy of my enemy is my friend" principle. We believe those studies would benefit from going beyond structured data. Hillebrand et al. [28] fine-tuned BERT to identify specific concepts, such as key performance indicators (kpi), current year monetary value (cy), and davon, that are defined by the authors. In contrast to building a graph, their focus is on analyzing business documents, which is our domain of study, and they demonstrate an approach to handling specific entities within unstructured text documents.
Noticing the insufficient attention given to the macroscopic analysis of business networks at a market level, we previously demonstrated the preliminary steps for building a graph for business networking analysis from unstructured text [29]. We also showed that, using the obtained graph, we can make use of machine learning for graph models to classify the nodes, hence making some inferences about the types of nodes extracted from the text dataset. In this study, we present the design and implementation of a systematic pipeline for the automatic construction of a KG from text. We built a KG to understand interconnections between organizations, people, places, and products. We thus create a tool that generates graph data sources for providing insights into business networking-related tasks.

Materials and Methods
TraCER is a comprehensive toolchain designed as an integrated pipeline, facilitating the transformation of text into a knowledge graph. At its core, the process encompasses several critical subtasks, each contributing to the effective conversion of textual data into a graph format. The pipeline is initiated by a (i) preprocessing task that prepares the text as suitable content to be processed in subsequent stages. From the preprocessed text, (ii) word embeddings are computed, while (iii) relationship triples are extracted. The triple extraction step includes open information extraction (OpenIE), named entity recognition, and a filtering process for noise reduction. The extracted triples are then (iv) categorized. Lastly, (v) the graph is created from the extracted entities and categorized relationships. Figure 1 gives an overview of the methodology. Before describing the steps of our method below, we start by defining the scope of the information we extract to form our knowledge graph using a simple ontology.

Hogan et al. [2] identify three ways of representing a knowledge graph. The first one is a directed edge-labeled graph that represents a set of nodes and a set of directed labeled edges between those nodes, where the nodes can represent any concept. The second is a heterogeneous graph where each node and edge is assigned a type. The third one is a property graph, which adds flexibility to the heterogeneous graph. In this study, we opt for a heterogeneous graph. We determine the scope of our knowledge graph with ontology competency questions [30,31]. Then, we derive an ontology from schema.org (https://schema.org, accessed on 27 December 2023) with entity types that include organization, person, place, and product. The entity type Organization represents concepts such as companies and businesses. The entity type Person represents the concept of people. The entity type Place represents the concept of a geographical place. The entity type Product represents products. The relationships are labeled according to the types of entities they involve. When the entities involved in a relationship are organizations or people, the relationship is either competitive or collaborative, denoted "competes against" or "collaborates with", respectively. When the relationship is between an organization and a place, the relationship type is "operates in". For a relationship involving an organization and a product, the possible relation types are "consumes" and "produces". Table 1 summarizes the valid relationships with the types of entities as the source and destination, as well as their type of relationship. Figure 2 provides a visualization of a possible graph based on the defined ontology.
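For illustration, the valid entity-type pairs and relation labels described above (and summarized in Table 1) can be encoded as a simple lookup table. The following sketch is an assumption about how such a mapping might look; the identifier names are illustrative, not part of TraCER:

```python
# Hypothetical encoding of the ontology's valid relationships (cf. Table 1):
# maps (source entity type, destination entity type) -> allowed relation labels.
VALID_RELATIONS = {
    ("Organization", "Organization"): ["collaborates with", "competes against"],
    ("Organization", "Person"): ["collaborates with", "competes against"],
    ("Person", "Person"): ["collaborates with", "competes against"],
    ("Organization", "Place"): ["operates in"],
    ("Organization", "Product"): ["consumes", "produces"],
}

def allowed_relations(src_type: str, dst_type: str) -> list[str]:
    """Return the relation labels the ontology permits between two entity types."""
    return VALID_RELATIONS.get((src_type, dst_type), [])
```

Entity-type pairs absent from the table (e.g., Place and Product) yield no allowed relations, which mirrors the filtering role the ontology plays later in the pipeline.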

Preprocessing
Starting with text available on public news websites, the preprocessing phase aims at normalizing the text data. It consists of a preparatory phase that encompasses a series of operations. It includes the extraction of meaningful content by parsing raw text files, notably in HTML format, and the handling of intricacies such as duplicate entries, missing data, and special characters, as well as the elimination of HTML tags. Additional operations typically include tasks such as tokenization, lowercasing, removing punctuation, and stop word removal. These preprocessing steps underpin the efficacy and precision of the natural language processing (NLP) models we use downstream, facilitating the extraction of salient data and fostering a higher degree of computational efficiency.
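The cleanup operations above can be sketched in a few lines of Python. The regular expressions and the stop word list below are simplified placeholders for illustration, not the pipeline's actual configuration:

```python
import re

# Illustrative stop word list; a real pipeline would use a much fuller set (assumption).
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "are"}

def preprocess(raw_html: str) -> list[str]:
    """Sketch of the preprocessing stage: strip HTML tags, lowercase,
    tokenize on whitespace, and drop punctuation and stop words."""
    text = re.sub(r"<[^>]+>", " ", raw_html)          # remove HTML tags
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())  # lowercase, drop punctuation
    tokens = text.split()                              # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]
```

For example, `preprocess("<p>The merger of ACME Corp.</p>")` yields the normalized token list `["merger", "acme", "corp"]`.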

Word Embedding
Language models [23,32,33] and, recently, large language models [34][35][36] have made significant advancements in NLP tasks due to their ability to model context and generate coherent text, but they come with their own set of challenges, such as high computational requirements, ethical concerns, and potential biases. At this iteration of our study, we opt for word embeddings to represent the text we manipulate, for three main reasons. The first reason is computational efficiency. In many real-world applications, computational resources are limited. Word embeddings are far more computationally efficient than large language models. Furthermore, training word embeddings from the ground up is feasible on consumer-grade computers. The second reason is task-specific focus. Because our study is specific to the domain of business networking, we want our text representation to rely on a model trained on the very corpus from which we extract the graph. Fine-tuning a large language model is helpful for domain-specific tasks, but at the cost of excessive computational demands and the risk of salient data being buried in the relatively large amount of generic text data used to pre-train the language model, in case the text data of interest is scarce. The final reason is that our method combines interpretable models, in contrast to a large language model, which can reduce the stochasticity of the expected result. An additional reason is how word vectors from word embeddings are used in the relation categorization step of our method.
The skip-gram model from word2vec [37] is the one used to generate word vectors from the corpus obtained after preprocessing. Skip-gram builds word vectors by maximizing the average log probability of a word appearing in a window context of a given center word:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)$$

where $T$ is the number of words in the training corpus and $c$ is the size of the context window. $p(w_{t+j} \mid w_t)$ is defined using the softmax function:

$$p(w_o \mid w_i) = \frac{\exp(v_{w_o}^{\top} v_{w_i})}{\sum_{w=1}^{W} \exp(v_{w}^{\top} v_{w_i})}$$

where $w_o$ is a context word, $w_i$ is a center word, and $W$ is the size of the vocabulary, i.e., the number of unique words in the corpus. $v_{w_o}$ and $v_{w_i}$ are the embedded numerical vector representations for a context word and a center word, respectively. The final vector representation $v_w$ of a given word $w$ in the vocabulary is computed by pooling its vector representation as a context word $v_{w_o}$ and its vector representation as a center word $v_{w_i}$ [38].
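The softmax above can be sketched in pure Python on toy, untrained vectors; this is only an illustration of the probability computation, not the training procedure:

```python
import math

def skipgram_prob(v_context, v_center, vocab_vectors):
    """Softmax probability p(w_o | w_i): the exponentiated dot product of the
    context and center vectors, normalized over the whole vocabulary."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    numerator = math.exp(dot(v_context, v_center))
    denominator = sum(math.exp(dot(v_w, v_center)) for v_w in vocab_vectors)
    return numerator / denominator

# Toy 2-dimensional vocabulary of three words (illustrative values only).
vocab = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
center = [0.5, 0.5]
probs = [skipgram_prob(v, center, vocab) for v in vocab]
```

By construction, the probabilities over the vocabulary sum to one, which is what the denominator of the softmax guarantees.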

Triplet Extraction
Our methodology employs open information extraction (OpenIE) for the automated identification and extraction of relationships, facts, and entities from unstructured text without relying on human intervention. OpenIE typically starts by segmenting sentences from the text. Then, it performs part-of-speech tagging and dependency parsing on each sentence to identify its grammatical structure, including the roles of words and their relationships within the sentence. To simplify the identification of relations in sentences, nominalization is applied to transform verbs into nouns. Candidate relations and their corresponding arguments (entities or noun phrases) are identified based on the grammatical structure of the sentences using patterns and heuristics. Relations are extracted as fact triples that consist of a subject, a relation, and an object.
We apply a specific OpenIE approach [39] that generates triples by first breaking a long sentence into short coherent clauses. This is achieved by modeling the task as a search problem with three types of actions: yield, recurse, and stop. Second, the maximally simple relation triple warranted by each of these clauses is found. This is achieved by adopting a subset of natural logic semantics dictating the contexts in which lexical items can be removed. Then, the short entailed sentences are segmented into conventional open information extraction (OpenIE) triples.
We added a filtering process on top of the triple extraction to only keep triples of interest for our task. The filtering process consists of keeping triples where both the subject and the object contain named entity types of interest, as defined in Section 3.1. Algorithm 1 provides pseudocode of the filtering process. ExtractTriple is a subroutine that extracts triples from a sentence using OpenIE. FindNamedEntity is a subroutine that identifies named entities within parts of the extracted triple, such as the subject or the object. The named entity recognition (NER) process relies on the tokenization, part-of-speech tagging, and lemmatization of the text. The named entities are classified with a maximum entropy Markov model [40]. Most pre-trained NER models can recognize and filter the entity types appearing in our ontology, namely organizations, people, and places. We customize the model to tag and identify another entity type of interest for our domain of study: products.
Furthermore, Type is a subroutine that returns the type of a given named entity. Considering S the set of sentences and T_s_max the largest set of triples extracted from a sentence, the proposed triple-filtering process exhibits a complexity of O(|S| × |T_s_max|), with |S| and |T_s_max| the cardinalities of S and T_s_max, respectively.
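The filtering step can be sketched in Python as follows. The `extract_triples` and `find_named_entities` callables are stand-ins for the OpenIE and NER components described above (both names are hypothetical, mirroring the ExtractTriple and FindNamedEntity subroutines of Algorithm 1):

```python
# Entity types of interest from the ontology (NER tag names are assumptions).
TYPES_OF_INTEREST = {"ORGANIZATION", "PERSON", "LOCATION", "PRODUCT"}

def filter_triples(sentences, extract_triples, find_named_entities):
    """Sketch of the triple-filtering process (cf. Algorithm 1).
    extract_triples(sentence) -> list of (subject, relation, object) strings;
    find_named_entities(text) -> list of (entity, type) pairs."""
    kept = []
    for sentence in sentences:
        for subj, rel, obj in extract_triples(sentence):
            subj_types = {t for _, t in find_named_entities(subj)}
            obj_types = {t for _, t in find_named_entities(obj)}
            # Keep the triple only if both ends mention an entity type of interest.
            if subj_types & TYPES_OF_INTEREST and obj_types & TYPES_OF_INTEREST:
                kept.append((subj, rel, obj))
    return kept
```

The nested loop over sentences and their extracted triples is what yields the O(|S| × |T_s_max|) complexity stated above.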

Reducing Duplicates
To mitigate the impact of noisy and duplicate data in our extraction process, we implement two key strategies: coreference resolution and a heuristic-based entity-linking process. Coreference resolution identifies different textual expressions referring to the same entity, resulting in a better understanding of the relationships between entities mentioned differently in the text. Examples of resolved mentions are pronouns (e.g., "he", "she", "it") and generic entity coreferences (e.g., "the former"). For further redundancy reduction in identified relationships, we have developed a heuristic-based entity-linking process, detailed as follows: Definition 1. Let t be a triple in T, the set of extracted triples. Let e be a named entity in the object of t, with type(e) the entity type of e. Then e′, the named entity in the subject of t, is a link of e if the text of the relation part of t is 'be' and type(e) = type(e′).
This allows us to link two entities that belong to the same extracted triple, where the subject and object each consist of exactly one named entity, both named entities have the same type, and the predicate of the triple is the word "be". Illustrative examples are provided in Appendix B.
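A minimal sketch of the heuristic in Definition 1, assuming each triple arrives with its subject and object already reduced to a single typed named entity (the triple representation below is an assumption for illustration):

```python
def entity_links(triples):
    """Sketch of the entity-linking heuristic (Definition 1): two entities are
    linked when a triple's relation text is 'be' and the single named entities
    in its subject and object share the same entity type.
    Each triple is ((subj_entity, subj_type), relation, (obj_entity, obj_type))."""
    links = []
    for (subj, subj_type), relation, (obj, obj_type) in triples:
        if relation == "be" and subj_type == obj_type and subj != obj:
            links.append((subj, obj))  # e.g., an abbreviation linked to a full name
    return links
```

The `subj != obj` guard also discards reflexive relations, matching the duplicate-reduction goal of this step.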

Relation Categorization
Upon completing the previous steps, our methodology yields a network representation where nodes represent named entities of interest, and links are labeled according to relationships from OpenIE-generated triples. To further refine our network, we categorize these links. This step is beneficial for downstream graph-related tasks, such as link prediction, which require knowledge of link types.
Link categorization is inherently a classification task, typically approached via supervised learning, which necessitates labeled data. However, aiming to keep the construction process simple, we opted for an unsupervised learning approach, minimizing the need for extensive data labeling.
A vector produced using word-embedding algorithms such as skip-gram [37] can be manipulated with a vector offset method to identify linguistic regularities in continuous-space word representations [41]. Using algebraic operations on the vectors representing the words, analogies can be drawn. A frequently used example is king − man + woman = queen, meaning that the word vector closest to the one obtained by subtracting the vector of the word man from the vector of the word king, then adding the vector of woman, is the vector that represents the word queen. We can deduce that king − man ≈ queen − woman, assuming that the "−" operator acts as a pooling mechanism that gives a vector representing the relationship between the word king and the word man, which should be close, i.e., approximately equal, to the vector that represents the relation between the word queen and the word woman. Using this insight, we compute a vector representation for each relationship triple extracted using OpenIE by applying the vector offset method [41] to the vector representing the named entity in the subject and the one representing the named entity in the object. Relationship vectors are then grouped according to the types of entities involved in their subject and object, as described in Section 3.1. For each group, a clustering algorithm is used to assign the type of relationship.
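The vector offset computation reduces to an element-wise difference between the two entity word vectors; a minimal sketch:

```python
def relation_vector(v_subject, v_object):
    """Vector offset representation of a relationship: the element-wise
    difference between the subject and object entity word vectors."""
    return [s - o for s, o in zip(v_subject, v_object)]

# Toy 2-dimensional vectors illustrating the king - man ~ queen - woman analogy
# (values are made up for demonstration, not trained embeddings).
v_king, v_man = [2.0, 1.0], [1.0, 0.0]
v_queen, v_woman = [2.0, 2.0], [1.0, 1.0]
```

With these toy vectors, `relation_vector(v_king, v_man)` and `relation_vector(v_queen, v_woman)` coincide, which is exactly the regularity the relation categorization step exploits when clustering relationship vectors.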
Conforming to the ontology, and going through the described graph-building process, we ensure the construction of a domain-specific knowledge graph for business networking and interaction, cleared of noisy entity types and irrelevant relationships. However, it is important to note that relationships not defined in the ontology are omitted from the graph. Further implications and limitations are described in Section 5.2.

Implementation Details
The TraCER pipeline is predominantly implemented using the Python programming language, except for the triple extraction. Preprocessed text datasets are formatted as comma-separated values (CSV) files, serving as input to the pipeline. The CSV format includes, in sequential order, the article's text, title, publication date, and topic. It is worth noting that only the article content is mandatory, while the other data fields remain optional. The embeddings are computed using the Gensim [42] Python library's skip-gram implementation.
The triple extraction feature, named triplex, is implemented directly in Java, which is the programming language used by the Stanford CoreNLP library (https://stanfordnlp.github.io/CoreNLP, accessed on 27 December 2023), instead of using a Python wrapper, for performance reasons. We chose the Stanford CoreNLP library for its comprehensive set of features, which align well with our implementation's requirements [43]. Triplex is responsible for coreference resolution, open information extraction, and named entity recognition. Our setup includes tokenization, sentence splitting, part-of-speech tagging, lemmatization, named entity recognition, and open information extraction. The triple-filtering algorithm is implemented on top of the triple-extraction setup. We employed additional configurations for named entity recognition where necessary. Since we are considering the field of business networking, we are interested in entities that represent organizations, places, people, and products. The CoreNLP library inherently recognizes the first three entity types. To address products, we curate a specific set, which is then integrated into the named entity recognition system using the additional TokensRegexNER configuration mechanism within the library. The entity-linking mechanism takes as input the triples generated by triplex to identify linked entities and eliminate reflexive relations.
Following this step, the relationship categorization feature employs the outputs of the embedding stage and triplex to represent relations and categorize them as described in Section 3.6. In the process of extracting triples from the text, we assign relationship types to the extracted triples based on our predefined ontology. This assignment follows a structured approach: First, we group relationships according to the types of entities involved. Next, we represent these relationships as vectors using the vector offset method, which leverages the word vectors of the two entities linked by the relationship. Within our ontology, each group of relationship pairs typically falls into one of two categories: those with a single relationship type and those with two relationship types. For the former, we straightforwardly assign the sole relationship type. However, in the case of groups with two relationship types, we employ K-means clustering to categorize these relationships effectively. The resulting K-means clusters of relationship vectors can then be labeled by human analysts for interpretability and context. We use the K-means clustering implementation from the Scikit-learn Python package [44].
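For illustration, the two-cluster split applied to groups with two relationship types can be sketched with a minimal pure-Python 2-means routine, standing in for the Scikit-learn KMeans call used in the pipeline (the naive initialization below is for demonstration only, not the library's behavior):

```python
def two_means(vectors, iters=10):
    """Minimal 2-means clustering sketch: assign each relationship vector to
    the nearer of two centroids, then recompute centroids, for a few rounds."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    centroids = [vectors[0], vectors[-1]]  # naive initialization (assumption)
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [0 if dist(v, centroids[0]) <= dist(v, centroids[1]) else 1
                  for v in vectors]
        for k in (0, 1):
            members = [v for v, l in zip(vectors, labels) if l == k]
            if members:  # recompute centroid as the mean of its members
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels
```

Each resulting cluster of relationship vectors would then be mapped to one of the group's two relation labels (e.g., "produces" versus "consumes") by a human analyst.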
From the result, we build a graph using NetworkX 3.2 (https://networkx.org/, accessed on 27 December 2023), a Python 3 package for the creation and manipulation of complex networks. NetworkX was selected because of the rich features it offers to study the structure and dynamics of networks; its network representations are also compatible with notable graph-learning packages, such as Deep Graph Library 1.1 (https://www.dgl.ai/, accessed on 27 December 2023), that we use down the line for predictive tasks.

Results
We applied our proposed methodology to a corpus related to business news from the web.

Dataset
Reuters-21578 is a collection of documents that appeared on the Reuters newswire in 1987. The documents were assembled and indexed with categories by personnel from the Reuters Ltd. news company. The version of the dataset used in this study, which contains 21,578 documents, is a curated collection that was made available in 1996. This dataset was specifically built to serve as a resource for corpus-based research in areas such as information retrieval and text categorization.
This dataset is suitable for our study because its size allows for relatively rapid prototyping and the implementation of solutions for hypothesis testing. It offers various metadata related to the documents, including dates and topics, which can be leveraged in building insightful models. Moreover, the Reuters-21578 dataset is freely accessible through the web (https://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html, accessed on 27 December 2023), making it easy for third parties to reproduce the obtained results.

Graph Extraction
Our experimental dataset is provided as SGML (Standard Generalized Markup Language) files; SGML is a hierarchical, tag-structured markup language similar to HTML. All the preprocessing steps described in Section 3.2 targeting HTML files had to be applied to the Reuters-21578 dataset to extract the article content as plain text. During preprocessing, special characters not pertinent to the newswire content were removed. Articles focused on earnings, which essentially consist of tables of numbers, and articles with fewer than a hundred characters were discarded, resulting in a dataset of 16,828 articles.
Given the small size of our experimental dataset, we selected the following key hyperparameters when computing the embeddings: the window size, which describes the range of context, was set to five words, and the vector size of the embeddings to 50 dimensions.
Using the triplex tool for triple extraction on the curated Reuters-21578 dataset yielded 13,842 triples. Table 2 provides a sample of identified named entities from the extracted triples.
We also applied our entity-linking heuristic to identify entity links from the triples. Despite being simple, the heuristic is quite effective at linking entities to their different names, including abbreviations and names in other languages. The extracted triples are relations that involve named entities of interest appearing in both their subject and object, with reduced noise, redundancies, and linked entities. Table 3 is a sample list of recognized entity links from the triples. After applying the entity-linking heuristic, we obtained 11,849 triples of interest. A sample of extracted triples is provided in Table 4. We evaluate the linking heuristic using an F-measure, the F1 score, which is the harmonic mean of precision and recall, calculated using:

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \quad \text{precision} = \frac{TP}{TP + FP}, \quad \text{recall} = \frac{TP}{TP + FN}$$

with TP (true positives) being the number of samples correctly identified as entity links, FP (false positives) the number of samples incorrectly identified as entity links, and FN (false negatives) the number of entity links not identified by the heuristic. We measured an F1 score of 0.82, showing that the heuristic can successfully identify local entity links, even though there is room for improvement. For the relation categorization step, the clustering accuracy was evaluated using the adjusted Rand score. From the Reuters dataset extractions, K-means clustering of relationships involving organizations and persons effectively put 86% of collaborative and competitive relationships in the same cluster. Relationships involving organizations and products were effectively clustered into production and consumption relationships with an 82% accuracy score using the same approach. Excluding relationships with a single category, this gives an average accuracy of 84% on the Reuters dataset.
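The evaluation metric reduces to a few lines; a minimal sketch computing the F1 score directly from the raw counts of true positives, false positives, and false negatives:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 score (harmonic mean of precision and recall) from raw counts,
    as used to evaluate the entity-linking heuristic."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For instance, with equal precision and recall the F1 score equals both, since the harmonic mean of two equal values is that value.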
Putting the text dataset through the proposed pipeline, we can represent the data as a network of nodes representing entities of interest, connected by categorized links. Figure 3 gives an overview of the resulting network. Appendix A provides labeled graphs extracted from sample articles of the Reuters-21578 dataset for detailed visualization.

Discussion
Unlike existing text-to-knowledge-graph approaches that use sentence stubs such as verbs or noun phrases as link labels [21], our pipeline classifies relations into predefined categories, requiring minimal resources and no annotations. The categorized links allow a link prediction algorithm to predict the nature of links between two entities in addition to predicting their existence.
Our experimental dataset is modest by contemporary standards, which raises the question of the scalability of our methodology. While scalability is not addressed in this study, we leave it as a prospect for future work. One way to address it would be to apply our methodology to subsets of a large-scale text dataset and merge the resulting networks; Wu et al. [45], for instance, have demonstrated state-of-the-art methodologies in this area.
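As a sketch of this divide-and-merge idea (our own simplification, not an implementation from [45]): extract edge sets independently per subset, then take their union, relying on the entity-linking step to have normalized node names so that duplicate edges collapse. Tracking how many subsets support each edge is a useful by-product.

```python
from collections import Counter

def merge_networks(edge_sets):
    """Merge (source, relation, target) edge sets extracted from independent
    corpus subsets. Returns the union of edges with a support count, i.e.,
    how many subsets contributed each edge."""
    support = Counter()
    for edges in edge_sets:
        # each subset contributes an edge at most once to the support count
        support.update(set(edges))
    return support

# usage: two subsets mentioning an overlapping interaction (toy data)
part1 = {("IBM", "collaboration", "Intel"), ("Intel", "production", "chips")}
part2 = {("Intel", "production", "chips"), ("Shell", "competition", "BP")}
merged = merge_networks([part1, part2])
```

The merged counter contains three distinct edges, with the overlapping production edge supported by both subsets.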
Additional preliminary steps were taken before building the network, including coreference resolution and entity linking, which reduced noisy data such as duplicated entities in the graph by more than 70% compared to the graph generated by the previous implementation [29]. Appendix C provides an ablation study with a detailed analysis of the impact of key subtasks in reducing redundancies during graph construction.

Use Cases
In our previous work [29], we showcased the embedding of the resultant graph using a random walk approach [13]. We then employed node property classification to recover missing node properties, such as entity types [29]. In this section, we extend our exploration of potential use cases for the generated graph, particularly within the realm of business networking analysis.
A natural application that emerges is graph neural-network-based link prediction, which is instrumental in uncovering previously undetected interactions between organizations or entities that may have gone unmentioned in the text.
By applying our proposed graph extraction pipeline to articles grouped into timestamp windows across a chronological sequence, we can curate a temporal graph dataset, i.e., a dataset that enables the use of temporal graph representation learning methods. Such a dataset can be used to predict future business networking interactions through inductive link prediction over graphs.
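A minimal sketch of the windowing step, assuming articles arrive as (date, text) pairs; the window width and dates below are illustrative parameters, not values fixed by our method.

```python
from collections import defaultdict
from datetime import date

def group_by_window(articles, start, window_days=30):
    """Bucket (date, text) articles into consecutive fixed-width windows,
    so each bucket can be run through the extraction pipeline to produce
    one snapshot of a temporal graph sequence."""
    buckets = defaultdict(list)
    for ts, text in articles:
        buckets[(ts - start).days // window_days].append(text)
    return dict(buckets)

# usage: three articles spanning two 30-day windows
articles = [
    (date(1987, 3, 1), "article a"),
    (date(1987, 3, 10), "article b"),
    (date(1987, 4, 15), "article c"),
]
windows = group_by_window(articles, start=date(1987, 3, 1))
```

Running the pipeline once per bucket then yields an ordered sequence of graph snapshots suitable for temporal representation learning.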
Moreover, another compelling utilization of the graph resulting from our proposed method is the extraction of trading-related content from news texts originating from diverse international markets. For instance, by conducting pairwise graph isomorphism tests on subgraphs composed of organization nodes and their respective neighborhoods from either of the generated graphs, we could effectively identify similar companies operating in different markets.
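As an illustration of the pairwise test, a brute-force isomorphism check is sufficient for the small node counts of organization neighborhoods; for larger subgraphs, a dedicated algorithm such as VF2 would be needed. The subgraph layout below (node lists plus directed edge sets) is an assumption for illustration.

```python
from itertools import permutations

def isomorphic(nodes_a, edges_a, nodes_b, edges_b):
    """Brute-force test of whether two small directed subgraphs are
    isomorphic, i.e., identical up to a relabeling of their nodes."""
    if len(nodes_a) != len(nodes_b) or len(edges_a) != len(edges_b):
        return False
    for perm in permutations(nodes_b):
        mapping = dict(zip(nodes_a, perm))
        # check whether this relabeling maps edges of A exactly onto edges of B
        if {(mapping[u], mapping[v]) for u, v in edges_a} == set(edges_b):
            return True
    return False
```

Two organization neighborhoods with the same connectivity pattern but different company names would test as isomorphic, flagging the companies as structurally similar across markets.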
These extended applications of our methodology underscore its versatility and potential impact in facilitating a deeper understanding of dynamic business networks and inter-market relationships.

Current Limitations
Our methodology can generate graphs with sufficient quality to be used for business networking analysis at low computational cost. Yet two noticeable limitations can be identified within the current iteration of our methodology.
Extracting relations with OpenIE is primarily a rule-based, heuristic-driven approach; its performance depends on the quality of the linguistic patterns and heuristics used for relation extraction. While it can extract valuable structured information from text, it may not capture relationships described across multiple sentences, or other nuances, the way more advanced large language models can.
A relation classification accuracy score of 84% demonstrates the ability to group relationship types by applying clustering models to word vector representations without annotations. Yet this score is well below state-of-the-art scores in stand-alone relation classification tasks, which reach 97% on partially annotated datasets such as FewRel while relying on large language models and few-shot learning [46].
The integration of large language models in the extraction of a business network knowledge graph based on our ontology is the primary concern of our next iteration. This will come at the cost of high computation and the other drawbacks described earlier in Section 3.3.

Conclusions
Effectively transforming business news into structured networks for analysis presents significant opportunities for business networking and thus motivated this work. This study introduced TraCER, a pipeline that converts unstructured news text into a graph. The pipeline, composed of key subtasks including text preprocessing, named entity recognition, entity-linking, triple extraction, and relation classification, demonstrated the ability to automatically build a domain-specific knowledge graph for business networking analysis from the Reuters corpus. TraCER is computationally efficient and based on interpretable models and heuristics.
TraCER opens possibilities including business-to-business interaction prediction and business type prediction. Beyond these applications, TraCER can enable tracking chronological business interactions and matching businesses across different markets. However, the current stage of our study presents some limitations; one is the inability to identify relationships described across multiple sentences. This paves the way for future work involving the use of large language models in the construction of business networking knowledge graphs.
Appendix A
From the triples, the named entities are isolated and used to classify each relationship into a category from the ontology by employing the proposed clustering mechanism. Table A2 shows named entity pairs involved in triple relationships and their corresponding assigned types. We can observe that even though some names are mentioned in the text, they are not part of the extracted named entities. This happens because they likely do not appear in OpenIE extractions that respect our ontology rules and are thus discarded. This can be a limitation, as discussed in Section 5.2, when a relationship is described across multiple sentences. We plan to address this in future work with the use of language models.
From the identified entities and categorized relationships, we can build the graph. Figure A1 presents the resulting graph for the sampled article. This detailed perspective further underscores the efficacy of our automated graph extraction approach in capturing and visualizing key insights within textual data for business networking analysis.

Appendix B
Here, we provide a couple of examples of the linking heuristic described in Section 3.5. Each example consists of a sentence, a candidate triple extraction with associated named entity types, and whether or not it represents an instance of an entity link.

Example 1
Sales of previously owned homes dropped 14.5 pct in January to a seasonally adjusted annual rate of 3.47 mln units, the National Association of Realtors (NAR) said.
The candidate triple (National Association of Realtors [ORGANIZATION], be, NAR [ORGANIZATION]) matches the definition; thus, NAR is identified as being the same entity as the National Association of Realtors.

Example 2
Kevlar was invented by Du Pont in the late 1960s and is five times stronger than steel and 10 times stronger than aluminum on an equal wieght basis, and is used to replace metals in a variety of products, according to the company.
The candidate triple (Kevlar [PERSON], be, Du Pont [ORGANIZATION]) is an example that does not match the definition; thus, this triple will be later filtered and classified as an edge in the graph.
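The check illustrated by the two examples above can be sketched as follows. This is a simplification of the Section 3.5 heuristic (a copular "be" predicate linking two mentions of the same named-entity type); the function signature and argument layout are assumptions for illustration.

```python
def is_entity_link(subject, subject_type, predicate, obj, obj_type):
    """Return True when a candidate triple looks like an entity link:
    a copular ('be') predicate between two mentions that share the same
    named-entity type. Triples failing the check are kept as graph edges."""
    return predicate.strip().lower() == "be" and subject_type == obj_type
```

Applied to the examples: the (National Association of Realtors, be, NAR) triple passes because both mentions are ORGANIZATION, while the (Kevlar, be, Du Pont) triple fails because the mention types differ (PERSON vs. ORGANIZATION).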

Appendix C
We conduct an ablation study to understand the impact of key steps in our proposed pipeline. The ablation consists of comparing the number of extracted entities and relationships for different configurations of our pipeline on the Reuters corpus. The configurations include the full TraCER pipeline, the pipeline without coreference resolution (no coref), and the pipeline without the entity-linking heuristic (no link). Table A3 summarizes the extraction results, from which we can observe the contribution of each step in reducing noise and redundancy in the resulting graph. Another configuration removed the relation categorization from the full pipeline to inspect the number of unique relationships. Without categorization, the pipeline produces 3379 unique relationships, of which a large portion appear fewer than ten times. This would have been a handicap for link prediction algorithms without the categorization mechanism, which reduces the number of unique links to five.

Figure 1 .
Figure 1. Overview of the proposed method.

Figure 2 .
Figure 2. Visual of a possible extraction based on the defined ontology.

Figure 3 .
Figure 3. Visualization of the obtained network after running the pipeline over the entire preprocessed Reuters-21578 corpus. Entity nodes are positioned according to the 2D projection of their vector representations. Node colors indicate the type of entity and link colors the group/class of the relation.

Figure A1 .
Figure A1. Visualization of the graph extracted from the sample article.

Table 1 .
Type of source and destination entities and their possible relationship types, derived from our ontology.

Algorithm 1
Triplet Filtering
Require: S, the set of sentences in the dataset
Require: C, the set of named entity types of interest
T ← ∅ /* set of filtered triples */
for all s ∈ S do
    T_s ← ExtractTriples(s)
    for all t ∈ T_s do
        s_e ← FindNamedEntity(t.subject)
        o_e ← FindNamedEntity(t.object)
        if type(s_e) ∈ C and type(o_e) ∈ C then
            T ← T ∪ {t}
        end if
    end for
end for
return T

Table 2 .
Sample of identified named entities from the Reuters corpus and their associated type.

Table 3 .
Sample of identified entity links.

Table 4 .
Sample triple extractions from the Reuters corpus after running OpenIE with the triples filter.

Table A1 .
Triples extracted from the sample article.

Table A2 .
Type and named entity involved in extracted relationships and their respective assigned relationship type.

Table A3 .
Number of extracted entities and relationships for different configurations of the TraCER pipeline.