Sensors
  • Article
  • Open Access

21 December 2021

Unsupervised Event Graph Representation and Similarity Learning on Biomedical Literature

1 Department of Computer Science and Engineering (DISI), University of Bologna, 40126 Bologna, Italy
2 Independent Researcher, 48018 Faenza, Italy
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Role and Challenges of Healthcare Cognitive Computing: From Extraction to Data Analysis Techniques

Abstract

The automatic extraction of biomedical events from the scientific literature has drawn keen interest in the last several years, recognizing complex and semantically rich graphical interactions otherwise buried in texts. However, very few works revolve around learning embeddings or similarity metrics for event graphs. This gap leaves biological relations unlinked and prevents the application of machine learning techniques to promote discoveries. Taking advantage of recent deep graph kernel solutions and pre-trained language models, we propose Deep Divergence Event Graph Kernels (DDEGK), an unsupervised inductive method to map events into low-dimensional vectors, preserving their structural and semantic similarities. Unlike most other systems, DDEGK operates at a graph level and does not require task-specific labels, feature engineering, or known correspondences between nodes. To this end, our solution compares events against a small set of anchor ones, trains cross-graph attention networks for drawing pairwise alignments (bolstering interpretability), and employs transformer-based models to encode continuous attributes. Extensive experiments have been done on nine biomedical datasets. We show that our learned event representations can be effectively employed in tasks such as graph classification, clustering, and visualization, also facilitating downstream semantic textual similarity. Empirical results demonstrate that DDEGK significantly outperforms other state-of-the-art methods.

1. Introduction

Advances in both research and computational methods act as the catalyst for a large number of scientific publications, especially in the biomedical domain []. Identifying and tracking the discoveries and theories disseminated within life science papers is critical to speed up medical progress. However, keeping up with all the latest articles has become increasingly challenging for human domain experts. According to bibliometric results verifiable at https://pubmed.ncbi.nlm.nih.gov/ using the query “1700:2021[dp]”—accessed on 11 December 2021—the annual rate of biomedical publications registered on PubMed is growing exponentially. As biology literature increases in the form of text documents, the automatic extraction of biomedical entities (e.g., proteins, genes, diseases, drugs) and their semantic relations has been intensively investigated []. Recent developments in deep learning (DL) and natural language processing (NLP) have enabled intelligent ways to uncover structured, concise, and unambiguous knowledge mentioned in large unstructured text corpora. In biomedical text mining, classified named entities and pairwise relations are notoriously insufficient for understanding interactions ranging from molecular-level reactions to organism-level outcomes []. As a result, the extraction of complex relations (namely, events) from the text—involving nested structures and multiple participants—has become a quintessential task closely tied to semantic parsing. According to the BioNLP-ST competitions [,,], events are composed of a trigger (a textual mention that clearly signals their occurrence, e.g., “interacts”, “regulates”), a type (e.g., “binding”, “regulation”), and a set of arguments with a specific role, which can be entities or events themselves. Following in the footsteps of NLP breakthroughs, events underpin many valuable applications, such as literature-based knowledge discovery [], biological network construction [], pathway curation [], diagnosis prediction [], and question answering [].
Like many other relational data in real-world scenarios, bio-events can be conveniently and naturally modeled as graphs (directed, acyclic, and labeled) []. As depicted in Figure 1, nodes are given by triggers (roots) and entities, while edges represent argument roles. In this vein, graphs are a ubiquitous modeling tool with solid mathematical foundations. Notably, the application of deep neural networks on non-Euclidean graph-structured data is an emerging trend [,]. Effective graph analytics provide researchers with a deeper understanding of the data, supporting various tasks in multiple domains, including bioinformatics, chemoinformatics, neuroscience, and sociology. These tasks range from protein folding to recommender systems, from social network analysis to molecular studies and drug design [].
Figure 1. A biomedical event graph extracted from a sentence. The example exhibits two nested events: (i) a Positive regulation event (+Reg), anchored by the trigger “promote”, affecting “tumorigenesis” as its Theme argument; (ii) a Gene expression of “Bmi-1”, which serves as the Cause argument of (i).
Leaving aside link predictions and node-, edge-, or graph-level classifications, one of the fundamental problems is to retrieve a set of similar graphs from a database given a user query. The matching, clustering, ranking, and retrieval of biomedical event graphs expressed in the literature can significantly facilitate the work of scientists in exploring the latest developments or identifying relevant studies. For instance, it would allow searching for interactions similar to one provided in input, looking for events involving certain biomedical entities or types of participants with precise roles. Additionally, it would enable the aggregation and quantification of highly related concept units, recognizing the number of times an interaction has been expressed. To solve this task, graph similarity metrics based on predefined distance notions and feature engineering are very costly to compute in practice and hardly generalizable []. In the last few years, researchers have therefore formulated graph similarity estimation as a learning problem []. Deep graph similarity learning (GSL) uses deep strategies to automatically learn a metric for measuring the similarity scores between graph object pairs. The key idea is training a model that maps input graphs to a target vector space such that the distance in the target space approximates the one between the symbolic representations in the input space (Figure 2). It follows that GSL is strongly interconnected with graph representation learning (GRL). By projecting raw graph data into continuous low-dimensional vector spaces while preserving intrinsic graph properties, the so-called graph embeddings provide a general tool to efficiently store and access relational knowledge in both time and space []. These features can promote the resolution of many machine learning problems on graphs via standard frameworks suitable for vectorized representations.
Figure 2. Illustration of similarity-preserving event graph embeddings, with sample events taken from a cancer genetics database (CG13). Each event graph—for which the textual mention is also shown—is mapped into an embedding vector (denoted as a dot in the simplified 2D space). The target embedding function should consider both structure and semantics. The example reports some interesting cases: (i) graphs with the same structure and overall semantics ($G_1$ and $G_2$), where NB is the acronym for neuro-blastoma; (ii) graphs with different structures but similar semantics ($G_2$ and $G_3$), where the process represented by $G_2$ is contained in $G_3$; (iii) graphs with the same structure but different semantics ($G_4$ and $G_5$). To be precise, we note that our embeddings are not calculated directly from the raw literature but from events mentioned in abstracts or full texts, mainly obtainable from existing annotations or predictions of event extraction systems (blue arrow).
Despite their potential, GSL and GRL contributions dedicated to the event sphere are still scarce and limited. Indeed, the high representational power of event graphs brings with it considerable challenges. First, events are intrinsically discrete objects with an irregular structure that can vary significantly across instances, even if associated with the same event type; they are more difficult to analyze than text/image/video/audio defined on regular lattices. Second, there are no datasets reporting similarity scores between event pairs for supervised approaches, which can be tedious and time-consuming to produce in large numbers. Nevertheless, existing works mostly fail to utilize unlabeled graph pairs for metric learning. Third, systems should be interpretable in the process of predicting similarities; in the biomedical domain, more than others, reliability is a requirement that cannot be ignored. Fourth, as is often the case with labeled graphs [,], subtle nuances in labels can make two events semantically agreeing or not, while events with distinct structures can still be similar. Therefore, to be broadly useful, a successful solution should consider both event graph structure and semantics, fine-grained (e.g., synonym) and coarse-grained (broad scenario). Fifth, many methods rely on a known correspondence between the nodes of two graphs, becoming impractical in the event realm, where node labels are free text spans and there are multiple ways of referring to the same entity or relation (i.e., no id sharing). Sixth, looking closer to GRL, most work focuses on node-, edge-, or community-level representations above a single large input graph. Instead, events extracted from a corpus correspond to many independent and small-sized graphs, primarily asking for whole-graph embeddings. Seventh, event representations should be general, but present solutions tend to learn intermediate graph and node embeddings only as a precondition to maximize accuracy on a set of narrow classification tasks, giving rise to heavily biased vectors. Finally, we would like not to be bound to a set of graphs known in advance, preferring inductive embedding approaches.
This paper addresses the problem of similarity and representation learning for biomedical event graphs. Specifically, we propose an unsupervised method for computing general-purpose event graph representations using deep graph kernels to satisfy all the aforementioned needs. With the term “unsupervised”, we mean models based only on information available in the event graphs (structure and node/edge features), without task-specific labels (e.g., similarity scores) or loss functions. Our work is based on DDGK [], which is capable of deriving a kernel from structural and semantic divergences between graph pairs without requiring feature engineering, algorithmic insights, similarity labels, structural or domain-specific assumptions (like graph primitives’ importance). Moreover, it includes a cross-graph attention mechanism to probabilistically align node representations, aiding the interpretation of proximities by looking for similar substructures. Guided by event characteristics in the NLP area, we provide an extended version of DDGK, called Deep Divergence Event Graph Kernels (DDEGK), integrating pre-trained language models to capture semantic consistencies among continuous labels.
To sum up, our main contributions are:
  • In-depth literature analysis. We offer newcomers in the field a global perspective on the problem with insightful discussions and an extensive reference list, also providing systematic taxonomies.
  • Deep Divergence Event Graph Kernels. A novel method of event graph similarity learning centered on constructing general event embeddings with deep graph kernels, considering both structure and semantics.
  • Experimental results. We conduct extensive experiments to demonstrate the effectiveness of DDEGK in real-world scenarios. We show that our solution successfully recognizes fine- and coarse-grained similarities between biomedical events. Precisely, when used as features, the event representations learned by DDEGK achieve new state-of-the-art or competitive results on different extrinsic evaluation tasks, comprising sentence similarity, event classification, and clustering. To shed light on the performance of different embedding techniques, we compare with a rich set of baselines on nine datasets having distinct biological views.
The rest of the paper is organized as follows: First, in Section 2, we examine related work, outlining a bird’s-eye view of the research context. Then, to provide a solid foundation, Section 3 introduces notation and preliminary concepts necessary to understand our study. Next, Section 4 details the datasets and clearly describes the architecture and operation of DDEGK, while Section 5 presents the experiments and the results obtained. Section 6 explains findings, limits, and applicability. Finally, Section 7 closes the discussion and points out future directions.

3. Notation and Preliminaries

Before continuing with our contribution, we formally provide the necessary notation and definitions of the core concepts that will be used throughout the paper.
Definition 1. (Event Graph). A graph, denoted by $G = (V, E)$, consists of a finite set of vertices, $V = \{v_1, \ldots, v_{|V|}\}$, and a set of edges, $E \subseteq V \times V$, $|E| = m$, where an edge $e_{i,j}$ connects vertex $v_i$ to vertex $v_j$. In the case of event graphs, edges are directed, unweighted, and there are no cycles. A vertex represents a trigger or an entity, while an edge models an entity–trigger or a trigger–trigger relation, with the second applying for nested events. Node connections are encoded in an adjacency matrix $A \in \mathbb{R}^{|V| \times |V|}$, where $a_{ij} = 1$ if there is a link between nodes $i$ and $j$, and 0 otherwise. Nodes and edges in $G$ are associated with type information. Let $\tau_v : V \to T_v$ be a node-type mapping function and $\tau_e : E \to T_e$ be an edge-type mapping function, where $T_v$ indicates the set of node (event or entity) types, and $T_e$ the set of edge (argument role) types. Each node $v_i \in V$ has one specific type, $\tau_v(v_i) \in T_v$; similarly, for each edge $e_{ij}$, $\tau_e(e_{ij}) \in T_e$. Since $|T_v| + |T_e| > 2$, event graphs are heterogeneous networks. An event graph is also endowed with a label function $\gamma_v : V \to \Gamma_v$ that assigns unconstrained textual information to all nodes. We say that $\gamma_v(v_i)$ is the continuous attribute of $v_i$.
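To make the notation concrete, the following minimal Python sketch shows one possible in-memory representation of an event graph matching Definition 1; the class, field names, and the encoded example (loosely based on Figure 1) are illustrative and not part of DDEGK.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class EventGraph:
    """Directed, acyclic, labeled graph G = (V, E) for a biomedical event."""
    nodes: List[str] = field(default_factory=list)                       # vertex ids v_1 .. v_|V|
    edges: List[Tuple[str, str]] = field(default_factory=list)           # directed edges (v_i, v_j)
    node_type: Dict[str, str] = field(default_factory=dict)              # tau_v: event/entity type per node
    edge_type: Dict[Tuple[str, str], str] = field(default_factory=dict)  # tau_e: argument role per edge
    node_text: Dict[str, str] = field(default_factory=dict)              # gamma_v: free-text span per node

# Hypothetical encoding of the nested event of Figure 1, with made-up node ids.
g = EventGraph()
g.nodes = ["T1", "T2", "E1", "E2"]
g.edges = [("T1", "E1"), ("T1", "T2"), ("T2", "E2")]
g.node_type = {"T1": "Positive_regulation", "T2": "Gene_expression",
               "E1": "Cancer", "E2": "Gene_or_gene_product"}
g.edge_type = {("T1", "E1"): "Theme", ("T1", "T2"): "Cause", ("T2", "E2"): "Theme"}
g.node_text = {"T1": "promote", "T2": "expression", "E1": "tumorigenesis", "E2": "Bmi-1"}
```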
Definition 2. (Graph and Subgraph Isomorphism). Two unlabeled graphs $G_1$ and $G_2$ are isomorphic, denoted by $G_1 \simeq G_2$, if there exists a bijection $\phi : V(G_1) \to V(G_2)$ such that $(u, v) \in E(G_1)$ if and only if $(\phi(u), \phi(v)) \in E(G_2)$ for all $(u, v) \in E(G_1)$. Then, $\phi$ is an isomorphism. For labeled graphs, isomorphism holds only if the bijection maps vertices and edges with the same label. Subgraph isomorphism is a generalization of the graph isomorphism problem, where the goal is to determine whether $G_1$ contains a subgraph that is isomorphic to $G_2$. No polynomial-time algorithm is known for graph isomorphism. More precisely, while subgraph isomorphism is known to be NP-complete, graph isomorphism belongs to NP but is neither known to be NP-complete nor known to be solvable in polynomial time.
Definition 3. (Graph Kernel). Given two vectors $x$ and $y$ in some feature space $\mathbb{R}^n$, and a mapping $\varphi : \mathbb{R}^n \to \mathbb{R}^m$, a kernel function is defined as $k(x, y) = \varphi(x)^T \varphi(y)$. In a nutshell, a kernel is a similarity function—satisfying the conditions of symmetry and positive definiteness—that can be interpreted as the dot product of two vectors after being projected into a new space. Thus, a kernel function can be applied to compute the dot product of two vectors in a target feature space (typically high-dimensional) without the need to find an explicit space mapping. In structure mining, a graph kernel is simply a kernel function that computes an inner product on graph pairs, typically comparing local substructures [].
Definition 4. (Whole Graph Representation Learning). Whole graph representation learning aims to find a mapping function $\Psi$ from a discrete graph $G$ to a continuous vector $\Psi(G) \in \mathbb{R}^d$, preserving important graph properties. If $d$ is low, we talk about graph embedding. Graph embedding can be viewed as a dimensionality reduction technique for graph-structured data, where the input is defined on a non-Euclidean, high-dimensional, and discrete domain.
Definition 5. (Graph Similarity Learning). Let $\mathcal{G}$ be an input set of graphs, $\mathcal{G} = \{G_1, G_2, \ldots, G_n\}$; graph similarity learning aims to find a function $S : (G_i, G_j) \to \mathbb{R}$, returning a similarity score $s_{ij}$ for any pair of graphs $G_i, G_j \in \mathcal{G}$.

4. Materials and Methods

4.1. Datasets

Nine real-world datasets—originally designed for biomedical event extraction—are used within this paper to evaluate DDEGK on gold instances (and not silver ones), thus avoiding additional error sources in system predictions. Most of them are benchmarks with human-curated labels introduced by the ongoing BioNLP-ST series, one of the most popular community efforts in biomedical text mining. Each deals with topics from a distinct sub-area of biology, supported by literature-based corpora.
  • BioNLP-ST 2009 (ST09) []. Dataset taken from the first BioNLP-ST challenge, consisting of a sub-portion of the GENIA event corpus. It includes 13,623 events (total between train, validation, and test sets) mentioned in 1210 MEDLINE abstracts on human blood cells and transcription factors.
  • Genia Event 2011 (GE11) []. Extended version of ST09, also including ≈4500 events collected from 14 PMC full-text articles.
  • Epigenetics and Post-translational Modifications (EPI11) []. Dataset on epigenetic change and common protein post-translational modifications. It contains 3714 events extracted from 1200 abstracts.
  • Infectious Diseases (ID11) []. Dataset on two-component regulatory systems; 4150 events recognized in 30 full papers.
  • Multi-Level Event Extraction (MLEE) []. Dataset on blood vessel development from the subcellular to the whole organism; 6667 events from 262 abstracts.
  • Genia Event 2013 (GE13) []. Updated version of GE11, with 9364 events extracted exclusively from 30 full papers.
  • Cancer Genetics (CG13) []. Dataset on cancer biology, with 17,248 events from 600 abstracts.
  • Pathway Curation (PC13) []. Dataset on reactions, pathways, and curation; 12,125 events from 525 abstracts.
  • Gene Regulation Ontology (GRO13) []. Dataset on human gene regulation and transcription; 5241 events from 300 abstracts.

4.1.1. Data Preprocessing and Sampling

Instances within the datasets are formed by a text document (.txt) and two annotation files in a standard format adopted by BioNLP-ST, one dedicated to gold entities (.a1) and one including triggers with their relationships (.a2). We parse standoff files (.a*) to automatically transform labeled text spans into event graphs (Figure 5).
Figure 5. Example of .a* parsing. The textual document is shown for clarity.
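As an illustration of this parsing step, the sketch below reads simplified BioNLP-ST standoff annotations (.a1/.a2) and collects entities, triggers, and events; it only handles the text-bound (T) and event (E) line types, and the annotation strings shown are hypothetical.

```python
def parse_standoff(a1_text: str, a2_text: str):
    """Parse simplified .a1/.a2 standoff annotations.

    Returns (spans, events): text-bound annotations (entities and triggers)
    and events as {id: {"type", "trigger", "args": [(role, target_id), ...]}}.
    """
    spans, events = {}, {}
    for line in (a1_text + "\n" + a2_text).splitlines():
        line = line.strip()
        if line.startswith("T"):                 # text-bound: "T1<TAB>Protein 0 5<TAB>Bmi-1"
            tid, meta, text = line.split("\t")
            spans[tid] = {"type": meta.split(" ")[0], "text": text}
        elif line.startswith("E"):               # event: "E1<TAB>Pos_reg:T3 Theme:T1 Cause:E2"
            eid, body = line.split("\t")
            parts = body.split(" ")
            etype, trigger = parts[0].split(":")
            args = [tuple(p.split(":")) for p in parts[1:] if p]
            events[eid] = {"type": etype, "trigger": trigger, "args": args}
    return spans, events

# Tiny usage example with hypothetical annotations.
a1 = "T1\tGene_or_gene_product 0 5\tBmi-1"
a2 = "T2\tGene_expression 6 16\texpression\nE1\tGene_expression:T2 Theme:T1"
spans, events = parse_standoff(a1, a2)
print(events["E1"]["args"])   # [('Theme', 'T1')]
```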
Given the large number of records, we build a sample of ≈1000 events from the training set of each dataset using stratified random sampling (without replacement) to retain the statistical information of the population. We stratify on multiple variables, namely the event type (the main one in case of nesting, i.e., the event graph root) and the node number (discretized in a range distribution, i.e., 2, 3, 4, 5, 6, >7), keeping their original proportions. We remove event types represented by less than ten instances due to highly unbalanced data. To experiment with mapping events from different sources towards a single shared space, we create an artificial dataset—hereafter referred to as “bio_all”—resulting from the combination of the previous nine (additionally stratifying on the source). The selected sample dimension is comparable to or greater than that of other chemo/bio-informatics graph classification datasets, like D&D [] (1178), PTC-MR [] (344), ENZYMES [] (600), and MUTAG [] (188). The reader should be aware that the test sets of the BioNLP-ST corpora are generally not provided, leading users to upload their predictions to the task organizers’ servers. Due to the unavailability of these data, we refer only to the event instances included in the training sets. A concise summary and detailed visualization of the final datasets can be found in Table 1 and Figure A1, respectively. The restricted size of event graphs is offset by the high diversity of classes (17, on average, compared to the 2–6 labels of conventional biomedical GRL benchmarks []).
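The stratified sampling described above can be reproduced, for instance, with pandas; the DataFrame column names (event_type, n_nodes) are assumptions, and the budget is allocated proportionally over the joint (event type, size bin) strata.

```python
import pandas as pd

def stratified_sample(df: pd.DataFrame, n_target: int = 1000,
                      min_per_type: int = 10, seed: int = 42) -> pd.DataFrame:
    """Sample event graphs keeping the joint distribution of event type and node-count bin."""
    # Drop event types with fewer than `min_per_type` instances (highly unbalanced data).
    df = df.groupby("event_type").filter(lambda g: len(g) >= min_per_type)
    # Discretize graph size into bins (2, 3, 4, 5, 6, and 7-or-more nodes).
    df = df.assign(size_bin=df["n_nodes"].clip(upper=7))
    frac = min(1.0, n_target / len(df))
    # Proportional allocation per (event_type, size_bin) stratum, without replacement.
    return (df.groupby(["event_type", "size_bin"], group_keys=False)
              .apply(lambda g: g.sample(frac=frac, random_state=seed)))
```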

4.2. Deep Divergence Event Graph Kernels

This section describes how DDEGK works. It recalls the original DDGK method and highlights our changes, including a new loss function and specific variations to handle biomedical events.
Table 1. Descriptive statistics for sampled datasets. The first column lists the examined datasets. The second column indicates the number of event graph instances (nested events count one), while the third and fourth macro-columns detail the size of events in terms of nodes and edges. Finally, the last macro-column refers to the distinct number of types for events (triggers), entities, and argument roles in each dataset, thus marking the number of possible classes for graphs, nodes, and edges.
Dataset | # Graphs | # Nodes (Min) | # Nodes (Mean) | # Nodes (Max) | # Edges (Mean) | # Labels (Graph) | # Labels (Node) | # Labels (Edge)
ST09 | 1007 | 2 | 4 | 14 | 3 | 9 | 11 | 3
GE11 | 1001 | 2 | 4 | 14 | 3 | 9 | 11 | 2
EPI11 | 1002 | 2 | 3 | 6 | 2 | 10 | 11 | 1
ID11 | 1001 | 2 | 3 | 14 | 2 | 9 | 16 | 3
MLEE | 1012 | 2 | 4 | 15 | 3 | 15 | 39 | 8
GE13 | 1011 | 2 | 3 | 13 | 2 | 8 | 13 | 2
CG13 | 1033 | 2 | 3 | 13 | 2 | 23 | 50 | 8
PC13 | 1020 | 2 | 4 | 18 | 3 | 15 | 25 | 9
GRO13 | 1006 | 2 | 3 | 5 | 2 | 18 | 140 | 4
BIO_ALL | 1216 | 2 | 3 | 14 | 3 | 53 | 101 | 12

4.2.1. Problem Definition

Given a family of $N$ event graphs $\mathcal{T} = \{G_1, G_2, \ldots, G_N\}$, our goal is building an embedding for each event graph $G_i$ (which we call target event graph), based on its distance from a set of $M$ anchor (or source) event graphs ($\mathcal{A}$). To accomplish this, we aim to learn a graph kernel function, defined as the following:
$$k(G_1, G_2) = \| \Psi(G_1) - \Psi(G_2) \|_2 , \qquad (1)$$
where the representation $\Psi(G_T) \in \mathbb{R}^M$. In particular, for any member of $\mathcal{T}$, we define the $i$th dimension of the representation to be:
$$\Psi(G)_i = \sum_{v_j \in V(G)} f_{a_i}(v_j) , \qquad (2)$$
where $a_i \in \mathcal{A}$ and $f_{a_i}(\cdot)$ is a predictor of some structural and semantic properties of the graph $G$ parameterized by the graph $a_i$. The anchor and target graph sets ($\mathcal{A}$, $\mathcal{T}$) could be disjoint, overlapping, or equal.
In other words, we propose to learn the representation of an event graph by comparing it against a population of other event graphs taken as a reference and forming the basis of our target vector space (Figure 6).

4.2.2. Event Graph Representation Alignment

To define the similarity between two event graphs in a ⟨target, anchor⟩ pair (Equation (2)), we rely on deep neural networks. We train an encoder model to learn the structure of the anchor event graph. The resulting model is then used to predict the structure of the target event, allowing divergence measurement. If the pair is similar, it is natural to expect the anchor encoder to correctly predict the structure of the target event graph. However, two event graphs may not share vertex ids and may differ in size, even though they express a similar concept unit. Likewise, nodes of structurally equivalent graphs can have completely different attributes. For this reason, we use a cross-graph attention mechanism to learn a soft alignment between the nodes of the target event graph and the anchor one.
Below, we illustrate the functioning of the various components.
Anchor Event Graph Encoder. The quality of the graph representation depends on the extent to which each encoder is able to discover the structure of its anchor. Therefore, the role of the encoder is to reconstruct such structure given partial or distorted information. Analogously to Al-Rfou et al. [], we choose a Node-To-Edges setup, where the encoder is trained to predict the neighbors of a single vertex in input. Although other strategies are undoubtedly possible, it greatly matches event graph characteristics while remaining efficient and straightforward to process. By modeling the problem as a multi-label classification task, we maximize the following objective function:
$$J(\theta) = \sum_{i,j \,:\, e_{ij} \in E} \log \Pr(v_j \mid v_i, \theta) . \qquad (3)$$
First, each vertex $v_i$ in the graph is represented by a one-hot encoding vector $\mathbf{v}_i$. Second, we multiply the encoding vector with a linear layer $E \in \mathbb{R}^{|V| \times d}$, resulting in an embedded vertex $e_{v_i} \in \mathbb{R}^d$, where $d$ is the size of the embedding space. Third, the embedding $e_{v_i}$ (feature set) thus obtained is passed to a fully connected deep neural network (DNN), which produces scores for each vertex in $V$ (i.e., an output layer of size $|V|$). Finally, the scores are passed through the sigmoid function to generate the final predictions.
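A minimal Keras sketch of such a Node-to-Edges encoder is given below; layer sizes and hyperparameters are placeholders, and the binary cross-entropy on adjacency rows plays the role of the multi-label objective in Equation (3).

```python
import tensorflow as tf

def build_anchor_encoder(num_nodes: int, emb_dim: int = 4, hidden: int = 16, layers: int = 3):
    """Node-to-Edges encoder: one-hot node id -> embedding -> DNN -> neighbor scores."""
    node_id = tf.keras.Input(shape=(num_nodes,))                   # one-hot vector for v_i
    h = tf.keras.layers.Dense(emb_dim, use_bias=False)(node_id)    # linear embedding layer E
    for _ in range(layers):
        h = tf.keras.layers.Dense(hidden, activation="relu")(h)
    # One sigmoid score per vertex: multi-label prediction of the neighbors of v_i.
    neighbors = tf.keras.layers.Dense(num_nodes, activation="sigmoid")(h)
    model = tf.keras.Model(node_id, neighbors)
    # Maximizing the sum of log Pr(v_j | v_i) == minimizing binary cross-entropy on adjacency rows.
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

# Usage sketch: identity matrix as inputs, adjacency rows as multi-label targets.
# adj = ...  # |V| x |V| adjacency matrix of the anchor event graph
# enc = build_anchor_encoder(adj.shape[0])
# enc.fit(tf.eye(adj.shape[0]), adj, epochs=300, verbose=0)
```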
Figure 6. Overview of DDEGK. Structural and semantic divergence scores from a set of anchor event graphs are used to compose the vector representation of a target event graph. Divergence is measured through pre-trained Node-to-Edges encoder models, one for each anchor.
Cross-Graph Attention. To compare pairs of event graphs that may differ in size (node sets) and structure (edge sets), we need to learn an alignment between them. For this to happen, we employ an attention mechanism encoding a relaxed notion of graph isomorphism. This isomorphism attention bidirectionally aligns the nodes of a target graph against those of an anchor graph, ideally operating in the absence of a direct node mapping and drawing not necessarily one-to-one correspondences. It requires two separate attention networks. The first network, denoted as $M_{T \to A}$, allows nodes in the target graph ($G_T \in \mathcal{T}$) to attend to the most structurally similar nodes in the anchor graph ($G_A \in \mathcal{A}$). Specifically, it assigns every node $t_i \in V(G_T)$ a probability distribution (softmax function) over the nodes $a_j \in V(G_A)$, i.e., a 1:N mapping. On the implementation level, we use a multiclass classifier:
$$\Pr(a_j \mid t_i) = \frac{e^{M_{T \to A}(t_i, a_j)}}{\sum_{a_k \in V(G_A)} e^{M_{T \to A}(t_i, a_k)}} \qquad (4)$$
The second network, denoted as $M_{A \to T}$, is a reverse attention network which maps a neighborhood in the anchor graph to a neighborhood in the target graph, i.e., an N:N mapping. We implement it as a multi-label classifier with a sigmoid function:
$$\Pr(t_j \mid N(a_i)) = \frac{1}{1 + e^{-M_{A \to T}(N(a_i), t_j)}} \qquad (5)$$
By wrapping the anchor event graph encoder with both attention networks, we are able to predict the neighbors of each node within a target graph—but utilizing the structure of the source graph. Such isomorphism attention allows for capturing higher-order structure between graphs, going beyond the immediate neighborhoods.
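The two attention networks can be sketched as simple dense mappings between the two node sets; this is a schematic reading of Equations (4) and (5), with names and shapes chosen purely for illustration.

```python
import tensorflow as tf

def build_attention_nets(n_target: int, n_anchor: int):
    """M_TA: target node -> softmax over anchor nodes (1:N alignment).
       M_AT: scores over an anchor neighborhood -> sigmoid over target nodes (N:N alignment)."""
    m_ta = tf.keras.Sequential([
        tf.keras.layers.Dense(n_anchor, activation="softmax",
                              input_shape=(n_target,)),   # Pr(a_j | t_i), Equation (4)
    ])
    m_at = tf.keras.Sequential([
        tf.keras.layers.Dense(n_target, activation="sigmoid",
                              input_shape=(n_anchor,)),   # Pr(t_j | N(a_i)), Equation (5)
    ])
    return m_ta, m_at

# Wrapping order: t_i --M_TA--> anchor node distribution --anchor encoder--> anchor neighbor
# scores --M_AT--> predicted neighbors of t_i in the target graph.
```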
Attributes Consistency. As we know, event graphs are not defined only by their structure but also by the attributes of their nodes and edges. Accordingly, to learn an alignment that preserves semantics, we add regularizing losses to the attention and reverse-attention networks. Compared to DDGK, we add support for continuous attributes, indispensable in the event domain.
As for the node type $\tau_v(\cdot)$, we estimate a probability distribution of the discrete attributes over the target graph based on the learned attention scores, defined as follows:
$$Q_{\tau_v}(y_i \mid t_j) = \sum_{k} M_{T \to A}(y_i \mid a_k) \Pr(a_k \mid t_j) , \qquad (6)$$
where $M_{T \to A}(y_i \mid a_k)$ simply checks whether the predicted anchor nodes have label $y_i$ or not (1 or 0). To define the attention regularizing loss over the node types, we adopt the average cross-entropy loss with the aim of measuring the difference between the observed distributions of discrete attributes $\Pr(y_i \mid t_j)$ and the inferred ones $Q_{\tau_v}(y_i \mid t_j)$:
$$\mathcal{L}_{\tau_v} = -\frac{1}{|V(G_T)|} \sum_{j}^{|V(G_T)|} \sum_{i} \Pr(y_i \mid t_j) \log\big(Q_{\tau_v}(y_i \mid t_j)\big) \qquad (7)$$
As for the edge type $\tau_e(\cdot)$, we set $Q_{\tau_e}(y_i \mid t_j)$ to the normalized attribute count over all edges connected to node $t_j$. For instance, if a node $t_j$ has three edges, with two of them labeled as “Theme” and the other as “Cause”, $Q_{\tau_e}(\mathrm{Theme} \mid t_j) = 0.67$. By replacing $Q_{\tau_v}$ with $Q_{\tau_e}$ in Equations (6) and (7), we create a regularization loss for edge discrete attributes.
As for the node text $\gamma_v(\cdot)$, we minimize the following loss function, a mean reduction of the cosine distance between the embeddings of the textual labels of the nodes in the target and anchor graphs, weighted by their alignment score:
$$\mathcal{L}_{\gamma_v} = \frac{1}{|V(G_T)| \times |V(G_A)|} \sum_{j}^{|V(G_T)|} \sum_{i}^{|V(G_A)|} \mathrm{cos\_dist}\big(\mathrm{emb}(\gamma_v(t_j)), \mathrm{emb}(\gamma_v(a_i))\big) \cdot \Pr(a_i \mid t_j) . \qquad (8)$$
In this study, we use SciBERT’s pre-trained embeddings []. The key idea is to exploit the linguistic and domain knowledge learned by large biomedical language models, otherwise not included in the event graph.
Inspired by DeepEventMine (state-of-the-art in biomedical event extraction) [], we implement the $\mathrm{emb}(\cdot)$ function with the following representation $m_{f,l}$, indicating a text span from the first word $f$ to the last one $l$:
$$m_{f,l} = \left[ v_{f,1} \; ; \; \frac{\sum_{i=f}^{l} \sum_{j=1}^{s_i} v_{i,j}}{\sum_{i=f}^{l} s_i} \; ; \; v_{l,s_l} \right] , \qquad (9)$$
where $[\,\cdot\,;\,\cdot\,;\,\cdot\,]$ denotes concatenation, while $v_{i,j}$ denotes the $j$th sub-word representation in the $i$th word. More specifically, $v_{f,1}$ is the first sub-word representation of the $f$th word, while $v_{l,s_l}$ is the last sub-word representation of the $l$th word.
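A small sketch of this span representation, assuming a matrix of SciBERT sub-word vectors and a list mapping each word to its sub-word rows (both hypothetical inputs):

```python
import numpy as np

def span_embedding(subword_vecs: np.ndarray, word_to_subwords: list, f: int, l: int) -> np.ndarray:
    """m_{f,l} = [ v_{f,1} ; mean of all sub-word vectors in words f..l ; v_{l,s_l} ]."""
    first = subword_vecs[word_to_subwords[f][0]]             # first sub-word of the first word
    last = subword_vecs[word_to_subwords[l][-1]]             # last sub-word of the last word
    span_rows = [i for w in range(f, l + 1) for i in word_to_subwords[w]]
    mean = subword_vecs[span_rows].mean(axis=0)              # average over all sub-words in the span
    return np.concatenate([first, mean, last])               # concatenation [ . ; . ; . ]
```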
The presented regularization losses are also introduced for reverse attention networks. In this case, the distribution of attributes refers to the node’s neighborhood, and the node’s neighborhood edges are those appearing at 2-hops distance from the node.
Figure 7 shows a vivid example of augmented target graph encoder with cross-graph attention.
Figure 7. Anchor-based target event graph encoder for divergence prediction. Attention layers map the target event graph nodes onto the anchor graph, being aware of node and edge attributes. The first attention network ($M_{T \to A}$) receives a one-hot encoding vector representing a node ($t_i$) in the target graph and maps it onto the most structurally and semantically similar node ($a_j$) in the anchor graph. The anchor event graph encoder then predicts the neighbors of $a_j$, $N(a_j)$. Finally, the reverse attention network ($M_{A \to T}$) takes $N(a_j)$ and maps them to the neighbors of $t_i$, $N(t_i)$.

4.2.3. Event Graph Divergence and Embedding

Here, we propose using the augmented encoder to define a measure of dissimilarity between a target event graph and an anchor one, from which we derive a graph kernel function (Equations (1) and (2)). In other words, the distance measure is learned from the data (distance metric learning) in an unsupervised way and then converted into a kernel.
We note that learning a metric by measuring the divergence score between a pair of graphs $G_A$ and $G_T$ is possible. We refer to the encoder trained on an anchor graph $G_A$ as $H_{G_A}$ and the divergence score given to the target graph $G_T$ as:
$$D'(G_T \,\|\, G_A) = -\sum_{v_i \in V(G_T)} \; \sum_{j \,:\, e_{ij} \in E(G_T)} \log \Pr(v_j \mid v_i, H_{G_A}) . \qquad (10)$$
This operation is equivalent to multiplying the probabilities of correct prediction of the neighbors of each target node using the anchor event graph encoder wrapped with semantic and structural alignments. If two graphs are semantically and structurally similar, we expect their divergence to be correspondingly low. Given that $H_{G_A}$ is not a perfect predictor of the structure of $G_A$, we can safely assume that $D'(G_A \,\|\, G_A) \neq 0$. To rectify this problem and ensure identity, we define:
$$D(G_A \,\|\, G_T) = D'(G_A \,\|\, G_T) - D'(G_A \,\|\, G_A) , \qquad (11)$$
which sets $D(G_A \,\|\, G_A)$ to zero. However, this definition is not symmetric, since $D(G_T \,\|\, G_A)$ might not necessarily equal $D(G_A \,\|\, G_T)$. If symmetry is required, we can use:
$$D(G_A, G_T) = D(G_A \,\|\, G_T) + D(G_T \,\|\, G_A) . \qquad (12)$$
A recent work, called D2KE (distances to kernels and embeddings) [], proposes a general methodology for deriving a positive-definite kernel from any given distance function $d : \mathcal{G} \times \mathcal{G} \to \mathbb{R}$. We adopt D2KE to develop an unsupervised neural network model for learning graph-level embeddings based on the architecture illustrated so far. We select a subset of training samples as a held-out representative set and use the distances to the points in this set as the feature function. In effect, we establish a vector space where each dimension corresponds to one event graph in the anchor set. Target event graphs are represented as points in this vector space:
$$\Psi(G_T) = \big[ D(G_T \,\|\, G_{A_0}), D(G_T \,\|\, G_{A_1}), \ldots, D(G_T \,\|\, G_{A_M}) \big] . \qquad (13)$$
To create a kernel out of our event graph embeddings, we use the Euclidean distance measure as outlined in Equation (2). In essence, our graph kernel identifies topological and semantic associations between individual event nodes.
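Putting the pieces together, the embedding of a target graph is the vector of baseline-corrected divergences from each anchor, and the kernel is the Euclidean distance between such vectors; the sketch below assumes a divergence(target, anchor) function implementing Equation (10) with the trained, attention-augmented encoders.

```python
import numpy as np

def embed(target, anchors, divergence) -> np.ndarray:
    """Psi(G_T): one dimension per anchor graph, as in Equations (11) and (13)."""
    return np.array([divergence(target, a) - divergence(a, a) for a in anchors])

def kernel(psi_1: np.ndarray, psi_2: np.ndarray) -> float:
    """Graph kernel of Equation (1): Euclidean distance between two embeddings."""
    return float(np.linalg.norm(psi_1 - psi_2))
```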

4.2.4. Training

We extend the original implementation using TensorFlow and the Adam optimizer. Specifically, we train $M$ anchor graph encoders; then, we freeze their parameters and add the two attention networks for target–anchor mapping, together with the regularizing losses that preserve semantics. Finally, the augmented encoders are trained on each target event graph in input, represented as an adjacency matrix accompanied by an additional matrix for each attribute type ($\tau_v$, $\tau_e$, and $\gamma_v$).

4.2.5. Scalability

As is typical of kernel approaches, our method relies on pairwise similarity and requires $N \times M$ computations for scoring each of the $N$ target event graphs against each of the $M$ anchor event graphs (i.e., quadratic time complexity in terms of the number and size of the graphs). However, in Section 5, we demonstrate that it is not necessary to compute the entire graph kernel matrix to achieve high performance. Best results were generally obtained using only 64 anchors (embedding dimensions) out of a population with more than 1000 biomedical events. Following Al-Rfou et al. [], the total computation cost for training the anchor graph encoders and the attention-augmented target ones can be approximated as $\Theta(N \times M \times T \times (V \times d + k \times d^2))$, where $M$ can be much lower than $N$. In this formulation, $T$ is the maximum between the number of encoding and scoring epochs, $V$ is the average number of nodes, $d$ is the embedding and hidden layer size, and $k$ is the maximum between the number of encoder hidden layers and attention ones.
The embeddings of event graphs in a large database can be precomputed and indexed, enabling efficient retrieval with fast nearest neighbor search data structures like locality sensitive hashing [].
If graph populations were to change over time, for example, new event graphs expressed in the biomedical literature, it would not be necessary to re-calculate the embeddings but only to estimate the divergence between the latest events and the anchor ones (using the previously trained encoders).
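For example, precomputed embeddings can be indexed for fast retrieval; the sketch below uses scikit-learn's exact nearest-neighbor search as a simple stand-in for the locality-sensitive-hashing index mentioned above, with placeholder data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# embeddings: (num_events, num_anchors) matrix of precomputed DDEGK vectors (placeholder data here).
embeddings = np.random.rand(1000, 64)
index = NearestNeighbors(n_neighbors=5, metric="euclidean").fit(embeddings)

# A new event only needs its divergences against the already-trained anchor encoders.
new_event_vec = np.random.rand(1, 64)          # placeholder query embedding
distances, ids = index.kneighbors(new_event_vec)
print(ids[0])                                  # indices of the five most similar stored events
```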

4.3. Hardware and Software Setup

All the experiments were conducted on a server with a Titan Xp GPU with 12 GB of dedicated memory, 4 CPU cores (Intel i5-6400 2.70 GHz processor), and 24 GB of RAM, running Ubuntu 16.04.6 LTS. For minor tasks and tests, we moved to Google Colab.

5. Experiments

Once the event embeddings are computed using DDEGK, they could be leveraged for numerous downstream graph analytics tasks, such as classification, clustering, and similarity search. In this section, we perform a series of quantitative and qualitative experiments to demonstrate the effectiveness of our method compared to the others. First, we show that event graph embeddings learned by DDEGK represent a sufficient feature set to predict the biological area and fine-grained type of a large number of scientific interactions. Then, we prove that representations of events similar in structure and semantics are correctly grouped, assessing the geometrical quality of the kernel space through clustering indices and visualization techniques. Moreover, we present the advantage of event representations over traditional sentence embeddings in the semantic textual similarity (STS) task. Finally, we underline the benefits of cross-graph attention for interpretability purposes.

5.1. Event Graph Classification

Our learned event representations respect both structure and discrete/continuous attributes. So, they can be used for graph classification tasks where the graph structure, node attributes, and edge attributes convey meaning or function. To demonstrate this, we use DDEGK representations of the event datasets presented in Section 4.1 as features for graph label predictions, learning a mapping function between event embeddings and classes. Specifically, we evaluate the meaningfulness of the learned representations with two classification tasks: (i) predicting the principal event type of each graph and, therefore, the nature of the expressed interaction; (ii) predicting the biological area of each graph, which is roughly represented by the belonging dataset. The second point refers to the “bio_all” shared space (events with heterogeneous datasets).

5.1.1. Baseline Methods

We compare the performance of DDEGK against a variety of graph embedding models—both supervised and unsupervised—that preserve different event graph properties. These models are representative of the main types of embedding techniques and can be categorized into four groups.
  • Node embedding flat pooling. Each event is represented as the unsupervised aggregation of its constituent node vectors. We experiment with multiple unweighted flat pooling strategies, namely, mean, sum, and max. As for node representations, we examine (i) contextualized word embeddings from large-scale language models pre-trained on scientific and biomedical texts, (ii) node2vec []. The first point is realized by applying SciBERT [] and BioBERT [] (768 embedding size) on trigger and entity text spans: a common approach in the event-GRL area []. It condenses the semantic gist of an event based on the involved entities; it totally ignores structure and argument roles. In contrast, node2vec is a baseline for sequential methods which efficiently trade off between different proximity levels. The default walk length is 80, the number of walks per node is 10, return and in-out hyper-parameters are 1, and embedding size is 128.
  • Node embedding + CNN. We use node2vec-PCA [] (d = 2, i.e., one channel), which composes graph matrices from node2vec and then applies a CNN for supervised classification.
  • Whole-graph embedding. We use graph2vec [] to generate unsupervised structure-aware graph-level representations for our biomedical events. We work on labeled graphs, with labels denoting numerical identifiers for event types and entities. Default embedding size is 128, and the number of epochs is 100.
  • GNN + supervised pooling. We use DGCNN [], an end-to-end graph classification model composed of GCNs with a sort pooling layer to derive permutation-invariant graph embeddings. A 1D-CNN and a fully connected layer then extract features for classification. The default k (normalized graph size) is 35.
During the experiments, we used the default hyper-parameter setting suggested by the authors and detailed above.

5.1.2. Hyperparameters Search

We first obtain the embeddings of all the event graphs and then feed them to an SVM as a kernel classifier (except for DGCNN, since it is a supervised algorithm). We split event samples into train, validation, and test sets to avoid overfitting. To choose the DDEGK hyperparameters (Table 2), we perform grid searches for each dataset. At this juncture, we empirically note that, on the datasets under examination, the best results on average are obtained with a semantic–structure ratio of about 20:1, weighting the contribution of each label equally. For SVM, we use the scikit-learn implementation and 10-fold cross validation. We vary the kernel between {linear, rbf, poly, sigmoid} and the regularization coefficient C between $10$ and $10^9$. We choose the combination of DDEGK and classifier hyperparameters that maximizes the accuracy on the dev set. Furthermore, we experiment with two different anchor graph choices: (i) a random subset of the original graph set, and (ii) a version balanced with respect to the event types. The second is meant to produce more interpretable vectors, where each event is represented as a mixture of its types (similar to topic modeling []). To keep computation tractable, the tested dimensions during sampling are 32, 64, and 128.
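A condensed version of this selection loop is sketched below, assuming the DDEGK embeddings and gold labels are already available as arrays; the kernel and C ranges only loosely follow the grid described above.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def tune_svm(X_train, y_train):
    """Grid-search an SVM over the DDEGK embeddings (kernel and C ranges are illustrative)."""
    param_grid = {
        "kernel": ["linear", "rbf", "poly", "sigmoid"],
        "C": [10, 1e3, 1e5, 1e7, 1e9],
    }
    search = GridSearchCV(SVC(), param_grid, cv=10, scoring="f1_weighted")
    search.fit(X_train, y_train)
    return search.best_params_, search.best_score_
```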
Table 2. Hyperparameter values tested during our grid search for DDEGK graph representations.

5.1.3. Results

Graph classification results are shown in Table 3. We measure test accuracy with the F1-score, i.e., the harmonic mean of precision and recall (Equation (14)):
$$F_1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (14)$$
Table 3. Average accuracy (support-weighted F1-score) in ten-fold cross validation on event type and dataset classification tasks. Methods are grouped by their approach and level of supervision during graph representation learning. The highest F1-score for each dataset is shown in bold.
We see that DDEGK performs surprisingly well for an unsupervised method with no engineered features. Our model significantly outperforms the baselines on principal event type classification tasks, achieving higher average accuracy in each scenario (i.e., absolute margins of 22.46 and 6.86 over the best unsupervised and supervised baseline, respectively). For dataset (biomedical area) classification, it also achieves competitive results, only being exceeded by DGCNN (supervised). The performance gap between purely semantic and structural approaches (such as pooling on word embedding compared to node2vec and node2vec-PCA) clearly exhibits the pivotal role of node/edge labels within event modeling. We perceive graph2vec as a compromise between the two. Therefore, the use of auxiliary information seems to be the main driver of classification quality. The average on the embeddings generated by BERT-based models turns out to be the most effective pooling solution, with SciBERT slightly more solid than BioBERT. Considering both node connectivity and node/edge semantics, the deep gap between our solution and SciBERT AVG confirms the validity of a hybrid embedding strategy. Performance tends to decrease in proportion to graph size and label variety. Expectedly, node2vec proves to be highly ineffective on GRO13 due to the low structural variance and limited contextual information. The partial annotation overlap between datasets [] justifies the lower scores within event spaces having heterogeneous sources. The two strategies for choosing anchor events are comparable in terms of performance.

5.2. Between-Graph Clustering

Graph clustering is particularly useful for discovering communities and has many practical applications, such as grouping proteins with similar properties. In our context, it allows the automatic identification of similar events mentioned in the biomedical literature, categorizing them and allowing their quantification (e.g., how many times a specific protein–symptom interaction has been reported). We compare graph embedding methods on biomedical events by examining their clustering quality to understand the global structure of the encoding spaces quantitatively. Specifically, we verify whether instances sharing the same type or biomedical area of origin are located close to each other in the vector space. Since we treat whole-graph embeddings, we perform between-graph clustering: starting from a large number of graphs, we attempt to cluster them (not their components) based on underlying structural and semantic behaviors. In contrast with conventional within-graph clustering, where the goal is partitioning the vertices of single graphs [], our objective is considerably more challenging because of the need to match substructures. This setting is reminiscent of the class of problems to which DL solutions typically apply.
Our evaluation contemplates two metrics: silhouette score (SS) and adjusted rand index (ARI). The first measures how similar an event is to its own cluster (cohesion, i.e., intra-cluster distance) compared to the others (separation, i.e., nearest-cluster distance). The second computes the corrected-for-chance similarity between two event clusterings, counting pairs that are assigned in the same or different clusters in the predicted and true clusterings. Both lie in the range $[-1, 1]$, where high values indicate better results. In our study, we use the Euclidean distance. Rand index calculation is performed considering the best-predicted clustering upon ten consecutive runs of the K-means algorithm (scikit-learn implementation, with $k = \#event\_types$ or $k = \#datasets$). Experimental results are presented in Table 4.
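These metrics can be computed with scikit-learn as sketched below, keeping the best of ten K-means runs as in the protocol above; the function signature is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

def evaluate_clustering(X: np.ndarray, y_true, n_runs: int = 10):
    """Between-graph clustering quality of an embedding space X with gold labels y_true."""
    k = len(set(y_true))                                   # k = #event_types or #datasets
    best_ari = -1.0
    for seed in range(n_runs):                             # keep the best of ten K-means runs
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        best_ari = max(best_ari, adjusted_rand_score(y_true, labels))
    ss = silhouette_score(X, y_true, metric="euclidean")   # cohesion vs. separation of gold classes
    return ss, best_ari
```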

5.2.1. Results

At the outset, we can see that our solution performs the best on both metrics. For ease of understanding, we report percentage values below. As for ARI, DDEGK surpasses SciBERT, node2vec, and graph2vec by more than 10.7%, 14.3%, and 11.5% on average, respectively. As for SS, the gap settles at 3.4%, 23.3%, and 3.8%. This reinforces the findings inferred from the classification experiments; the scores reached by the models on the clustering task are in line with those of classification. Significantly, clustering makes the impact deriving from a different anchor choice evident. The selection of random anchors per event type leads to a space with better geometric characteristics but less effective clustering than the entirely random counterpart.

5.2.2. Influence of Embedding Dimensions

In practice, finding the optimal embedding dimension is not easy []. A longer embedding tends to preserve more information about the original event graph at the cost of higher storage requirements and computation time, but it also risks retaining noise. On the other hand, a lower-dimensional representation is more resource-efficient but risks losing critical information about the input graph, with a significant performance drop. Researchers need to make a trade-off based on the requirements; GRL articles usually report that an embedding size between 128 and 256 is sufficient for most tasks. We study the effect of sub-sampling the dimensions of our embedding space on the quality of event graph classification and clustering.
Table 4. Between-graph clustering results according to the silhouette score (SS rows) and adjusted rand index (ARI rows). The best results for each dataset are shown in bold.
Columns ST09 through AVG refer to event type clustering; the last column refers to dataset clustering.
Method (metric) | ST09 | GE11 | EPI11 | ID11 | MLEE | GE13 | CG13 | PC13 | GRO13 | BIO_ALL | AVG | Dataset Clustering
SciBERT (AVG) (SS) | −0.00915 | −0.00940 | 0.04082 | 0.01347 | −0.01139 | −0.00784 | 0.00087 | −0.01251 | −0.00343 | **−0.01974** | −0.00183 | −0.01123
SciBERT (AVG) (ARI) | 0.03638 | 0.06641 | 0.06644 | 0.09005 | 0.05810 | 0.04079 | 0.06644 | 0.04022 | 0.07669 | 0.04075 | 0.05823 | 0.05373
node2vec (AVG) (SS) | −0.48715 | −0.48165 | −0.58339 | −0.58004 | −0.46614 | −0.57659 | −0.51334 | −0.49366 | −0.54964 | −0.58171 | −0.53133 | −0.27637
node2vec (AVG) (ARI) | −0.02987 | −0.02539 | −0.00234 | −0.02405 | −0.05510 | −0.03748 | −0.04139 | −0.03689 | −0.03148 | −0.03830 | −0.03223 | 0.00068
graph2vec (SS) | −0.25687 | −0.20579 | −0.14446 | −0.27879 | −0.35015 | −0.28666 | −0.39072 | −0.36219 | −0.41060 | −0.45318 | −0.31394 | −0.16632
graph2vec (ARI) | 0.03789 | 0.04679 | 0.09058 | 0.14508 | 0.03394 | 0.07483 | 0.02883 | 0.02123 | 0.01443 | 0.02412 | 0.05177 | 0.02761
DDEGK (ours) w/ random anchors (SS) | 0.21371 | 0.17281 | 0.27985 | 0.09267 | 0.08572 | 0.22813 | −0.05122 | −0.04587 | 0.05914 | −0.32140 | 0.07108 | **0.00686**
DDEGK (ours) w/ random anchors (ARI) | 0.33177 | 0.34138 | **0.54129** | **0.44621** | **0.16441** | **0.51622** | **0.17242** | **0.13066** | **0.39070** | **0.14349** | **0.31786** | 0.14349
DDEGK (ours) w/ random anchors per type (SS) | **0.23428** | **0.24598** | **0.28723** | **0.15141** | **0.10136** | **0.31270** | **0.08123** | **0.01809** | **0.08047** | −0.29036 | **0.12224** | 0.00007
DDEGK (ours) w/ random anchors per type (ARI) | **0.35625** | **0.38622** | 0.39765 | 0.29661 | 0.15441 | 0.43387 | 0.15185 | 0.09296 | 0.23732 | 0.08320 | 0.25903 | **0.28101**
By performing multiple tests on three bioinformatics datasets with an average of 570 instances, the authors of DDGK [] achieved stable and competitive results using less than 20% of the graphs as anchors. To complement these findings, we explore computational complexity and scalability in much more detail. We construct different anchor graph sets from the original ones, adopting the anchor selection strategies listed in Section 5.1.2. For each configuration, we learn divergence scores against all target graphs and use the reduced embeddings as features to predict graph categories and clusters. Results are summarized in Figure 8.
Figure 8. Effect of sub-sampling source graphs on event graph classification and clustering tasks above each biomedical dataset (colored line). We vary the number of source graphs between 32, 64, and 128, employing two different anchor strategies (marks).
We observe that classification scores on individual datasets remain substantially unchanged as dimensions vary with a random anchor choice. There is only about a 2% drop in performance when the embedding dimension drops from 128 to 32. On average, the highest F1-score for event type classification is obtained with an embedding size of 64, meaning that a high number of anchors is not required. So, very few dimensions are needed to obtain the final results detailed in Section 5.1 (far fewer than those expected by the baselines). A higher number of dimensions is instead necessary for the more complex task of classifying heterogeneous event graphs. In this case, performance grows in step with embedding dimensionality, as more detailed representations capture more input nuances. Unlike a fully random choice, anchors balanced with respect to the event type tend to perform worse with increasing dimensions, sometimes also experiencing substantial declines, as in the case of CG13 and ID11. Relating the classification and clustering results helps in understanding the overall graph embedding quality and the correct hyperparameter setting. For example, DDEGK achieves high classification accuracy on PC13, but its ARI is worse; on EPI11, despite the classification being perfect with all dimensions, the encoding space at 64 appears better at the clustering level.

5.3. Visualization

Embedding visualization is helpful for qualitatively assessing how well DDEGK learns from the event graph structure and node/edge attributes. Since our representations capture events' similarities by vicinity in the Euclidean space, the performance of event graph embeddings can be visually evaluated by the aggregation degree of labeled events (colorful dots). Ideally, vectors for event graphs in the same class should aggregate tightly in a local region of the embedding space. Meanwhile, event vectors belonging to different classes should be as separable as possible to bolster downstream machine learning models. Hence, we conduct an intuitive visual inspection of how biomedical event graphs are grouped in the t-SNE 2D projection. For space reasons, we only show the plot of the two-dimensional manifold associated with EPI11, the best-encoded space in terms of clustering indices (Figure 9). DDEGK embeds event graphs that have similar types nearby in the embedding space, giving nicely distinguishable clusters. Notice that there may be different clusters for the same event type, oftentimes due to different structures or attribute sets. An event type within a schema can specify the cardinality of the expected participants, whose instances may sometimes be optional or multiple. It follows that events of the same type may have a very varied structure.
Figure 9. Embeddings of 1002 biomedical events from the EPI11 dataset, projected into a 2D space by t-SNE. Event graphs belonging to the same type are embedded closer to each other. In the case of structural or semantic differences, there can be multiple clusters for the same event type, as qualitatively shown for Methylation.
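The projection can be reproduced along the following lines; the t-SNE settings (e.g., perplexity) are not specified in the paper and are left at their defaults here.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

def plot_embeddings(X: np.ndarray, y, title: str = "t-SNE projection of DDEGK event embeddings"):
    """Project event embeddings to 2D and color the dots by principal event type."""
    coords = TSNE(n_components=2, metric="euclidean", random_state=0).fit_transform(X)
    y = np.asarray(y)
    for event_type in sorted(set(y)):
        mask = y == event_type
        plt.scatter(coords[mask, 0], coords[mask, 1], s=8, label=event_type)
    plt.legend(markerscale=2, fontsize=6)
    plt.title(title)
    plt.show()
```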
In Figure 10, we demonstrate three examples of event retrieval. The first shows a simple case where the query is unbounded, meaning that it contains an entity with a vague description “cells”, and the graphs returned detail that participant. In the second and third examples, the most similar events respect the structure and attributes of the reference (same types of participants and role played in the interaction), suggesting related or additional entities. Accordingly, it can be assumed that similar events refer to events with high-level semantic relationships rather than just identity relationships.
Figure 10. Query case studies on biomedical event graphs of different types and sizes. In each demo, the first column depicts the query and the others the similarity ranking of the retrieved graphs (and their mention). DDEGK correctly returns similar events in the structure or semantics of nodes/edges, also managing synonyms and acronyms thanks to SciBERT.

5.4. Cross-Graph Attention

We seek to understand how two biomedical event graphs are related to each other by DDEGK. Since there is no gold-standard correspondence between the two sets of nodes, we cannot quantitatively evaluate the goodness of the cross-graph alignments. However, the qualitative example in Figure 11—which is not intended as a conclusive evaluation of the system—shows how DDEGK typically pays the most attention to the correct entities or triggers while ignoring the irrelevant ones.
Figure 11. Qualitative example of DDEGK cross-graph attention, comparing two biomedical event graphs from CG13. Attention weights are visualized as a heatmap, while stronger node–node alignments are directly reported on the event graphs (dashed blue lines).

5.5. Semantic Textual Similarity

Language is highly ambiguous, with multiple ways to express the same concept unit, frequent high-level linguistic phenomena, and untold background knowledge. The superficial organization of a sentence is almost irrelevant for the identification of its real and deeper semantic content, which is instead determined by tectogrammatics []. Events allow us to remove noise and focus only on unambiguous relational knowledge involving relevant entities. To take a step in the research direction of comparing events and plain sentences within NLP applications, we also include an experiment on STS.
We consider three datasets on biomedical sentence similarity: BIOSSES [], MedSTS [], and CTR [], for a total of 271 sentence pairs (499 unique sentences) and normalized scores in $[0, 1]$. We process each sentence with seven pre-trained DeepEventMine models (GE11, EPI11, ID11, MLEE, GE13, CG13, PC13) [], managing to extract 184, 8, 142, 331, 198, 392, and 236 events, respectively. Overall, 324 sentences have at least one event (264 with more than one), yielding a dataset of 127 sentence pairs (80:20 train–test ratio), their event counterparts, and similarity scores. Subsequently, we convert events into event graphs and calculate entity embeddings as described in Section 4.1.1 and Section 4.2. By removing duplicated instances, we move from 1491 to 890 event graphs. Then, we apply DDEGK on the event graph population (128 random anchors, 4 node embedding dimensions, 3 encoding layers, 300 encoding and scoring epochs, $10^{-1}$ learning rate, [7, 7, 7] as the $\gamma_v$, $\tau_v$, $\tau_e$ preserving-loss coefficients), with a mean pooling in case of multiple events per sentence. Finally, we calculate semantically meaningful sentence embeddings with SBERT [] (all-mpnet-base-v2 from HuggingFace, https://huggingface.co/sentence-transformers/stsb-mpnet-base-v2, accessed on 11 December 2021).
We define a simple regression model composed of two fully connected linear layers (hidden size 1000) with a sigmoid in between, using the Adam optimizer, the L1 loss, 1000 epochs, and a learning rate equal to $10^{-4}$. We train it on three input embedding configurations: (i) sentences only (SBERT), (ii) events only (DDEGK), (iii) sentences + events (concatenated in this order). Remarkably, we register a mean squared error on the test set of 0.1379 for (i), 0.1216 for (ii), and 0.1598 for (iii). These results corroborate the effectiveness of events in grasping the essential knowledge mentioned in a text document. DDEGK event embeddings lead to better performance on downstream STS tasks than state-of-the-art approaches based on token sequences, while using 6× lower dimensionality.
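The regression head is small enough to reproduce in a few lines; this Keras version mirrors the description above (two linear layers with a sigmoid in between, L1 loss, Adam at 1e-4), though it is only a sketch and the original implementation may differ.

```python
import tensorflow as tf

def build_sts_regressor(input_dim: int, hidden: int = 1000) -> tf.keras.Model:
    """Two fully connected layers with a sigmoid in between, predicting a similarity score."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(hidden, activation="sigmoid", input_shape=(input_dim,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="mae")                   # L1 loss on normalized [0, 1] similarity scores
    return model

# Usage sketch for the events-only configuration (input dimension is illustrative):
# model = build_sts_regressor(input_dim=128)
# model.fit(X_train, y_train, epochs=1000, verbose=0)
```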

6. Discussion

Here, we briefly discuss findings and their implications, DDEGK limitations, and applicability to new scenarios.
DDEGK is an unsupervised and inductive method capable of mapping events into low-dimensional vectors, reflecting their structural and semantic similarities. It is designed to be highly interpretable, with a cross-graph isomorphic attention mechanism trained to preserve node and edge attributes. By merging DL architectures and NLP with symbolic structures and graph theory, it represents a powerful tool for automatically recognizing similarities between biomedical interactions mentioned by researchers in scientific publications, enabling their aggregation, quantification, and retrieval. It leverages deep graph kernels, but, as experimentally verified, it does not require computing the entire kernel matrix, providing greater scalability. Thus, learned embeddings make event-centered operations simpler and faster than comparable operations on graphs. We also argue that the representation of scientific knowledge in the form of events instead of textual documents can constitute a promising research direction for the performance improvement of different NLP tasks, especially when combined with memory-equipped neural networks [,,] and deep metric learning [].
On the other hand, there are also some problems and limitations. While non-individual representation learning yields greater accuracy in estimating event similarity, kernel-based approaches are more time-consuming than other types of embedding. Furthermore, the choice of anchor graphs is crucial for the solution's effectiveness and could be carried out more wisely through an ad hoc model. More specifically, we believe that a basis approximation could be obtained with a neural network [], choosing as anchor events those whose weights are the most orthogonal to each other and capture maximum variability; we leave this possibility as a future direction. Avoiding supervision by comparing against a set of references makes it possible to compute valid event embeddings even in the absence of paired datasets with annotated scores. However, it tends to surface higher-level similarities than supervised approaches, which are conceivably capable of capturing differences at a finer level of detail. In addition, extending the attention mechanism beyond node-to-node alignment could improve results and better reveal how two events are similar. Lastly, events may be accompanied by modifiers that reshape the interaction meaning, such as negation (e.g., “not”) and speculation (e.g., “may”, “might”, “could”). Such properties have not been considered within this paper, but they could be treated as additional trigger attributes to boost performance.
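To illustrate the anchor-selection direction we envisage (and explicitly leave as future work), the sketch below greedily picks anchors whose flattened encoder weight vectors are closest to mutually orthogonal; the weight-flattening step, the greedy strategy, and all names are assumptions for illustration, not a validated procedure.

```python
# Hypothetical greedy selection of anchor events whose (flattened) encoder
# weights are maximally orthogonal to each other; purely illustrative.
import numpy as np

def select_orthogonal_anchors(weight_vectors, n_anchors):
    # weight_vectors: (n_candidates, d) matrix of flattened encoder weights.
    W = weight_vectors / np.linalg.norm(weight_vectors, axis=1, keepdims=True)
    chosen = [0]  # start from an arbitrary candidate
    while len(chosen) < n_anchors:
        # pick the candidate with the lowest absolute cosine similarity
        # to the anchors already selected
        sims = np.abs(W @ W[chosen].T).max(axis=1)
        sims[chosen] = np.inf
        chosen.append(int(np.argmin(sims)))
    return chosen
```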
DDEGK is highly flexible and can be easily applied to new datasets, tasks, and domains, possibly even to generic labeled graphs. For instance, one of the domains for which we hold the highest expectations is the analysis of social posts shared by patients [,,,], e.g., detecting conversational threads [], or aggregating and quantifying the number of times an adverse drug reaction or symptom manifestation is reported under certain conditions. Four other applications that can benefit from event-enriched knowledge modeling are text classification [], long document summarization [], tutoring [,] and recommender [,] systems. Deploying our method on a new dataset essentially takes three steps: (i) select a pre-trained language model to encode node and edge text attributes; (ii) define a strategy for selecting a population subset with an anchor role (whose size, in principle, depends on graph characteristics and task complexity); and (iii) train a graph encoder for each anchor and use them to calculate target–anchor divergences from which whole-graph embeddings are derived; a minimal skeleton of these steps is sketched below. As for (iii), the encoder architectural complexity and training duration should be commensurate with graph sizes. The intent is to avoid overfitting within the anchor graph encoder, which would otherwise reduce its usefulness in recognizing similar target graphs. In the datasets covered by this study, the best results were obtained with encoder layers and encoding/scoring epochs set to 3 and 300, respectively. The coefficients that regulate the impact of structure and semantics (i.e., discrete and continuous attributes of nodes and edges) should be adjusted according to their importance for the specific application under consideration, possibly making them a training target. For example, in the case of biomedical events, the best setting was obtained by assigning semantics a weight 21× greater than the structure and giving equal importance to node types, edge labels, and node textual attributes.
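The three deployment steps can be summarized by the following high-level skeleton; every helper name (encode_attributes, select_anchors, train_anchor_encoder, divergence) is a hypothetical placeholder standing in for the corresponding component of the released code, not its actual API.

```python
# High-level sketch of deploying DDEGK on a new event-graph dataset.
import numpy as np

def ddegk_embeddings(graphs, encode_attributes, select_anchors,
                     train_anchor_encoder, divergence, n_anchors=128):
    # (i) encode node/edge textual attributes with a pre-trained language model
    graphs = [encode_attributes(g) for g in graphs]
    # (ii) pick a subset of the population to act as anchors
    anchors = select_anchors(graphs, n_anchors)
    # (iii) train one encoder per anchor, then describe every target graph
    #       by its vector of divergences from the anchors
    encoders = [train_anchor_encoder(a) for a in anchors]
    return np.array([[divergence(enc, g) for enc in encoders] for g in graphs])
```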

7. Conclusions

This paper presented Deep Divergence Event Graph Kernels, an unsupervised technique for learning whole-graph representations of biomedical events and their similarities. Our method compares event graphs against a set of anchor ones without requiring feature engineering. Based on their structural and semantic divergence, also measured by SciBERT, it learns task-agnostic embeddings and a graph kernel above them. Using an isomorphic attention mechanism, we align the nodes of two graphs without requiring a known correspondence between vertex ids, only preserving the structure and consistency of node/edge attributes (both discrete and continuous). Our experimental analysis shows that, despite being trained with only graph edges, the learned representations encode rich local and global information, resulting in a powerful embedding space. Moreover, when the learned event embeddings are used as features in tasks such as graph classification and clustering, we find them superior or competitive with those produced by other state-of-the-art solutions (even supervised ones). Furthermore, the strong recognition of semantic similarity between biomedical sentences highlights how incorporating dense and compact event-based features into NLP systems may be a promising research direction. Ultimately, in addition to being expressive, DDEGK models are highly informative. The control over anchor events and the presence of cross-graph attention weights allow a high level of insight into the alignment between two event graphs, making their similarity interpretable.
As future work, we will investigate (i) the automatic identification of mutually orthogonal anchor events, (ii) moving the alignment granularity from nodes to subgraphs, (iii) the creation of multi-modal spaces combining events and text, (iv) the integration of events and language models, (v) deep neural networks with event-based memories, and (vi) the application to events expressed not in the biomedical literature but by patients and caregivers within social posts.

Author Contributions

Conceptualization, G.F., G.M. and A.C.; methodology, G.M. and G.F.; software, G.F. and G.C.; validation, G.F.; formal analysis, G.F., G.M., A.C. and G.C.; investigation, G.F.; resources, G.M.; data curation, G.F.; writing—original draft preparation, G.F.; writing—review and editing, G.M. and A.C.; visualization, G.F.; supervision, G.M. and A.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

For replication purposes, code and datasets are publicly available at https://github.com/disi-unibo-nlu/ddegk, accessed on 11 December 2021.

Acknowledgments

We thank Eleonora Bertoni for her precious help in preparing the datasets, implementing the baseline, and conducting the experiments.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Figure A1. Distribution of event types (root ones, in case of nesting) and graph sizes within sampled datasets.

References

  1. Landhuis, E. Scientific literature: Information overload. Nature 2016, 535, 457–458. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  2. Frisoni, G.; Moro, G.; Carbonaro, A. A Survey on Event Extraction for Natural Language Understanding: Riding the Biomedical Literature Wave. IEEE Access 2021, 9, 160721–160757. [Google Scholar] [CrossRef]
  3. Pyysalo, S.; Ohta, T.; Miwa, M.; Cho, H.; Tsujii, J.; Ananiadou, S. Event extraction across multiple levels of biological organization. Bioinformatics 2012, 28, 575–581. [Google Scholar] [CrossRef] [PubMed]
  4. Kim, J.; Ohta, T.; Pyysalo, S.; Kano, Y.; Tsujii, J. Overview of BioNLP’09 Shared Task on Event Extraction. In Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task, BioNLP@HLT-NAACL 2009–Shared Task, Boulder, CO, USA, 5 June 2009; Association for Computational Linguistics: Stroudsburg, PA, USA, 2009; pp. 1–9. [Google Scholar]
  5. Kim, J.; Pyysalo, S.; Ohta, T.; Bossy, R.; Nguyen, N.L.T.; Tsujii, J. Overview of BioNLP Shared Task 2011. In Proceedings of the BioNLP Shared Task 2011 Workshop, Portland, OR, USA, 24 June 2011; Tsujii, J., Kim, J., Pyysalo, S., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2011; pp. 1–6. [Google Scholar]
  6. Nédellec, C.; Bossy, R.; Kim, J.; Kim, J.; Ohta, T.; Pyysalo, S.; Zweigenbaum, P. Overview of BioNLP Shared Task 2013. In Proceedings of the BioNLP Shared Task 2013 Workshop, Sofia, Bulgaria, 9 August 2013; Nédellec, C., Bossy, R., Kim, J., Kim, J., Ohta, T., Pyysalo, S., Zweigenbaum, P., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2013; pp. 1–7. [Google Scholar]
  7. Henry, S.; McInnes, B.T. Literature Based Discovery: Models, methods, and trends. J. Biomed. Inf. 2017, 74, 20–32. [Google Scholar] [CrossRef]
  8. Björne, J.; Ginter, F.; Pyysalo, S.; Tsujii, J.; Salakoski, T. Complex event extraction at PubMed scale. Bioinformatics 2010, 26, 382–390. [Google Scholar] [CrossRef] [Green Version]
  9. Miwa, M.; Ohta, T.; Rak, R.; Rowley, A.; Kell, D.B.; Pyysalo, S.; Ananiadou, S. A method for integrating and ranking the evidence for biochemical pathways by mining reactions from text. Bioinformatics 2013, 29, 44–52. [Google Scholar] [CrossRef] [Green Version]
  10. Zhang, T.; Chen, M.; Bui, A.A.T. Diagnostic Prediction with Sequence-of-sets Representation Learning for Clinical Events. In Proceedings of the 18th International Conference on Artificial Intelligence in Medicine AIME 2020, Minneapolis, MN, USA, 25–28 August 2020; Michalowski, M., Moskovitch, R., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2020; Volume 12299, pp. 348–358. [Google Scholar] [CrossRef]
  11. Berant, J.; Srikumar, V.; Chen, P.; Linden, A.V.; Harding, B.; Huang, B.; Clark, P.; Manning, C.D. Modeling Biological Processes for Reading Comprehension. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) 2014, Doha, Qatar, 25–29 October 2014; A Meeting of SIGDAT, a Special Interest Group of the ACL. Moschitti, A., Pang, B., Daelemans, W., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2014. [Google Scholar]
  12. Bronstein, M.M.; Bruna, J.; LeCun, Y.; Szlam, A.; Vandergheynst, P. Geometric Deep Learning: Going beyond Euclidean data. IEEE Signal Process. Mag. 2017, 34, 18–42. [Google Scholar] [CrossRef] [Green Version]
  13. Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Yu, P.S. A Comprehensive Survey on Graph Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 4–24. [Google Scholar] [CrossRef] [Green Version]
  14. Blumenthal, D.B.; Gamper, J. On the exact computation of the graph edit distance. Pattern Recognit. Lett. 2020, 134, 46–57. [Google Scholar] [CrossRef]
  15. Ma, G.; Ahmed, N.K.; Willke, T.L.; Yu, P.S. Deep graph similarity learning: A survey. Data Min. Knowl. Discov. 2021, 35, 688–725. [Google Scholar] [CrossRef]
  16. Chen, F.; Wang, Y.C.; Wang, B.; Kuo, C.C.J. Graph representation learning: A survey. APSIPA Trans. Signal Inf. Process. 2020, 9, E15. [Google Scholar] [CrossRef]
  17. Li, Y.; Gu, C.; Dullien, T.; Vinyals, O.; Kohli, P. Graph Matching Networks for Learning the Similarity of Graph Structured Objects. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, Long Beach, CA, USA, 9–15 June 2019; Volume 97, pp. 3835–3845. [Google Scholar]
  18. Sousa, R.T.; Silva, S.; Pesquita, C. Supervised biomedical semantic similarity. bioRxiv 2021. [Google Scholar] [CrossRef]
  19. Al-Rfou, R.; Perozzi, B.; Zelle, D. DDGK: Learning Graph Representations for Deep Divergence Graph Kernels. In Proceedings of the World Wide Web Conference (WWW 2019), San Francisco, CA, USA, 13–17 May 2019; Liu, L., White, R.W., Mantrach, A., Silvestri, F., McAuley, J.J., Baeza-Yates, R., Zia, L., Eds.; ACM: Baltimore, MD, USA, 2019; pp. 37–48. [Google Scholar] [CrossRef] [Green Version]
  20. Cao, S.; Lu, W.; Xu, Q. GraRep: Learning Graph Representations with Global Structural Information. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, Melbourne, Australia, 18–23 October 2015; pp. 891–900. [Google Scholar]
  21. Ou, M.; Cui, P.; Pei, J.; Zhang, Z.; Zhu, W. Asymmetric Transitivity Preserving Graph Embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1105–1114. [Google Scholar]
  22. Qiu, J.; Dong, Y.; Ma, H.; Li, J.; Wang, K.; Tang, J. Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, Los Angeles, CA, USA, 5–9 February 2018; pp. 459–467. [Google Scholar]
  23. Perozzi, B.; Al-Rfou, R.; Skiena, S. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; pp. 701–710. [Google Scholar]
  24. Makarov, I.; Kiselev, D.; Nikitinsky, N.; Subelj, L. Survey on graph embeddings and their applications to machine learning problems on graphs. PeerJ Comput. Sci. 2021, 7, e357. [Google Scholar] [CrossRef]
  25. Tang, J.; Qu, M.; Wang, M.; Zhang, M.; Yan, J.; Mei, Q. LINE: Large-scale Information Network Embedding. In Proceedings of the 24th International Conference on World Wide Web, Florence, Italy, 18–22 May 2015; pp. 1067–1077. [Google Scholar]
  26. Grover, A.; Leskovec, J. node2vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 855–864. [Google Scholar]
  27. Dong, Y.; Chawla, N.V.; Swami, A. metapath2vec: Scalable Representation Learning for Heterogeneous Networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 135–144. [Google Scholar]
  28. Chen, H.; Perozzi, B.; Hu, Y.; Skiena, S. HARP: Hierarchical Representation Learning for Networks. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 2127–2134. [Google Scholar]
  29. Abu-El-Haija, S.; Perozzi, B.; Al-Rfou, R.; Alemi, A.A. Watch Your Step: Learning Node Embeddings via Graph Attention. In Proceedings of the 2018 Conference on Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 3–8 December 2018; pp. 9198–9208. [Google Scholar]
  30. Rozemberczki, B.; Davies, R.; Sarkar, R.; Sutton, C. GEMSEC: Graph embedding with self clustering. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, Vancouver, BC, Canada, 27–30 August 2019; pp. 65–72. [Google Scholar]
  31. Yang, C.; Liu, Z.; Zhao, D.; Sun, M.; Chang, E.Y. Network Representation Learning with Rich Text Information. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015; AAAI Press: Menlo Park, CA, USA, 2015; pp. 2111–2117. [Google Scholar]
  32. Ahn, S.; Kim, M.H. Variational Graph Normalized Auto-Encoders. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Online, 1–5 November 2021. [Google Scholar]
  33. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  34. Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. arXiv 2017, arXiv:1710.10903. [Google Scholar]
  35. Hamilton, W.L.; Ying, Z.; Leskovec, J. Inductive Representation Learning on Large Graphs. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 1024–1034. [Google Scholar]
  36. Cai, H.; Zheng, V.W.; Chang, K.C. A Comprehensive Survey of Graph Embedding: Problems, Techniques, and Applications. IEEE Trans. Knowl. Data Eng. 2018, 30, 1616–1637. [Google Scholar] [CrossRef] [Green Version]
  37. Chami, I.; Abu-El-Haija, S.; Perozzi, B.; Ré, C.; Murphy, K. Machine Learning on Graphs: A Model and Comprehensive Taxonomy. arXiv 2020, arXiv:2005.03675. [Google Scholar]
  38. Nikolentzos, G.; Meladianos, P.; Vazirgiannis, M. Matching Node Embeddings for Graph Similarity. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; AAAI Press: Menlo Park, CA, USA, 2017; pp. 2429–2435. [Google Scholar]
  39. Rubner, Y.; Tomasi, C.; Guibas, L.J. The Earth Mover’s Distance as a Metric for Image Retrieval. Int. J. Comput. Vis. 2000, 40, 99–121. [Google Scholar] [CrossRef]
  40. Tixier, A.J.; Nikolentzos, G.; Meladianos, P.; Vazirgiannis, M. Graph Classification with 2D Convolutional Neural Networks. In Lecture Notes in Computer Science, Proceedings of the International Conference on Artificial Neural Networks ICANN, Munich, Germany, 17–19 September 2019; Springer: Berlin/Heidelberg, Germany, 2019; Volume 11731, pp. 578–593. [Google Scholar]
  41. Battaglia, P.W.; Hamrick, J.B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.F.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R.; et al. Relational inductive biases, deep learning, and graph networks. arXiv 2018, arXiv:1806.01261. [Google Scholar]
  42. Atamna, A.; Sokolovska, N.; Crivello, J.C. SPI-GCN: A simple permutation-invariant graph convolutional network. Available online: https://hal.archives-ouvertes.fr/hal-02093451/ (accessed on 11 December 2021).
  43. Zhang, J. Graph Neural Distance Metric Learning with Graph-Bert. arXiv 2020, arXiv:2002.03427. [Google Scholar]
  44. Ying, Z.; You, J.; Morris, C.; Ren, X.; Hamilton, W.L.; Leskovec, J. Hierarchical Graph Representation Learning with Differentiable Pooling. In Proceedings of the 2018 Conference on Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 3–8 December 2018; pp. 4805–4815. [Google Scholar]
  45. Zhang, M.; Cui, Z.; Neumann, M.; Chen, Y. An End-to-End Deep Learning Architecture for Graph Classification. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; AAAI Press: Menlo Park, CA, USA, 2018; pp. 4438–4445. [Google Scholar]
  46. Gao, H.; Ji, S. Graph U-Nets. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; Volume 97, pp. 2083–2092. [Google Scholar]
  47. Lee, J.; Lee, I.; Kang, J. Self-Attention Graph Pooling. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; Volume 97, pp. 3734–3743. [Google Scholar]
  48. Ahmadi, A.H.K.; Hassani, K.; Moradi, P.; Lee, L.; Morris, Q. Memory-Based Graph Networks. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  49. Niepert, M.; Ahmed, M.; Kutzkov, K. Learning Convolutional Neural Networks for Graphs. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; Volume 48, pp. 2014–2023. [Google Scholar]
  50. Adhikari, B.; Zhang, Y.; Ramakrishnan, N.; Prakash, B.A. Distributed Representations of Subgraphs. In Proceedings of the 2017 IEEE International Conference on Data Mining Workshops (ICDMW), New Orleans, LA, USA, 18–21 November 2017; pp. 111–117. [Google Scholar]
  51. Narayanan, A.; Chandramohan, M.; Venkatesan, R.; Chen, L.; Liu, Y.; Jaiswal, S. graph2vec: Learning Distributed Representations of Graphs. arXiv 2017, arXiv:1707.05005. [Google Scholar]
  52. Liu, S.; Demirel, M.F.; Liang, Y. N-Gram Graph: Simple Unsupervised Representation for Graphs, with Applications to Molecules. In Proceedings of the 2019 Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; pp. 8464–8476. [Google Scholar]
  53. Bai, Y.; Ding, H.; Bian, S.; Chen, T.; Sun, Y.; Wang, W. SimGNN: A Neural Network Approach to Fast Graph Similarity Computation. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, Melbourne, Australia, 11–15 February 2019; pp. 384–392. [Google Scholar]
  54. Domeniconi, G.; Moro, G.; Pasolini, R.; Sartori, C. A Comparison of Term Weighting Schemes for Text Classification and Sentiment Analysis with a Supervised Variant of tf.idf. In Proceedings of the International Conference on Data Management Technologies and Applications, Colmar, France, 20–22 July 2015; pp. 39–58. [Google Scholar] [CrossRef]
  55. Papadimitriou, P.; Dasdan, A.; Garcia-Molina, H. Web graph similarity for anomaly detection. J. Internet Serv. Appl. 2010, 1, 19–30. [Google Scholar] [CrossRef] [Green Version]
  56. Faloutsos, C.; Koutra, D.; Vogelstein, J.T. DELTACON: A Principled Massive-Graph Similarity Function. In Proceedings of the 2013 SIAM International Conference on Data Mining, Austin, TX, USA, 2–4 May 2013; pp. 162–170. [Google Scholar]
  57. Ji, S.; Pan, S.; Cambria, E.; Marttinen, P.; Yu, P.S. A Survey on Knowledge Graphs: Representation, Acquisition and Applications. IEEE Trans. Neural Netw. Learn. Syst. 2021, 1–21. [Google Scholar] [CrossRef]
  58. Cai, L.; Wang, W.Y. KBGAN: Adversarial Learning for Knowledge Graph Embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT 2018, New Orleans, LA, USA, 2–4 June 2018; pp. 1470–1480. [Google Scholar]
  59. Bordes, A.; Usunier, N.; García-Durán, A.; Weston, J.; Yakhnenko, O. Translating Embeddings for Modeling Multi-relational Data. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA, 5–10 December 2013; pp. 2787–2795. [Google Scholar]
  60. Yang, B.; Yih, W.; He, X.; Gao, J.; Deng, L. Embedding Entities and Relations for Learning and Inference in Knowledge Bases. In 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  61. Nickel, M.; Rosasco, L.; Poggio, T.A. Holographic Embeddings of Knowledge Graphs. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI), Phoenix, AZ, USA, 12–17 February 2016; AAAI Press: Palo Alto, CA, USA, 2016; pp. 1955–1961. [Google Scholar]
  62. Trouillon, T.; Welbl, J.; Riedel, S.; Gaussier, É.; Bouchard, G. Complex Embeddings for Simple Link Prediction. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; Volume 48, pp. 2071–2080. [Google Scholar]
  63. Socher, R.; Chen, D.; Manning, C.D.; Ng, A.Y. Reasoning With Neural Tensor Networks for Knowledge Base Completion. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA, 5–10 December 2013; pp. 926–934. [Google Scholar]
  64. Ristoski, P.; Paulheim, H. RDF2Vec: RDF Graph Embeddings for Data Mining. In Proceedings of the International Semantic Web Conference (ISWC), Kobe, Japan, 17–21 October 2016; Volume 9981, pp. 498–514. [Google Scholar]
  65. Dettmers, T.; Minervini, P.; Stenetorp, P.; Riedel, S. Convolutional 2D Knowledge Graph Embeddings. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, 2–7 February 2018; AAAI Press: Palo Alto, CA, USA, 2018; pp. 1811–1818. [Google Scholar]
  66. Schlichtkrull, M.S.; Kipf, T.N.; Bloem, P.; van den Berg, R.; Titov, I.; Welling, M. Modeling Relational Data with Graph Convolutional Networks. In Lecture Notes in Computer Science, Proceedings of the 2018 Extended Semantic Web Conference (ESWC), Crete, Greece, 3–7 June 2018; Springer: Heidelberg, Germany, 2018; Volume 10843, pp. 593–607. [Google Scholar]
  67. Xie, R.; Liu, Z.; Jia, J.; Luan, H.; Sun, M. Representation Learning of Knowledge Graphs with Entity Descriptions. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Phoenix, AZ, USA, 12–17 February 2016; AAAI Press: Palo Alto, CA, USA, 2016; pp. 2659–2665. [Google Scholar]
  68. Wang, Q.; Huang, P.; Wang, H.; Dai, S.; Jiang, W.; Liu, J.; Lyu, Y.; Zhu, Y.; Wu, H. CoKE: Contextualized Knowledge Graph Embedding. arXiv 2019, arXiv:1911.02168. [Google Scholar]
  69. Hu, L.; Zhang, M.; Li, S.; Shi, J.; Shi, C.; Yang, C.; Liu, Z. Text-Graph Enhanced Knowledge Graph Representation Learning. Front. Artif. Intell. 2021, 4, 697856. [Google Scholar] [CrossRef]
  70. Grohe, M. word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings of Structured Data. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Portland, OR, USA, 14–19 June 2020; pp. 1–16. [Google Scholar]
  71. Wu, L.; Yen, I.E.; Xu, F.; Ravikumar, P.; Witbrock, M. D2KE: From Distance to Kernel and Embedding. arXiv 2018, arXiv:1802.04956. [Google Scholar]
  72. Bunke, H.; Allermann, G. Inexact graph matching for structural pattern recognition. Pattern Recognit. Lett. 1983, 1, 245–253. [Google Scholar] [CrossRef]
  73. Bunke, H.; Shearer, K. A graph distance metric based on the maximal common subgraph. Pattern Recognit. Lett. 1998, 19, 255–259. [Google Scholar] [CrossRef]
  74. Liang, Y.; Zhao, P. Similarity Search in Graph Databases: A Multi-Layered Indexing Approach. In Proceedings of the 2017 IEEE 33rd International Conference on Data Engineering (ICDE), San Diego, CA, USA, 19–22 April 2017; pp. 783–794. [Google Scholar]
  75. Daller, É.; Bougleux, S.; Gaüzère, B.; Brun, L. Approximate Graph Edit Distance by Several Local Searches in Parallel. In Proceedings of the 7th International Conference on Pattern Recognition Applications and Methods, Funchal, Portugal, 16–18 January 2018; pp. 149–158. [Google Scholar]
  76. Bai, Y.; Ding, H.; Sun, Y.; Wang, W. Convolutional Set Matching for Graph Similarity. arXiv 2018, arXiv:1810.10866. [Google Scholar]
  77. Ktena, S.I.; Parisot, S.; Ferrante, E.; Rajchl, M.; Lee, M.C.H.; Glocker, B.; Rueckert, D. Metric learning with spectral graph convolutions on brain connectivity networks. NeuroImage 2018, 169, 431–442. [Google Scholar] [CrossRef]
  78. Ma, G.; Ahmed, N.K.; Willke, T.L.; Sengupta, D.; Cole, M.W.; Turk-Browne, N.B.; Yu, P.S. Deep Graph Similarity Learning for Brain Data Analysis. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; pp. 2743–2751. [Google Scholar]
  79. Wang, S.; Chen, Z.; Yu, X.; Li, D.; Ni, J.; Tang, L.; Gui, J.; Li, Z.; Chen, H.; Yu, P.S. Heterogeneous Graph Matching Networks for Unknown Malware Detection. In Proceedings of the 2019 International Joint Conference on Artificial Intelligence (IJCAI), Macao, China, 10–16 August 2019; pp. 3762–3770. [Google Scholar]
  80. Bai, Y.; Ding, H.; Qiao, Y.; Marinovic, A.; Gu, K.; Chen, T.; Sun, Y.; Wang, W. Unsupervised inductive graph-level representation learning via graph-graph proximity. arXiv 2019, arXiv:1904.01098. [Google Scholar]
  81. Borgwardt, K.M.; Kriegel, H. Shortest-Path Kernels on Graphs. In Proceedings of the 5th IEEE International Conference on Data Mining (ICDM 2005), Houston, TX, USA, 27–30 November 2005; pp. 74–81. [Google Scholar]
  82. Shervashidze, N.; Vishwanathan, S.V.N.; Petri, T.; Mehlhorn, K.; Borgwardt, K.M. Efficient graphlet kernels for large graph comparison. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS), Clearwater Beach, FL, USA, 16–18 April 2009; Volume 5, pp. 488–495. [Google Scholar]
  83. Shervashidze, N.; Schweitzer, P.; van Leeuwen, E.J.; Mehlhorn, K.; Borgwardt, K.M. Weisfeiler-Lehman Graph Kernels. J. Mach. Learn. Res. 2011, 12, 2539–2561. [Google Scholar]
  84. Kondor, R.; Pan, H. The Multiscale Laplacian Graph Kernel. In Proceedings of the 2016 Conference on Neural Information Processing Systems (NIPS), Barcelona, Spain, 5–10 December 2016; pp. 2982–2990. [Google Scholar]
  85. Yanardag, P.; Vishwanathan, S.V.N. Deep Graph Kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10–13 August 2015; pp. 1365–1374. [Google Scholar]
  86. Ding, X.; Zhang, Y.; Liu, T.; Duan, J. Deep Learning for Event-Driven Stock Prediction. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015; AAAI Press: Palo Alto, CA, USA, 2015; pp. 2327–2333. [Google Scholar]
  87. Ding, X.; Zhang, Y.; Liu, T.; Duan, J. Knowledge-Driven Event Embedding for Stock Prediction. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, 11–16 December 2016; pp. 2133–2142. [Google Scholar]
  88. Weber, N.; Balasubramanian, N.; Chambers, N. Event Representations with Tensor-Based Compositions. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, 2–7 February 2018; AAAI Press: Palo Alto, CA, USA, 2018; pp. 4946–4953. [Google Scholar]
  89. Ding, X.; Liao, K.; Liu, T.; Li, Z.; Duan, J. Event Representation Learning Enhanced with External Commonsense Knowledge. In Proceedings of the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, 3–7 November 2019; pp. 4893–4902. [Google Scholar]
  90. Kavumba, P.; Inoue, N.; Inui, K. Exploring Supervised Learning of Hierarchical Event Embedding with Poincaré Embeddings. In Proceedings of the 25th Annual Meeting of the Association for Natural Language Processing (ANLP), Kyoto, Japan, 12–25 March 2019; pp. 217–220. [Google Scholar]
  91. Trieu, H.; Tran, T.T.; Nguyen, A.D.; Nguyen, A.; Miwa, M.; Ananiadou, S. DeepEventMine: End-to-end neural nested event extraction from biomedical texts. Bioinformatics 2020, 36, 4910–4917. [Google Scholar] [CrossRef] [PubMed]
  92. Gui, H.; Liu, J.; Tao, F.; Jiang, M.; Norick, B.; Kaplan, L.M.; Han, J. Embedding Learning with Events in Heterogeneous Information Networks. IEEE Trans. Knowl. Data Eng. 2017, 29, 2428–2441. [Google Scholar] [CrossRef]
  93. Kriege, N.M.; Johansson, F.D.; Morris, C. A survey on graph kernels. Appl. Netw. Sci. 2020, 5, 6. [Google Scholar] [CrossRef] [Green Version]
  94. Kim, J.; Nguyen, N.L.T.; Wang, Y.; Tsujii, J.; Takagi, T.; Yonezawa, A. The Genia Event and Protein Coreference tasks of the BioNLP Shared Task 2011. BMC Bioinf. 2012, 13, S1. [Google Scholar] [CrossRef] [Green Version]
  95. Ohta, T.; Pyysalo, S.; Tsujii, J. Overview of the Epigenetics and Post-translational Modifications (EPI) task of BioNLP Shared Task 2011. In Proceedings of the BioNLP Shared Task 2011 Workshop, Portland, OR, USA, 24 June 2011; Tsujii, J., Kim, J., Pyysalo, S., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2011; pp. 16–25. [Google Scholar]
  96. Pyysalo, S.; Ohta, T.; Rak, R.; Sullivan, D.E.; Mao, C.; Wang, C.; Sobral, B.W.S.; Tsujii, J.; Ananiadou, S. Overview of the Infectious Diseases (ID) task of BioNLP Shared Task 2011. In Proceedings of the BioNLP Shared Task 2011 Workshop, Portland, OR, USA, 24 June 2011; Tsujii, J., Kim, J., Pyysalo, S., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2011; pp. 26–35. [Google Scholar]
  97. Kim, J.; Wang, Y.; Yamamoto, Y. The Genia Event Extraction Shared Task, 2013 Edition—Overview. In Proceedings of the BioNLP Shared Task 2013 Workshop, Sofia, Bulgaria, 9 August 2013; Nédellec, C., Bossy, R., Kim, J., Kim, J., Ohta, T., Pyysalo, S., Zweigenbaum, P., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2013; pp. 8–15. [Google Scholar]
  98. Pyysalo, S.; Ohta, T.; Ananiadou, S. Overview of the Cancer Genetics (CG) task of BioNLP Shared Task 2013. In Proceedings of the BioNLP Shared Task 2013 Workshop, Sofia, Bulgaria, 9 August 2013; Nédellec, C., Bossy, R., Kim, J., Kim, J., Ohta, T., Pyysalo, S., Zweigenbaum, P., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2013; pp. 58–66. [Google Scholar]
  99. Ohta, T.; Pyysalo, S.; Rak, R.; Rowley, A.; Chun, H.; Jung, S.; Choi, S.; Ananiadou, S.; Tsujii, J. Overview of the Pathway Curation (PC) task of BioNLP Shared Task 2013. In Proceedings of the BioNLP Shared Task 2013 Workshop, Sofia, Bulgaria, 9 August 2013; Nédellec, C., Bossy, R., Kim, J., Kim, J., Ohta, T., Pyysalo, S., Zweigenbaum, P., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2013; pp. 67–75. [Google Scholar]
  100. Kim, J.; Han, X.; Lee, V.; Rebholz-Schuhmann, D. GRO Task: Populating the Gene Regulation Ontology with events and relations. In Proceedings of the BioNLP Shared Task 2013 Workshop, Sofia, Bulgaria, 9 August 2013; Nédellec, C., Bossy, R., Kim, J., Kim, J., Ohta, T., Pyysalo, S., Zweigenbaum, P., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2013; pp. 50–57. [Google Scholar]
  101. Dobson, P.D.; Doig, A.J. Distinguishing enzyme structures from non-enzymes without alignments. J. Mol. Biol. 2003, 330, 771–783. [Google Scholar] [CrossRef] [Green Version]
  102. Toivonen, H.; Srinivasan, A.; King, R.D.; Kramer, S.; Helma, C. Statistical Evaluation of the Predictive Toxicology Challenge 2000–2001. Bioinformatics 2003, 19, 1183–1193. [Google Scholar] [CrossRef] [Green Version]
  103. Borgwardt, K.M.; Ong, C.S.; Schönauer, S.; Vishwanathan, S.V.N.; Smola, A.J.; Kriegel, H. Protein function prediction via graph kernels. Bioinformatics 2005, 21, 47–56. [Google Scholar] [CrossRef] [Green Version]
  104. Debnath, A.K.; Lopez de Compadre, R.L.; Debnath, G.; Shusterman, A.J.; Hansch, C. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. correlation with molecular orbital energies and hydrophobicity. J. Med. Chem. 1991, 34, 786–797. [Google Scholar] [CrossRef]
  105. Freitas, S.; Dong, Y.; Neil, J.; Chau, D.H. A Large-Scale Database for Graph Representation Learning. arXiv 2020, arXiv:2011.07682. [Google Scholar]
  106. Beltagy, I.; Lo, K.; Cohan, A. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 3613–3618. [Google Scholar]
  107. Gionis, A.; Indyk, P.; Motwani, R. Similarity Search in High Dimensions via Hashing. In Proceedings of the 16th International Conference on Very Large Data Bases (VLDB), Queensland, Australia, 13–16 August 1999; pp. 518–529. [Google Scholar]
  108. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240. [Google Scholar] [CrossRef]
  109. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet Allocation. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 3–8 December 2001; MIT Press: Cambridge, MA, USA, 2001; pp. 601–608. [Google Scholar]
  110. Miwa, M.; Pyysalo, S.; Ohta, T.; Ananiadou, S. Wide coverage biomedical event extraction using multiple partially overlapping corpora. BMC Bioinform. 2013, 14, 175. [Google Scholar] [CrossRef] [Green Version]
  111. Schaeffer, S.E. Graph clustering. Comput. Sci. Rev. 2007, 1, 27–64. [Google Scholar] [CrossRef]
  112. Sgall, P.; Hajičová, E.; Panevová, J. The Meaning of the Sentence in Its Semantic and Pragmatic Aspects; Springer Science & Business Media: New York, NY, USA, 1986. [Google Scholar]
  113. Sogancioglu, G.; Öztürk, H.; Özgür, A. BIOSSES: A semantic sentence similarity estimation system for the biomedical domain. Bioinformatics 2017, 33, i49–i58. [Google Scholar] [CrossRef] [Green Version]
  114. Wang, Y.; Afzal, N.; Fu, S.; Wang, L.; Shen, F.; Rastegar-Mojarad, M.; Liu, H. MedSTS: A resource for clinical semantic textual similarity. Lang. Resour. Eval. 2020, 54, 57–72. [Google Scholar] [CrossRef] [Green Version]
  115. Lithgow-Serrano, O.; Gama-Castro, S.; Ishida-Gutiérrez, C.; Mejía-Almonte, C.; Tierrafría, V.H.; Martínez-Luna, S.; Santos-Zavaleta, A.; Velázquez-Ramírez, D.A.; Collado-Vides, J. Similarity corpus on microbial transcriptional regulation. J. Biomed. Semant. 2019, 10, 8. [Google Scholar] [CrossRef] [Green Version]
  116. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 3980–3990. [Google Scholar]
  117. Domeniconi, G.; Moro, G.; Pagliarani, A.; Pasolini, R. On Deep Learning in Cross-Domain Sentiment Classification. In Proceedings of the 9th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management IC3K, Funchal, Portugal, 1–3 November 2017; pp. 50–60. [Google Scholar] [CrossRef]
  118. Moro, G.; Pagliarani, A.; Pasolini, R.; Sartori, C. Cross-domain & In-domain Sentiment Analysis with Memory-based Deep Neural Networks. In Proceedings of the 10th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, Seville, Spain, 18–20 September 2018; pp. 127–138. [Google Scholar] [CrossRef]
  119. Lewis, P.S.H.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv 2020, arXiv:2005.11401. [Google Scholar]
  120. Moro, G.; Valgimigli, L. Efficient Self-Supervised Metric Information Retrieval: A Bibliography Based Method Applied to COVID Literature. Sensors 2021, 21, 6430. [Google Scholar] [CrossRef]
  121. Li, S.; Jia, K.; Wen, Y.; Liu, T.; Tao, D. Orthogonal Deep Neural Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 1352–1368. [Google Scholar] [CrossRef] [Green Version]
  122. Frisoni, G.; Moro, G.; Carbonaro, A. Learning Interpretable and Statistically Significant Knowledge from Unlabeled Corpora of Social Text Messages: A Novel Methodology of Descriptive Text Mining. In Proceedings of the 9th International Conference on Data Science, Technologies and Applications (DATA), Online, 7–9 July 2020; pp. 121–132. Available online: https://www.scitepress.org/Papers/2020/98920/98920.pdf (accessed on 11 December 2021).
  123. Frisoni, G.; Moro, G. Phenomena Explanation from Text: Unsupervised Learning of Interpretable and Statistically Significant Knowledge. In Proceedings of the International Conference on Data Management Technologies and Applications, Prague, Czech Republic, 26–28 July 2020; Springer: Berlin/Heidelberg, Germany, 2020; Volume 1446, pp. 293–318. [Google Scholar] [CrossRef]
  124. Frisoni, G.; Moro, G.; Carbonaro, A. Unsupervised Descriptive Text Mining for Knowledge Graph Learning. In Proceedings of the 12th International Conference on Knowledge Discovery and Information Retrieval KDIR, Budapest, Hungary, 2–4 November 2020; Volume 1, pp. 316–324. [Google Scholar]
  125. Frisoni, G.; Moro, G.; Carbonaro, A. Towards Rare Disease Knowledge Graph Learning from Social Posts of Patients. In Proceedings of the International Research & Innovation Forum, Athens, Greece, 15–17 April 2020; Springer: Berlin/Heidelberg, Germany; pp. 577–589. [Google Scholar] [CrossRef]
  126. Domeniconi, G.; Semertzidis, K.; López, V.; Daly, E.M.; Kotoulas, S.; Moro, G. A Novel Method for Unsupervised and Supervised Conversational Message Thread Detection. In Proceedings of the 5th International Conference on Data Management Technologies and Applications, Lisbon, Portugal, 24–26 July 2016; pp. 43–54. [Google Scholar] [CrossRef]
  127. Domeniconi, G.; Moro, G.; Pasolini, R.; Sartori, C. Iterative Refining of Category Profiles for Nearest Centroid Cross-Domain Text Classification. In Proceedings of the 6th International Joint Conference, Rome, Italy, 21–24 October 2014; pp. 50–67. [Google Scholar] [CrossRef]
  128. Moro, G.; Ragazzi, L. Semantic Self-segmentation for Abstractive Summarization of Long Legal Documents in Low-resource Regimes. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 22 February–1 March 2022; pp. 1–9. [Google Scholar]
  129. Riccucci, S.; Carbonaro, A.; Casadei, G. Knowledge Acquisition in Intelligent Tutoring System: A Data Mining Approach. In Proceedings of the 6th Mexican International Conference on Artificial Intelligence, Aguascalientes, Mexico, 4–10 November 2007; pp. 1195–1205. [Google Scholar]
  130. Riccucci, S.; Carbonaro, A.; Casadei, G. An Architecture for Knowledge Management in Intelligent Tutoring System. In Proceedings of the Cognition and Exploratory Learning in Digital Age, CELDA 2005, Porto, Portugal, 14–16 December 2005; pp. 473–476. [Google Scholar]
  131. Andronico, A.; Carbonaro, A.; Colazzo, L.; Molinari, A. Personalisation services for learning management systems in mobile settings. Int. J. Contin. Eng. Educ. Life-Long Learn. 2004, 14, 353–369. [Google Scholar] [CrossRef]
  132. Domeniconi, G.; Moro, G.; Pagliarani, A.; Pasini, K.; Pasolini, R. Job Recommendation from Semantic Similarity of LinkedIn Users’ Skills. In Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods, Rome, Italy, 24–26 February 2016; pp. 270–277. [Google Scholar] [CrossRef] [Green Version]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
