Article

Accuracy Analysis of the End-to-End Extraction of Related Named Entities from Russian Drug Review Texts by Modern Approaches Validated on English Biomedical Corpora

1 Complex of NBICS Technology, National Research Center “Kurchatov Institute”, Academic Kurchatov sq., 123182 Moscow, Russia
2 Department of Computer and Engineering Modeling, National Research Nuclear University “MEPhI”, Kashirsk. hw., 115409 Moscow, Russia
3 Department of Automated Systems of Organizational Management, Russian Technological University “MIREA”, Vernadsky av., 119296 Moscow, Russia
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(2), 354; https://doi.org/10.3390/math11020354
Submission received: 20 November 2022 / Revised: 30 December 2022 / Accepted: 3 January 2023 / Published: 9 January 2023

Abstract

The extraction of significant information from Internet sources is an important task of pharmacovigilance due to the need for post-clinical drug monitoring. This research considers the task of end-to-end recognition of pharmaceutically significant named entities and their relations in natural-language texts. "End-to-end" means that both tasks are performed within a single process on "raw" text without annotation. The study is based on the current version of the Russian Drug Review Corpus, a dataset of 3800 review texts from the Russian segment of the Internet; currently, it is the only Russian-language corpus appropriate for research of this type. We estimated the accuracy of recognizing pharmaceutically significant entities and their relations with two approaches based on neural-network language models. The first approach sequentially solves the tasks of named-entity recognition and relation extraction (the sequential approach). The second solves both tasks simultaneously with a single neural network (the joint approach). The study includes a comparison of both approaches, along with hyperparameter selection to maximize the resulting accuracy. It is shown that both approaches solve the target task at the same level of accuracy: 52–53% macro-averaged F1-score, which is the current level of accuracy for end-to-end tasks in Russian. Additionally, the paper presents results for the English open datasets ADE and DDI based on the joint approach, with hyperparameter selection for modern domain-specific language models. The achieved accuracies of 84.2% (ADE) and 73.3% (DDI) are comparable to or better than other published results for these datasets.

1. Introduction

The exchange of information among users of online platforms makes it possible to study user experience, including experience with various drugs. This is important both for pharmaceutical companies and for pharmacovigilance services in the post-clinical monitoring of drug use.
Research on Internet texts is complicated by the specific vocabulary of Internet users, various errors, and inconsistent narration. When describing their experience with medications, users often compare the effects of different drugs under various circumstances (symptoms, diseases) and refer to the experience of other people. Thus, the task of identifying named entities that belong to the same user experience is important and quite difficult, especially for languages with free word order, such as Russian. One possible formalization of this problem decomposes it into: (1) recognition of named entities (NER) important from the point of view of pharmacovigilance; and (2) extraction of relations between them based on their usage in one user experience (RE).
Such a formalization makes it possible to build solutions based on approaches known in the literature: (A) sequential solution of the NER and RE subtasks (see Section 2.1 and Section 2.2), and (B) a joint solution (see Section 2.3). Both approaches have been extensively studied in the literature but have not been directly compared. The most efficient implementations are based on transformer-architecture neural networks trained on large collections of raw text data (language models).
The accuracy of such solutions depends on a number of factors, the key one being the available volume of labeled examples for tuning the neural-network models within the overall solution. For a number of languages, in particular English, several corpora of tagged biomedical texts, as well as corpora of reviews, have been developed (see Section 2). However, for the Russian language, studies of similar solutions for Internet texts of drug reviews have not previously been carried out due to the lack of an annotated data corpus of sufficient volume. We have now prepared a dataset of Russian-language review texts about medications (RDRS, Russian Drug Review Corpus) with markup of named entities and their correspondence to user experiences. This material allows us to assess the achievable accuracy of named-entity recognition and relation extraction for Russian-language reviews, which is the purpose of this work.
At the stage of forming the RDRS corpus, we previously tested individual elements of the sequential approach: neural-network models solving the NER and RE problems separately. In this work, these elements are combined into a single analysis process, with an assessment of the accuracy of the overall solution. On the other hand, recent studies on other languages show the promise of using a single neural-network model for both the NER and RE tasks, taking joint decision errors into account during training.
To evaluate the effectiveness of a single neural-network model on the RDRS corpus (see Section 3.2), we tuned it by choosing the most efficient language model and a combination of training hyperparameters via grid search (see Section 4.1). To validate the procedure, it was also applied to create a single solution based on the latest language models, including multilingual ones (see Section 3.5), with an assessment on English-language corpora. This made it possible to raise the accuracy of this approach to the state-of-the-art level achieved on those datasets and to compare it with previously obtained results.
Thus, the main contribution of this work is as follows:
  • An assessment of the end-to-end solution of the NER and RE problems was carried out on a corpus of annotated Russian-language Internet user reviews about medications;
  • A comparison of the sequential and joint approaches to the end-to-end problem was presented on different versions of the review corpus;
  • On English-language corpora, it was demonstrated that the use of modern language models combined with the hyperparameter-selection procedure makes it possible to raise the accuracy of the joint solution to the state of the art.
The article outline is presented below. Section 2 describes the tasks of named-entity recognition (NER) and relation extraction (RE) and outlines the current state of machine-learning methods for these tasks in the biomedical domain. Section 3.1 describes the Russian and English datasets used, and Section 3.2, Section 3.3, Section 3.4 and Section 3.5 characterize the approaches and language models used in the research. The experimental setup, the hyperparameter-selection procedure, the evaluation methodology and the achieved results are described in Section 4. A comparative analysis of the different approaches and examples of their work are presented in Section 5.

2. Related Works

2.1. Named Entity Recognition

The task of identifying significant entities is key for many practical areas: topic evolution [1], aspect-sentiment analysis [2], identification of personal communications [3], etc. For this, approaches based on information-retrieval methods [1,4,5] are used, as well as machine-learning methods, including neural networks. The latter are the subject of study in this work; they are applied when a labeled set of texts containing the markup necessary for entity recognition is available (hereafter, the NER task).
Traditionally, the NER task is considered a sequence-labeling task: for each token (punctuation mark, word or word part) j of text i, the model must output the tag of the corresponding entity type if the token refers to one. The most popular option [6] uses three classes of tokens for each entity type: Begin, Inside, Outside (BIO). The first two classes correspond to tokens of an entity mention that mark the beginning and the continuation of a named entity of a given type, respectively. The third class corresponds to tokens not included in any entity mention. These classes can be extended to BIOES [7] by adding class E for the end tokens of a named-entity phrase and class S for entity phrases consisting of one token.
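For illustration, the following minimal Python sketch converts entity spans into BIO tags; the span format (inclusive token indices plus a type label) is an assumption for the example, not the corpus format.

```python
# A minimal BIO-tagging sketch. Entities are given as (start, end, type)
# triples over token indices, with `end` inclusive (assumed format).
def bio_tags(tokens, entities):
    tags = ["O"] * len(tokens)            # "Outside" by default
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"        # "Begin" of the entity mention
        for j in range(start + 1, end + 1):
            tags[j] = f"I-{etype}"        # "Inside" (continuation)
    return tags

tokens = ["The", "drug", "caused", "a", "red", "rash"]
print(bio_tags(tokens, [(1, 1, "Drugname"), (4, 5, "ADR")]))
# ['O', 'B-Drugname', 'O', 'O', 'B-ADR', 'I-ADR']
```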
More formally, the statement of the named-entity recognition task is written as follows:

$x_i \in X_l : x_i = (token_j)_{j=1}^{J_i},$ (1)

where $X_l$ denotes the original text corpus of cardinality $l$; $x_i$ is text number $i$ from the corpus; $token_j$ is the token with number $j$ from text $i$; and $J_i$ is the number of tokens in text $i$. If $ET$ is the discrete space of entity types ($ET \subset \mathbb{Z}$), then:

$E_i = \{((s_{start}, s_{end})_{p=1}^{P_{ik}}, (tag)_{q=1}^{Q_{ik}})\}_{k=1}^{K_i}, \quad s_{start}^{ikp}, s_{end}^{ikp} \in x_i, \quad tag \in ET$ (2)

Here, $E_i$ is the set of entities of text $i$; $K_i$ is the number of entities in text $i$; $p = 1 \dots P_{ik}$ indexes the spans of entity $k$ of text $i$ (when $P_{ik} = 1\ \forall i,k$, the task does not include discontinuous entities); $q = 1 \dots Q_{ik}$ indexes the tags of entity $k$ of text $i$ (when $Q_{ik} = 1\ \forall i,k$, each entity has exactly one tag, so the task is a simple, not multilabel, classification); $s_{start}$ and $s_{end}$ are the numbers of the start and end tokens of span $p$ of entity $k$ of text $i$; and $tag$ is the $q$-th tag from the discrete space of entity types $ET$ assigned to entity $k$ of text $i$.
The above definitions (see Equations (1) and (2)) imply that, in general, the task requires a complete enumeration of combinations of tokens from $x_i$ with a mapping into a discrete space $ET$ of arbitrary dimension $d$. This enumeration grows exponentially with the number of input tokens; therefore, in practice, one of the following options is used: a labeling scheme for named entities that reduces the problem to token classification, or a limitation of the number of token combinations.
The first option, annotating named entities so that the task reduces to token classification, has the following formulation:

$scheme = \{B, I, O\}, \quad \forall\, tag \in ET : token_{ij} \mapsto scheme$ (3)

This means that, for each entity tag from the set of tags $ET$, each token $j$ of text $i$ is mapped to the one-dimensional discrete tag space of the NER annotation scheme.
For the second option, a simple and efficient method of reducing the space of possible token combinations is to exclude discontinuous entities and consider only spans of consecutive tokens [8], which drastically reduces the number of candidates.
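As an illustration of this reduction, the following sketch enumerates all contiguous spans up to a fixed maximum length; the function name and the default limit of 10 tokens are illustrative (the joint model in Section 3.3 relies on the same idea).

```python
# Enumerate all contiguous token spans of length 1..max_len; this replaces
# the exponential set of arbitrary token combinations with O(N * max_len)
# candidates.
def enumerate_spans(n_tokens, max_len=10):
    return [(i, j) for i in range(n_tokens)                  # span start
            for j in range(i, min(i + max_len, n_tokens))]   # inclusive end

print(len(enumerate_spans(6, max_len=3)))  # 15 candidate spans for 6 tokens
```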
Methods for the automatic extraction of significant pharmaceutical entities have evolved from classical machine-learning methods (for example, SVM [9,10,11], Random Forest [12], and the Naive Bayes classifier [13]) to increasingly complex neural-network models: convolutional [14], recurrent [15,16], graph [17], and various combinations [18,19]. Currently, the most efficient methods for extracting pharmaceutically significant named entities are neural networks based on pre-trained deep language models, which represent words as context-dependent vectors.
For example, in the Social Media Mining for Health 2019 (SMM4H 2019) shared task [20], the top three results used language models for extracting adverse drug reactions (ADRs) from messages of the social network Twitter. The paper [21] uses the BERT model [22], additionally trained on a large unlabeled corpus of Twitter texts (∼1.5 M texts) containing 150 drug names mentioned in the training set. BERT encoded the feature vectors, and a Conditional Random Field (CRF) classified the entity types.
The paper [23] uses the domain-specific language model BioBERT [24]: the general BERT model additionally trained on thematic abstracts of PubMed scientific articles (∼4.5B words) and texts from PubMed Central (∼13.5B words). The feature space also included frequency and dictionary features. Conditional Random Field (CRF) was used to determine the entity type. The domain-specific language model made it possible to achieve an F1-strict metric of 46.4% in the extraction of adverse effects from the “tweets”.
The paper [25] used a combination of features from BERT, the GLoVe distributive model [26], and a character-representation vector formed with a biLSTM. The combined feature vector was fed into an LSTM with a size of 512. This approach improved the previous accuracy to 52.1%. It was also shown that eliminating the GLoVe vectors has almost no effect on the accuracy of the model, while character embeddings serve to efficiently encode the specific vocabulary not included in the basic BERT vocabulary.
The paper [27] uses the “biLSTM-CNN-Char” approach, similar to the previous one. The main differences are the use of a convolutional network to obtain character-level representation vectors, BioBERT as the language model, and the use of a biLSTM to process the combined feature vector. These differences led to a macro-averaged F1-strict metric of 76.73% on the SMM4H 2019 corpus. The paper also considers the application of this model to other English-language corpora: the ADE corpus [28], which contains sentences from PubMed articles and reflects relations between drugs and adverse effects; and the CADEC dataset [29], which contains drug reviews from the “askapatient” forum. The biLSTM-CNN-Char model achieved a macro-averaged F1-strict accuracy of 91.75% on the ADE corpus and 78.76% on the CADEC corpus.
The mentioned approaches owe their significantly higher accuracy to language models. For example, [30] uses a similar neural network based on biLSTM-CNN-Char, but without BioBERT or another Transformer-based pre-trained language model as the feature encoder, resulting in significantly lower named-entity recognition accuracy (macro-averaged F1-strict): 32.69%, 82.57% and 65.16% on the SMM4H 2019, ADE and CADEC corpora, respectively.
The paper [31] considers the learning procedure for the biLSTM model with word vectorization via word2vec. The proposed procedure includes pre-training on a large corpus of medical texts (a 2GB corpus of web pages categorized as “medical domain” by the Blekko search engine (https://en.wikipedia.org/wiki/Blekko, accessed on 6 January 2022)), DBPedia [32] knowledge graph vectorization [33], and gives state-of-the-art accuracy on the CADEC corpus (93.4% F1-strict for the ADE class), while, without the use of large corpora of thematic texts for model pre-training, the accuracy was 71.9%.
From these results, it can be concluded that achieving high accuracy in extracting pharmaceutically significant named entities relies on language models pre-trained on large corpora of thematic texts. This is shown for three datasets with different text types, including drug reviews (CADEC).

2.2. Relation Extraction

The task of extracting relations between named entities is usually considered as a classification problem: for each pair of entities $\langle e_{head}; e_{tail} \rangle$ and for each relation type $rt$, determine whether the relation $rt$ holds between $e_{head}$ and $e_{tail}$. The relation types are usually mutually exclusive; therefore, the task need not be solved as a multilabel classification.
More formally, the relation extraction task can be stated on the basis of the named-entity recognition formulation. Let $RT$ be the discrete space of relation types ($RT \subset \mathbb{Z}$); then:

$R_i = \{(e_{head}, e_{tail}, type)_m\}_{m=1}^{K_i \cdot K_i}, \quad e_{head}, e_{tail} \in E_i, \; type \in RT;$ (4)

where $R_i$ denotes the set of relations of text $i$; $e_{head}$ and $e_{tail}$ are entities from the set $E_i$ of text $i$ (defined in Section 2.1); $type$ is a relation type from the discrete space $RT$; and $K_i$ is the cardinality of the entity set $E_i$ of text $i$.
The relation extraction task is to build a mapping from the Cartesian product of the entity set $E_i$ with itself to the discrete space $RT$:

$E_i \times E_i \to RT.$ (5)
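For illustration, a minimal sketch of this reduction to pair classification follows; the data layout (entity identifiers and a dictionary of gold relations) is an assumption for the example.

```python
# Reduce RE to classification: every ordered pair of distinct entities is one
# example; pairs absent from the gold annotation get the "no relation" class 0.
from itertools import permutations

def make_re_examples(entities, gold_relations):
    # entities: list of entity ids; gold_relations: {(head, tail): type}
    return [(head, tail, gold_relations.get((head, tail), 0))
            for head, tail in permutations(entities, 2)]

examples = make_re_examples(["drug1", "adr1", "disease1"],
                            {("drug1", "adr1"): "ADR_Drugname"})
```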
Historically, the simplest solutions of this task in the biomedical domain were based on the co-occurrence frequency of entities [34,35] or on rules [35,36,37]. A significant disadvantage of such methods is their weak adaptability to a specific data type: in effect, these methods cannot be adjusted to the conditions of use. The next stage in the development of relation-extraction methods was classical machine learning: SVM and its modifications [28,38,39,40], the Naive Bayes classifier [41], and hidden Markov models [42,43], with a subsequent transition to neural networks: fully connected [44], recurrent [45,46], and convolutional [47,48]. The paper [49] provides a comprehensive overview of methods applicable to the task of named-entity and relation extraction, and drug–drug interaction in particular; many of them are based on linguistic features and convolutional and recurrent networks [49]. With the progress of computational capabilities and methodology, language models pre-trained on large corpora of thematic texts [50,51,52,53] also began to be used for this task.
In the modern scientific literature, two corpora are mentioned that contain social network texts annotated with relations between entities: TwiMed [54] and BEAR [55]. Both datasets contain texts from the social network Twitter.
The BEAR corpus consists of 2100 Twitter texts, contains ∼6000 entities and ∼3000 relations, and was announced in April 2022. At the time of writing, no research on this dataset had been published.
The TwiMed corpus initially included 1000 texts from Twitter and 1000 abstracts from the PubMed database. The TwiMed-Twitter part is annotated with the following relation types: reason for use (reason-to-use), positive outcome (outcome-positive), and negative outcome (outcome-negative), for a total of 1247 relations. The paper [56] explores various models for relation extraction on the TwiMed dataset; most are based on neural networks and are analyzed below with citations of the original sources. The paper [57] considers two LSTM-based models: biLSTM and attention-based biLSTM. The input features include vectors from an embedding layer without pre-training and the word’s relative position with respect to the considered pair of entities. An ensemble of the two models achieved the best accuracy in the original work: 75% on the TwiMed-Twitter corpus. The approach from [58] uses a four-layer convolutional network; its main differences from other works are embeddings pre-trained on a large corpus of thematic texts and additional positional encoding of words in a sentence. Its accuracy (micro-averaged F1-score) on the TwiMed-Twitter corpus is 76%. The approach from [59] also uses convolutional networks but expands the input space using five distributive vectorization models trained on different text collections, including thematic ones (PMC, PubMed, MedLine, Wikipedia); the extended feature space improves accuracy to 78%. The paper [56] uses biLSTM and two-layer self-attention with residual connections; adding the self-attention mechanism increases the accuracy of relation extraction from tweets to 80%.
Thus, the use of language models for the analysis of biomedical and pharmaceutical texts from social networks has not been adequately explored in the modern scientific literature.
Based on the study of recent articles, it can be concluded that it is effective (a) to pre-train data-encoding models on large corpora of thematic texts to expand the feature space, and (b) to use the attention and self-attention mechanisms that are the core of language models built on the Transformer topology.

2.3. Joint Solution of the Tasks

In recent years, a joint approach to solving both tasks has been developed (joint NER and RE) [8,60,61,62,63,64,65,66].
For example, the tasks of named-entity extraction and relation extraction were considered at the National NLP Clinical Challenge 2018 (n2c2 2018) [61] organized by Harvard Medical School. The best end-to-end solution was the model from the paper [67], based on a joint approach. The model consists of two biLSTM-CRF blocks: the first extracts “target” entities of the Drug class (included in all relations in the dataset), and the second predicts the other entity and the types of relations between the predicted and “target” entity. For each entity defined at the first stage, a copy of the text is created in which only related entities are marked. This approach achieves an end-to-end relation-extraction accuracy of 89% (micro-averaged F1-score) on the predicted entities.
The paper [68] uses a convolutional model to detect named entities and a fully connected network to determine whether a pair of entities is related. The features for each word are rule-, dictionary- and frequency-based. As a result, the joint approach achieves an accuracy of 63.9% on the ADE corpus, 2.8% better than the approach with a separate solution of the tasks using a similar model topology.
The paper [60] represents words as character-level vectors based on a convolutional network, processed by a stacked biLSTM with two blocks, for the NER and RE tasks, respectively. Both blocks are fitted in a single process. The input of the RE block is a complete enumeration of all possible combinations of predicted entities with the addition of syntactic features. This approach gives an accuracy of 71.4% (macro-averaged F1-score) on the ADE corpus. The authors analyzed the sequential operation of the models without the joint approach and concluded that the accuracy changes insignificantly on the ADE corpus.
The approach in paper [8] feeds features from the BERT language model into two blocks of fully connected and pooling layers, for the NER and RE solutions, respectively. The NER block classifies every possible sequence of 1 to 10 consecutive tokens. Predicted entities are filtered by a threshold on the network activations. The resulting entities are paired, and the RE block predicts a relation for every possible pair. This approach achieves an accuracy of 78.84% on the ADE dataset.
The paper [63] uses a similar model with the addition of part-of-speech features and the domain-oriented BioBERT language model. The use of a domain language model improves the accuracy on the ADE corpus by 3.19%, to 82.03%. This is the current state of the art on the ADE dataset for the joint solution of the NER and RE problems.
On the one hand, a joint approach provides a synergy between the solutions of the NER and RE tasks. On the other hand, it requires delicate tuning in order to solve both problems efficiently at the same time. Thus, a question of interest is a comparative analysis of the sequential and joint approaches to solving the NER and RE tasks.
The review of the related works shows that language models additionally trained on corpora of thematic texts are the most effective methods to determine relationships between pharmaceutically significant entities, both in professional texts and texts from social networks.

3. Materials and Methods

3.1. Data

3.1.1. Russian Drug Review Corpus (RDRS)

The current study is based on the Russian Drug Review Corpus (RDRS) dataset [69]. The dataset contains user reviews of medications from the Otzovik.ru site. Each review was annotated by pharmaceutical specialists in a cross-validation process (see details in the original paper [69]). The corpus includes several annotation types: (1) named-entity annotation; and (2) annotation of separate cases of medication use in each text (hereafter, each case of use is called a “context”).
Named entities are categorized into more than 20 types, which are grouped into three sets:
  • “Medication” describes drugs with the attributes: drug class, administration specifics, release form, etc.;
  • “Disease” describes disease and related mentions: symptoms, course of the disease, etc.;
  • “Adverse Drug Reaction” defines mentions of the adverse effects, described by the Internet users.
The NER annotation in this corpus contains complex cases such as overlapping entities (see examples a–d in Figure 1), entities with the same borders (see examples a and c in Figure 1), and discontinuous entities (see example c in Figure 1). These should be considered during the entity extraction process; however, the percentage of such cases is relatively small.
During the annotation process, each named entity was referred to one of the contexts, which distinguished the following cases in the same review: different cases of administration of the same medication, comparison of the different medications use, or different symptoms of the disease. Figure 2 shows an example of named entities and context annotation of the review.
The main RDRS corpus consists of 2800 review texts and their annotations, which were divided into 5 subsets of train and test examples (folds). This dataset is further referred to as “RDRS-2800”. Additionally, an extended version of the RDRS corpus (referred to as “RDRS-3800”) was considered, which includes 1000 more texts annotated according to the same principle; the point of the addition was to increase the number of unique medicines and diseases in the corpus. In order to evaluate the effect of expanding the corpus, these 1000 texts were added to the training part in each experiment during cross-validation; the test parts remained unchanged. In this paper, we use 4 types of relations between entities that are interesting from the practical point of view:
  • ADR–Drugname—adverse effect of the particular medication;
  • Drugname–SourceInfodrug—source of the information about medication (e.g., “my brother gave me advice”, “apothecary mentioned”);
  • Drugname–Diseasename—a link between the disease and medication that user administrated against it;
  • Diseasename–Indication—symptoms of the particular disease (e.g., “red rash”, “high temperature”).
Pairs of entities are considered related if they are within the same context (relation class 1), and unrelated otherwise (relation class 0) (see examples in the Table 1). The numerical characteristics of both versions of the corpus are presented in Table 2.
The experiments on both considered approaches include 5-fold cross-validation; see the detailed description in Section 4.1.

3.1.2. English-Language Corpora

According to the related works (see Section 2), the problem of named-entity recognition and relation extraction for the English language in the end-to-end formulation has been solved for a number of corpora, including the Adverse Drug Effect (ADE) corpus.
This corpus [28] consists of 2972 documents randomly selected from 30,000 PubMed article abstracts which were manually annotated by three annotators. The corpus contains three types of entities: Drug, Adverse Effect, Dosage. Annotators highlighted relations in sentences between drugs (Drug) and side effects (AdE); medicines (Drug) and dosages (Dosage). As a result, the corpus consists of 4272 sentences marked up for NER and RE under the Drug-AdE class, which is usually considered in articles [63].
Table 3 shows statistics on the number of texts, entities and relationships in the ADE corpus for each fold, sampled the same way as in [8].
Although the ADE corpus includes entities similar in meaning to those in the RDRS corpus (“side effects” and “drugs” from ADE; “Adverse Drug Reaction” and “Drugname” from RDRS), a direct comparison of the accuracy of recognizing these entities and extracting relations between them on the ADE and RDRS corpora is difficult due to the different styles of the texts used to create the corpora: ADE is based on sentences from abstracts of PubMed articles, whereas RDRS is based on drug reviews by Internet users. The use of the ADE corpus in the experiments allows us to evaluate the possibility of using multilingual language models as part of end-to-end solutions for analyzing texts with pharmaceutically significant entities in different languages.
The DDI corpus is similar in text type to the ADE corpus, so it is also chosen for study (see Section 4) in this paper. The use of DDI extends the number of texts for providing analyses of multilingual language models.
DDI 2013 [70] is a corpus of excerpts from scientific articles that describe interactions between drugs. The data sources are the DrugBank database [71] and medical notes from the MEDLINE database (https://www.nlm.nih.gov/medline/medline_overview.html, accessed on 6 January 2022) on the topic of drug interactions.
In this work, we used the DDI corpus version of the BLURB [50] benchmark, where a part of the original texts (90) was filtered out, and remaining texts were split into sentences, with each one considered as a separate text. Information about the composition of the corpus is presented in Table 4.

3.2. Sequential Approach

In the sequential approach, named-entity recognition is the first separate step of the end-to-end solution. During preprocessing, a high-level tokenizer splits the input text into separate words. A pre-trained language model represents words as real-valued vectors. Models of this type usually include a dictionary and a tokenizer that breaks words into frequent subwords (tokens), which can be whole words, word stems, suffixes or endings. In addition, the language-model tokenizer adds a special token “[CLS]” at the beginning of the text. After the text is split into N subtokens, they are represented as indices of the rows of the embedding matrix in the language model. Feeding the sequence of tokens into the language model, we obtain N real-valued vectors $T_{N,embSize1}$, where embSize1 is the dimension of the token-encoding vector. These vectors are then passed into the classification layers. The number of classification layers corresponds to the number of entity classes, and each layer has three outputs corresponding to the labels “B” (the beginning of an entity), “I” (the continuation of an entity), and “O” (the token does not belong to an entity). The output for each token is a vector of class labels formed by concatenating the classifier outputs:

$C_{N,3c} = \mathrm{Softmax}(\mathrm{FFNN}_1(T_{N,embSize1})) \oplus \dots \oplus \mathrm{Softmax}(\mathrm{FFNN}_c(T_{N,embSize1}))$ (6)

In Equation (6), $c$ is the number of entity classes; $\mathrm{FFNN}_1, \dots, \mathrm{FFNN}_c$ are fully connected layers (feed-forward neural networks), one per entity class; and $\oplus$ denotes the concatenation operation. Predicted entities are derived from these label vectors.
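For illustration, the following is a minimal PyTorch sketch of the classification block in Equation (6); the language-model encoder is omitted, and the class and layer names are illustrative rather than the exact implementation.

```python
import torch
import torch.nn as nn

class PerClassBIOHeads(nn.Module):
    """One 3-way (B/I/O) softmax head per entity class on top of token
    vectors T of shape (N, embSize1), as in Equation (6)."""
    def __init__(self, emb_size, n_entity_classes):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(emb_size, 3) for _ in range(n_entity_classes))

    def forward(self, token_vectors):            # (N, embSize1)
        outputs = [torch.softmax(head(token_vectors), dim=-1)
                   for head in self.heads]       # c tensors of shape (N, 3)
        return torch.cat(outputs, dim=-1)        # (N, 3c) label vector
```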
This approach was tested and proved successful on an early version of the RDRS corpus in our previous work [69]. The loss function for the neural network is the sum of the cross-entropy losses over all classification layers for each named-entity type. Predicted entities form pairs according to the possible combinations of classes in the considered relation types. The relation extraction task is then a text-classification task whose input text is formed from a pair of entities and additional features [72]. The input is formed as a concatenation of strings:

$P = \mathrm{[CLS]}\; S_1 \;\mathrm{[ESEP]}\; S_2 \;\mathrm{[TXTSEP]}\; T_N$ (7)

In Equation (7), “[CLS]”, “[ESEP]” and “[TXTSEP]” are special tokens; $S_1$ and $S_2$ are the texts of the extracted entities; and $T_N$ is the review text. The resulting string concatenation is also processed by the tokenizer and the neural-network language model, after which the vector of the first token, “[CLS]”, is passed into a classification layer with the Softmax activation function and a number of output neurons equal to the number of relation classes. At the second stage (classification of entity pairs), the model is trained on expert-annotated entities (gold named entities). The evaluation of the entire pipeline uses predicted entities in the relation-extraction task. A relation is considered correctly predicted if both the entities and the relation type are predicted correctly (for more details, see Section 4.1). The loss function for the relation-classification network is cross-entropy.
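A minimal sketch of the input construction in Equation (7) follows; note that, in practice, the literal special tokens “[ESEP]” and “[TXTSEP]” must be registered with the tokenizer as additional special tokens, and the tokenizer usually prepends “[CLS]” itself, so writing it explicitly here only mirrors the equation.

```python
# Build the relation-classification input of Equation (7) from a pair of
# extracted entity strings and the full review text.
def build_re_input(entity1, entity2, review_text):
    return f"[CLS] {entity1} [ESEP] {entity2} [TXTSEP] {review_text}"

pair_input = build_re_input(
    "Aspirin", "headache",
    "Aspirin helped me, the headache was gone in an hour.")
```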

3.3. Joint Approach

As in the previous approach, the input text is first split into words and then into subwords according to the dictionary of the language model. The first token of the sequence is the special token “[CLS]”; its corresponding vector is further used as a representation of the entire text. The resulting sequence of tokens is transformed by the language model into a sequence of vectors $T_{N,embSize1}$. Unlike the previous approach, tokens are not classified individually. Instead, all possible spans $S_M$, sequences of consecutive tokens, are formed from the tokens. The span length lies in the interval from 1 to a maximum length specified before training. The vector representation of a span is a concatenation of two vectors:

$s_{ij} = \mathrm{maxpool}(t_i, t_{i+1}, \dots, t_j) \oplus \mathrm{Embedding}(j - i), \quad i \le j, \; j - i + 1 \le max\_span\_length$ (8)

In Equation (8), $\mathrm{maxpool}(t_i, t_{i+1}, \dots, t_j)$ is the result of a maxpooling operation applied to the vectors of the tokens in the span (for each component of the resulting vector, the maximum of the corresponding components of the token vectors is selected), $\mathrm{Embedding}(j - i)$ is the embedding vector of the span length, and $i$ and $j$ are the indices of the first and last tokens of the span.
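For illustration, a PyTorch sketch of the span representation in Equation (8) follows; the embedding table size (a maximum span length of 10 and a length-embedding dimension of 25) is an assumption for the example.

```python
import torch
import torch.nn as nn

length_embedding = nn.Embedding(10, 25)  # assumed max_span_length and size

def span_vector(token_vecs, i, j):
    """Equation (8): maxpool over the span's token vectors, concatenated
    with an embedding of the span length (token_vecs: (N, embSize1))."""
    pooled = token_vecs[i:j + 1].max(dim=0).values   # (embSize1,)
    length = length_embedding(torch.tensor(j - i))   # (25,)
    return torch.cat([pooled, length])               # s_ij
```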
The span vector and the vector of the special token added to the beginning of the text are concatenated and passed into a fully connected classification layer, which determines whether the span belongs to one of the entity classes or is not an entity:

$C_{M,c+1} = \mathrm{Softmax}(\mathrm{FFNN}(s_{ij} \oplus t_0^{[CLS]}))$ (9)

In Equation (9), $c + 1$ is the number of classes including the “not an entity” class, and FFNN is a fully connected layer. For spans recognized as entities (assigned to any class), a pairing and classification procedure is performed. Unlike the previous approach, all possible pairs are considered, without taking span classes into account. Vector representations of pairs are formed as follows:

$p_{iji'j'} = s_{ij} \oplus \mathrm{maxpool}(t_{j+1}, \dots, t_{i'-1}) \oplus s_{i'j'}$ (10)

In Equation (10), $i$, $j$, $i'$, and $j'$ are the indices of the start and end tokens of the first and second span in the pair, and $\mathrm{maxpool}(t_{j+1}, \dots, t_{i'-1})$ is the local-context vector of the pair, obtained by applying the maxpooling operation to the vectors of the tokens between the considered entities. The vector $p_{iji'j'}$ is fed into a fully connected layer with sigmoid-activated output neurons. A pair of spans is considered related, with the relation type corresponding to the maximum activation value, if that value is above a predetermined threshold. The loss function for the entity-classification layer is categorical cross-entropy, and for the relation-classification layer it is binary cross-entropy.
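A matching sketch of the pair representation in Equation (10), under the same assumptions as the span sketch above:

```python
import torch

def pair_vector(token_vecs, span1, s1_vec, span2, s2_vec):
    """Equation (10): two span vectors around a maxpooled local-context
    vector of the tokens lying strictly between the spans."""
    (_, j), (i2, _) = span1, span2                   # spans as (start, end)
    between = token_vecs[j + 1:i2]
    context = (between.max(dim=0).values if between.size(0) > 0
               else torch.zeros(token_vecs.size(-1)))  # adjacent spans
    return torch.cat([s1_vec, context, s2_vec])

# The pair vector is then passed through a fully connected layer with
# sigmoid outputs; a relation is accepted if the maximum activation
# exceeds the predetermined threshold.
```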

3.4. Comparative Analysis of Both Approaches

The most significant differences between the selected end-to-end approaches for named-entity recognition and relation-extraction tasks are presented in Table 5.
The sequential approach has more functionality due to its ability to work with complex phrase entities (overlapping and discontinuous). However, such cases are quite rare; for example, the share of discontinuous mentions of the ADR tag is ∼2% in the RDRS corpus (see [69]). The joint approach is computationally simpler due to the need to train only one common model.

3.5. Language Models

Both described approaches use deep-language models of transformer topology. This paper considers several language models that have been shown to be effective for NLP tasks in Russian and English.
XLM-RoBERTa-large [73] (further, XLMR) is a language model with a transformer architecture similar to BERT [22]. Unlike BERT, which is trained on the unsupervised tasks of masked-token prediction and next-sentence prediction, RoBERTa is trained only on masked-token prediction, with a significantly bigger training dataset and different training parameters. The RoBERTa training set comprised 160 GB of texts, including: BookCorpus [74], English Wikipedia (https://en.wikipedia.org/wiki/, accessed on 6 January 2022), CC-News [75], OpenWebText (http://Skylion007.github.io/OpenWebTextCorpus, accessed on 6 January 2022), and Stories. The large version of XLM-RoBERTa has ∼550 M parameters.
XLM-RoBERTa-sag (further, XLMR-sag) is a version of the XLM-RoBERTa-large model adapted for pharmaceutical product reviews [69]. It was additionally trained on a thematic corpus (https://huggingface.co/sagteam/xlmroberta-large-sag, accessed on 6 January 2022) including two sets of texts: the first contains 250,000 drug reviews collected from the site irecommend.ru; the second was taken from the unannotated part of RuDReC [76].
RuBERT [77] is based on BERT-base: it uses multilingual BERT weights with a modified dictionary and embedding layer, and was trained on Russian-language news and Wikipedia texts.
RuDr-BERT and EnRuDr-BERT [76] are two language models based on the original multilingual BERT, additionally trained on the RuDReC corpus (1.4 M reviews of medications) [76]. EnRuDr-BERT was further trained on an English collection of consumer comments on drug administration [76].
RuBioRoBERTa [78] is a language model based on RoBERTa [79], additionally trained on the Russian-language Wikipedia; Taiga (https://tatianashavrina.github.io/taiga_site/, accessed on 6 January 2022); and Russian-language scientific articles from the CyberLeninka database (https://cyberleninka.ru/, accessed on 6 January 2022) in the categories of fundamental medicine, clinical medicine, health sciences, biotechnology, and medical technology. The training corpus included 338,000 texts (∼1.2 billion words).
BioLinkBERT [80] is a model based on BERT [22] that differs in the pre-training process: LinkBERT solves the masked-language modeling (MLM) task, as in the original BERT, and, in addition, a document-relation-prediction (DRP) task in the following statement: for two text segments, determine whether the source texts are related, by citation for scientific texts (PubMed, PMC) or by hyperlink for Wikipedia texts. The BioLinkBERT model was trained on abstracts of PubMed Central scientific articles (∼21 GB), similar to PubMedBERT [50]. In this work, the BioLinkBERT-large version of the model was used.
BioALBERT [51] is a model based on ALBERT [81], which is similar to BERT. The main features of this model are a reduced dimensionality of the input vector before encoding (factorized embedding representation) and shared weights for all layers of the transformer (cross-layer parameter sharing). The ALBERT model was trained on Wikipedia and BookCorpus texts. BioALBERT was additionally trained on thematic texts: PubMed abstracts, PubMed Central articles, and medical records from the MIMIC-III database. In this work, we used a larger version of BioALBERT, BioM-ALBERT-xxlarge [82].
Table 6 shows the sizes of the layers and the total number of parameters of the used language models.

4. Experiments and Results

4.1. Description of the Experiments

Experiments for both the basic (RDRS-2800) and extended (RDRS-3800) RDRS versions included:
  • Estimation of joint-approach accuracy for known effective language models: RuBERT, RuDr-BERT, EnRuDr-BERT, RuBioRoBERTa, XLMR and XLMR_sag. The selection of hyperparameters was carried out using the GridSearch method; the selection criterion was the validation loss, measured on a separate dataset;
  • The accuracy of both approaches was estimated with the XLMR and XLMR_sag language models, which showed the best results for the separate solutions of the NER and RE problems in our previous studies (see [69,83]). For the sequential use of the models, the following hyperparameters were used for both tasks: batch size 8, learning rate 2.5 × 10⁻⁶.
Experiments on the English-language DDI and ADE corpora aimed to estimate the accuracy of the joint approach on the target task. The most efficient combination of hyperparameters and language models was searched via the GridSearch procedure for BioLinkBERT, BioALBERT, and XLM-RoBERTa-sag, with validation loss as the optimization criterion.
We used the following configuration of computational resources: CPU Intel Xeon E5-2650v2, 8 cores; GPU NVIDIA V100 16 GB; RAM 128 GB. A batch size of 2 was used in every experiment on the joint approach due to GPU memory constraints. We used the following hyperparameter values (a sketch of the search loop is given after this list):
  • Initial learning rate: 1 × 10⁻⁴, 5 × 10⁻⁵, 2.5 × 10⁻⁵, 1 × 10⁻⁵, 7.5 × 10⁻⁶, 5 × 10⁻⁶, 2.5 × 10⁻⁶, 1 × 10⁻⁶, 7.5 × 10⁻⁷, 5 × 10⁻⁷;
  • Number of epochs: 10, 15, 20, 25, 30.
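A minimal sketch of the described grid search follows; train_and_validate is a hypothetical placeholder for one cross-validation training run returning the validation loss.

```python
import random
from itertools import product

def train_and_validate(lr, epochs, batch_size):
    """Hypothetical placeholder for one training run of the joint model;
    a real implementation would return the validation loss."""
    return random.random()  # stand-in value so the sketch is runnable

learning_rates = [1e-4, 5e-5, 2.5e-5, 1e-5, 7.5e-6,
                  5e-6, 2.5e-6, 1e-6, 7.5e-7, 5e-7]
epoch_counts = [10, 15, 20, 25, 30]

# Exhaustive search over the 50 combinations; keep the configuration with
# the lowest validation loss, as in the GridSearch procedure above.
results = {(lr, ep): train_and_validate(lr=lr, epochs=ep, batch_size=2)
           for lr, ep in product(learning_rates, epoch_counts)}
best_lr, best_epochs = min(results, key=results.get)
```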
As a result, both named entities and relations between them are automatically extracted; therefore, the accuracy of both task solutions was estimated at the same time. A correctly predicted named entity is a phrase with an exact match of boundaries and entity classes to the reference annotation of the dataset. The correct prediction of a relation is a pair of entities with boundaries and classes matched with the reference annotation, and correctly determined relationship type.
To assess the approach of the joint solution of the NER and RE problems, the F1-score metric with aggregation over relation types was used.
The RDRS corpus uses the following discrete spaces of entity and relation types:

$ET = \{ADR, DiseaseName, DrugName, Indication, SourceInfoDrug\}$ (11)

$RT = \{ADR\_DrugName, DrugName\_DiseaseName, DrugName\_SourceInfoDrug, DiseaseName\_Indication\} \times \{0, 1\}$ (12)

Let $c \in RT$ denote a class number from the discrete set of relation classes $RT$. Then:

$TP_{i,c} = R_{i,c}^{gold} \cap R_{i,c}^{pred} = (\{e_{head}, e_{tail}\}^{gold} = \{e_{head}, e_{tail}\}^{pred}) \wedge (type^{pred} = type^{gold} = c),$ (13)

$e^{gold} = e^{pred} : (s_{start}^{gold} = s_{start}^{pred}) \wedge (s_{end}^{gold} = s_{end}^{pred}) \wedge (tag^{gold} = tag^{pred});$ (14)

$TP_c = \sum_{i=1}^{l} TP_{i,c}$ (15)

$FP_{i,c} = R_{i,c}^{pred} \setminus R_{i,c}^{gold} = (type^{pred} = c) \wedge (type^{gold} \ne c)$ (16)

$FP_c = \sum_{i=1}^{l} FP_{i,c}$ (17)

$FN_{i,c} = R_{i,c}^{gold} \setminus R_{i,c}^{pred} = (type^{gold} = c) \wedge (type^{pred} \ne c)$ (18)

$FN_c = \sum_{i=1}^{l} FN_{i,c}$ (19)

$F1\text{-}score_c = \frac{2\,TP_c}{2\,TP_c + FP_c + FN_c}$ (20)

$F1\text{-}macro = \frac{1}{|RT|} \sum_{c=1}^{|RT|} F1\text{-}score_c$ (21)

Here, $i$ is the number of a text from $X_l$; $c$ is the number of a relation class from the discrete space $RT$; $R_{i,c}^{gold}$ is the set of ground-truth relations of type $c$ in text $i$; $R_{i,c}^{pred}$ is the set of predicted relations of type $c$ in text $i$; $TP_c$ is the number of correctly predicted relations of class $c$ (true positives); $FP_c$ is the number of relations predicted as class $c$ that have a different class (false positives); $FN_c$ is the number of relations of class $c$ predicted as a different class (false negatives); $F1\text{-}score_c$ is the metric for evaluating the model on relation type $c$; and the macro-averaged F1-score (F1-macro) is the aggregated metric over all relation classes.
Most studies published on the DDI corpus use the micro-averaged F1-score to aggregate the accuracy over separate relation classes [84], without matching the entity types involved in the relation:

$TP = \sum_{c=1}^{|RT|} \sum_{i=1}^{l} TP_{i,c}$ (22)

$FP = \sum_{c=1}^{|RT|} \sum_{i=1}^{l} FP_{i,c}$ (23)

$FN = \sum_{c=1}^{|RT|} \sum_{i=1}^{l} FN_{i,c}$ (24)

$F1\text{-}micro = \frac{2\,TP}{2\,TP + FP + FN}$ (25)

This aggregation focuses on the fraction of correctly predicted examples rather than on averaging per-class accuracy. Because this metric is used in the literature, we also used the micro-averaged F1-score (F1-micro) to evaluate the results on the DDI corpus.
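For illustration, both aggregations can be computed from per-class corpus-level counts as in the following sketch (the dictionary layout is an assumption for the example):

```python
# tp, fp, fn: dictionaries mapping each relation class c to its corpus-level
# TP_c, FP_c, FN_c counts (Equations (15), (17), (19)).
def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def f1_macro(tp, fp, fn):                      # Equation (21)
    classes = list(tp)
    return sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)

def f1_micro(tp, fp, fn):                      # Equation (25)
    return f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
```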

4.2. Results on the RDRS Dataset

The results of the accuracy estimation on different versions of the RDRS corpus for the considered language models as part of the joint approach are presented in Table 7.
The XLMR and XLMR-sag models provide the best accuracy.
Table 8 presents the best results for the named-entity relation extraction for the RDRS corpus on the basis of the joint and sequential approaches.
The NER model in the sequential approach has worse accuracy than the joint model (see Table 9). However, the final relation-extraction estimates of both approaches are comparable; the sequential approach has better accuracy in identifying relations among the predicted entities.
Therefore, the best accuracy in the automatic detection of related named entities in Russian-language Internet texts about medications, by the macro-averaged F1-score metric, is 52.1% for the base RDRS-2800 corpus and 53.5% for the RDRS-3800 corpus with the extended training sample.

4.3. Results on the English-Language Datasets

The joint approach was tested with state-of-the-art thematic English-language models on two English corpora: ADE and DDI.
Table 10 shows F 1 - m a c r o scores for the joint solution of the NER and RE tasks on the ADE dataset (subpart with ADE-Drug relations only).
Table 11 shows F 1 - m a c r o and F 1 - m i c r o scores for the joint solution of the NER and RE tasks on the DDI dataset.

5. Discussion

The analysis of the results on the RDRS corpus shows that both approaches demonstrate similar accuracy in solving the problem of named-entity relation extraction: 53.2–53.5%. The results are comparable because the accuracy was calculated as an average score over five cross-validation folds. The best accuracy is achieved with the domain-specific XLM-RoBERTa-large-sag language model in both approaches. The multilingual version of this language model without additional training on a large corpus of unlabeled examples shows results 1% worse. Extending the training part of the main corpus of 2800 texts by 1000 reviews increased the accuracy of recognizing named entities of the ADR type from 54.8% to 60.2% for the NER part of the sequential approach and from 63.8% to 65.4% for the joint model. The corpus extension also increased the accuracy of the end-to-end solution from 51.2% to 53.2% and from 51.8% to 53.5% for the sequential and joint approaches, respectively.
These results establish the state-of-the-art accuracy for the NER and RE tasks for the Russian language on texts of drug reviews. The results also show that both approaches have very similar overall accuracy, but have different weaknesses. The NER part of the sequential approach should be improved. Improvement in the joint approach should be carried out to handle difficult cases when recognizing named entities.
Further error analysis of the considered approaches (sequential and joint) on the RDRS-3800 dataset with the XLMR-sag model shows the following:
  • A common prediction error is wrong entity borders, resulting in formally incorrect relation extraction (for example, annotated entity: “oxolin”, predicted entity: “oxolin ointment”, where “ointment” was originally annotated as Drugform, not Drugname; or annotated entity of the ADR class: “vomit is a side effect”, predicted entity: “vomit”). It can be concluded that such cases are not meaningful errors leading to false information extraction;
  • There are 240 unique texts for which the sequential approach predicted all relations correctly and the joint approach did not, and 646 texts where the situation is reversed (the joint approach made no relation errors). This shows that the approaches are comparable, and neither is strictly better than the other;
  • The joint model generates more candidate relations than the sequential one (the number of false-positive relations is 51% of all relations in the test set, compared with a false-positive rate of 42% for the sequential approach).
Figure 3 shows an example of translated text of review and comparison of the models prediction with specific errors.
An analysis of the results on the English-language corpora shows that the use of modern language models of the ALBERT type as part of the joint approach increases the accuracy of both the separate named-entity recognition stage and the overall solution. The obtained results of 84.2% for ADE and 79.9% for DDI are the best for solving the target task within a single neural-network model and are currently comparable with the best accuracy of other solutions. A multilingual language model additionally trained on review texts by Russian-speaking Internet users also shows high accuracy as part of this approach. Therefore, it may be possible in the future to build a single model for the analysis of multilingual texts.

6. Conclusions

This work demonstrates the achievable accuracy of the joint solution of two subtasks, the recognition of named entities in texts and the extraction of relations between them, when analyzing Russian-language texts of drug reviews. The study shows that a general end-to-end solution can use either an approach based on sequentially solving the subtasks with separate neural-network models or an approach based on a single neural-network model. Both approaches demonstrate comparable overall accuracy (53.2% and 53.5%). These are the first published results of solving the end-to-end task for texts of this type in Russian. The scores obtained for recognizing named entities of several basic categories (77.3% on average) are also new results.
The hyperparameter-selection procedure and the study of modern language models made it possible to obtain the best results for the end-to-end extraction of related named entities on the English-language ADE and DDI corpora with the joint approach.
The main results of the work are:
  • We first obtained estimates of the accuracy of end-to-end information extraction solutions for drug reviews from Internet users in Russian, including the recognition of named entities and the extraction of relations.
  • We compared the sequential and joint approaches to the NER+RE task on the Russian Drug Review corpus and showed that, although the sequential approach overcomes some shortcomings of the used implementation of the joint approach (its inability to work with overlapping and discontinuous entities), both show very similar accuracy;
  • We improved the accuracy of the end-to-end (NER+RE) solution for the ADE corpus. For the DDI corpus, our accuracy estimates are comparable to SoTA solutions. This confirms the reliability of the methods used in the work.
Further work will be aimed at creating a unified model for the analysis of multilingual review texts as part of a system for monitoring user reactions on the Internet.

Author Contributions

Conceptualization, A.S. (Alexander Sboev) and R.R.; methodology, A.S. (Alexander Sboev) and I.M.; software, A.S. (Anton Selivanov), G.R., I.M. and A.G.; validation, A.N. and A.S. (Anton Selivanov); investigation, A.S. (Alexander Sboev), A.S. (Anton Selivanov), R.R., A.G. and I.M.; resources, A.S. (Alexander Sboev) and R.R.; data curation, A.S. (Alexander Sboev), S.S., A.G. and S.Z.; writing—original draft preparation, R.R., A.S. (Anton Selivanov), and A.S. (Alexander Sboev); writing—review and editing, A.S. (Alexander Sboev), R.R., A.G. and A.S. (Anton Selivanov); visualization, A.G., I.M. and G.R.; supervision, A.S. (Alexander Sboev); project administration, R.R.; funding acquisition, A.S. (Alexander Sboev). All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Russian Science Foundation grant No. 20-11-20246.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Trained models are presented on our team’s page in the Hugging Face repository: https://huggingface.co/sagteam (accessed on 6 January 2022). The code is available at https://github.com/sag111/Relation_Extraction (accessed on 6 January 2022). The RDRS corpora can be obtained by sending a request via the website of our project: https://sagteam.ru/en/med-corpus/ (accessed on 6 January 2022).

Acknowledgments

The study was supported by a grant from the Russian Science Foundation (project no. 20-11-20246). This work was carried out using computing resources of the federal collective usage center Complex for Simulation and Data Processing for Mega-science Facilities at NRC “Kurchatov Institute”, http://ckp.nrcki.ru/.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analysis, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
ADE: Adverse Events Dataset
AdE: Side-effect entity type in the ADE corpus
ADR: Adverse drug reaction
BERT: Bidirectional encoder representations from transformers
biLSTM: Bidirectional LSTM
CADEC: CSIRO Adverse Drug Event Corpus
CDR: The BioCreative V Chemical-Disease Relation Corpus
CNN: Convolutional neural network
CRF: Conditional random field
DDI: Drug–Drug Interaction 2013 Dataset
GLoVe: Global vector representation
KECI: Knowledge-enhanced collective inference model
LSTM: Long short-term memory recurrent neural network
MDPI: Multidisciplinary Digital Publishing Institute
MLP: Multilayer perceptron
n2c2: National NLP clinical challenge
NER: Named-entity recognition
RDRS: Russian Drug Review Corpus
RE: Relation extraction
RNN: Recurrent neural network
SMM4H: Social Media Mining for Health Challenge
w2v: word2vec vector representation
XLMR: XLM-RoBERTa-large
XLMR_sag: XLM-RoBERTa-sag-large

References

  1. Gydovskikh, D.V.; Moloshnikov, I.A.; Naumov, A.V.; Rybka, R.B.; Sboev, A.G.; Selivanov, A.A. A probabilistically entropic mechanism of topical clusterisation along with thematic annotation for evolution analysis of meaningful social information of internet sources. Lobachevskii J. Math. 2017, 38, 910–913.
  2. Naumov, A.; Rybka, R.; Sboev, A.; Selivanov, A.; Gryaznov, A. Neural-network method for determining text author’s sentiment to an aspect specified by the named entity. In Proceedings of the Russian Advances in Artificial Intelligence, Moscow, Russia, 10–16 October 2020; Number 2648 in CEUR Workshop Proceedings; pp. 134–143.
  3. Fields, S.; Cole, C.L.; Oei, C.; Chen, A.T. Using named entity recognition and network analysis to distinguish personal networks from the social milieu in nineteenth-century Ottoman–Iraqi personal diaries. Digit. Scholarsh. Humanit. 2022, fqac047.
  4. de Arruda, H.F.; Costa, L.d.F.; Amancio, D.R. Topic segmentation via community detection in complex networks. Chaos Interdiscip. J. Nonlinear Sci. 2016, 26, 063120.
  5. Selivanov, A.A.; Moloshnikov, I.A.; Rybka, R.B.; Sboev, A.G. Keyword Extraction Approach Based on Probabilistic-Entropy, Graph, and Neural Network Methods. In Proceedings of the Russian Conference on Artificial Intelligence, Moscow, Russia, 10–16 October 2020; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; Number 12412 in Lecture Notes in Computer Science; pp. 284–295.
  6. Tjong Kim Sang, E.F.; De Meulder, F. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Edmonton, AB, Canada, 31 May–1 June 2003; Association for Computational Linguistics: Stroudsburg, PA, USA, 2003; Volume 4, pp. 142–147.
  7. Liu, P.; Guo, Y.; Wang, F.; Li, G. Chinese named entity recognition: The state of the art. Neurocomputing 2022, 473, 37–53.
  8. Eberts, M.; Ulges, A. Span-Based Joint Entity and Relation Extraction with Transformer Pre-Training. In Proceedings of the European Conference on Artificial Intelligence, Digital, 29 August–8 September 2020; IOS Press: Amsterdam, The Netherlands, 2020; pp. 2006–2013.
  9. Liu, X.; Chen, H. AZDrugMiner: An information extraction system for mining patient-reported adverse drug events in online patient forums. In Proceedings of the International Conference on Smart Health, Beijing, China, 3–4 August 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 134–150.
  10. Sarker, A.; Gonzalez, G. Portable automatic text classification for adverse drug reaction detection via multi-corpus training. J. Biomed. Inform. 2015, 53, 196–207.
  11. Kiritchenko, S.; Mohammad, S.M.; Morin, J.; de Bruijn, B. NRC-Canada at SMM4H shared task: Classifying Tweets mentioning adverse drug reactions and medication intake. arXiv 2018, arXiv:1805.04558.
  12. Rastegar-Mojarad, M.; Elayavilli, R.K.; Yu, Y.; Liu, H. Detecting signals in noisy data-can ensemble classifiers help identify adverse drug reaction in tweets. In Proceedings of the Social Media Mining & Shared Task Workshop at the Pacific Symposium on Biocomputing, Kohala, HI, USA, 4–8 January 2016.
  13. Rajapaksha, P.; Weerasinghe, R. Identifying adverse drug reactions by analyzing Twitter messages. In Proceedings of the 2015 Fifteenth International Conference on Advances in ICT for Emerging Regions (ICTer), Colombo, Sri Lanka, 24–26 August 2015; pp. 37–42.
  14. Miranda, D.S. Automated detection of adverse drug reactions in the biomedical literature using convolutional neural networks and biomedical word embeddings. arXiv 2018, arXiv:1804.09148.
  15. Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; Dyer, C. Neural Architectures for Named Entity Recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 260–270.
  16. Cocos, A.; Fiks, A.G.; Masino, A.J. Deep learning for pharmacovigilance: Recurrent neural network architectures for labeling adverse drug reactions in Twitter posts. J. Am. Med. Inform. Assoc. 2017, 24, 813–821.
  17. Wen, X.; Zhou, C.; Tang, H.; Liang, L.; Jiang, Y.; Qi, H. Type-supervised sequence labeling based on the heterogeneous star graph for named entity recognition. arXiv 2022, arXiv:2210.10240.
  18. Ma, X.; Hovy, E. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; pp. 1064–1074.
  19. Chowdhury, S.; Zhang, C.; Yu, P.S. Multi-task pharmacovigilance mining from social media posts. In Proceedings of the 2018 World Wide Web Conference, Lyon, France, 23–27 April 2018; pp. 117–126.
  20. Weissenbacher, D.; Gonzalez, G. Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task. In Proceedings of the Fourth Workshop, Florence, Italy, 2 August 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019.
  20. Weissenbacher, D.; Gonzalez, G. Social Media Mining for Health Applications (# SMM4H) Workshop & Shared Task. In Proceedings of the Fourth Workshop, Florence, Italy, 2 August 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019. [Google Scholar]
  21. Chen, S.; Huang, Y.; Huang, X.; Qin, H.; Yan, J.; Tang, B. HITSZ-ICRC: A report for SMM4H shared task 2019-automatic classification and extraction of adverse effect mentions in tweets. In Proceedings of the Fourth Social Media Mining for Health Applications (# SMM4H) Workshop & Shared Task, Florence, Italy, 2 August 2019; pp. 47–51. [Google Scholar]
  22. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  23. Miftahutdinov, Z.; Alimova, I.; Tutubalina, E. KFU NLP team at SMM4H 2019 tasks: Want to extract adverse drugs reactions from tweets? BERT to the rescue. In Proceedings of the Fourth Social Media Mining for Health Applications (# SMM4H) Workshop & Shared Task, Florence, Italy, 2 August 2019; pp. 52–57. [Google Scholar]
  24. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2019, 36, 1234–1240. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  25. Aroyehun, S.T.; Gelbukh, A. Detection of adverse drug reaction in tweets using a combination of heterogeneous word embeddings. In Proceedings of the Fourth Social Media Mining for Health Applications (# SMM4H) Workshop & Shared Task, Florence, Italy, 2 August 2019; pp. 133–135. [Google Scholar]
  26. Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
  27. Haq, H.U.; Kocaman, V.; Talby, D. Mining adverse drug reactions from unstructured mediums at scale. arXiv 2022, arXiv:2201.01405. [Google Scholar]
  28. Gurulingappa, H.; Rajput, A.M.; Roberts, A.; Fluck, J.; Hofmann-Apitius, M.; Toldo, L. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. J. Biomed. Inform. 2012, 45, 885–892, Text Mining and Natural Language Processing in Pharmacogenomics. [Google Scholar] [CrossRef]
  29. Karimi, S.; Metke-Jimenez, A.; Kemp, M.; Wang, C. Cadec: A corpus of adverse drug event annotations. J. Biomed. Inform. 2015, 55, 73–81. [Google Scholar] [CrossRef] [PubMed]
  30. Ge, S.; Wu, F.; Wu, C.; Qi, T.; Huang, Y.; Xie, X. Fedner: Privacy-preserving medical named entity recognition with federated learning. arXiv 2020, arXiv:2003.09288. [Google Scholar]
  31. Stanovsky, G.; Gruhl, D.; Mendes, P. Recognizing mentions of adverse drug reaction in social media using knowledge-infused recurrent models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Valencia, Spain, 3–7 April 2017; pp. 142–151. [Google Scholar]
  32. Lehmann, J.; Isele, R.; Jakob, M.; Jentzsch, A.; Kontokostas, D.; Mendes, P.N.; Hellmann, S.; Morsey, M.; Van Kleef, P.; Auer, S.; et al. Dbpedia—A large-scale, multilingual knowledge base extracted from wikipedia. Semant. Web 2015, 6, 167–195. [Google Scholar] [CrossRef] [Green Version]
  33. Bordes, A.; Weston, J.; Collobert, R.; Bengio, Y. Learning structured embeddings of knowledge bases. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 7–11 August 2011. [Google Scholar]
  34. Ding, J.; Berleant, D.; Nettleton, D.; Wurtele, E. Mining MEDLINE: Abstracts, sentences, or phrases? In Biocomputing 2002; World Scientific: Singapore, 2001; pp. 326–337. [Google Scholar]
  35. Jelier, R.; Jenster, G.; Dorssers, L.C.; van der Eijk, C.C.; van Mulligen, E.M.; Mons, B.; Kors, J.A. Co-occurrence based meta-analysis of scientific texts: Retrieving biological relationships between genes. Bioinformatics 2005, 21, 2049–2058. [Google Scholar] [CrossRef]
  36. Ono, T.; Hishigaki, H.; Tanigami, A.; Takagi, T. Automated extraction of information on protein–protein interactions from the biological literature. Bioinformatics 2001, 17, 155–161. [Google Scholar] [CrossRef] [Green Version]
  37. Divoli, A.; Attwood, T.K. BioIE: Extracting informative sentences from the biomedical literature. Bioinformatics 2005, 21, 2138–2139. [Google Scholar] [CrossRef] [Green Version]
  38. Zhou, G.; Su, J.; Zhang, J.; Zhang, M. Exploring various knowledge in relation extraction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), Ann Arbor, MI, USA, 25–30 June 2005; pp. 427–434. [Google Scholar]
  39. Airola, A.; Pyysalo, S.; Björne, J.; Pahikkala, T.; Ginter, F.; Salakoski, T. All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning. BMC Bioinform. 2008, 9, S2. [Google Scholar] [CrossRef] [Green Version]
  40. Xu, J.; Wu, Y.; Zhang, Y.; Wang, J.; Lee, H.J.; Xu, H. CD-REST: A system for extracting chemical-induced disease relation in literature. Database 2016, 2016. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  41. Muzaffar, A.W.; Azam, F.; Qamar, U. A relation extraction framework for biomedical text using hybrid feature set. Comput. Math. Methods Med. 2015, 2015. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  42. Feldman, R.; Regev, Y.; Finkelstein-Landau, M.; Hurvitz, E.; Kogan, B. Mining biomedical literature using information extraction. Curr. Drug Discov. 2002, 2, 19–23. [Google Scholar]
  43. Skusa, A.; Rüegg, A.; Köhler, J. Extraction of biological interaction networks from scientific literature. Briefings Bioinform. 2005, 6, 263–276. [Google Scholar] [CrossRef] [Green Version]
  44. Rosario, B.; Hearst, M.A. Classifying semantic relations in bioscience texts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), Barcelona, Spain, 21–26 July 2004; pp. 430–437. [Google Scholar]
  45. Xu, Y.; Mou, L.; Li, G.; Chen, Y.; Peng, H.; Jin, Z. Classifying relations via long short term memory networks along shortest dependency paths. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1785–1794. [Google Scholar]
  46. Mehryary, F.; Björne, J.; Pyysalo, S.; Salakoski, T.; Ginter, F. Deep learning with minimal training data: TurkuNLP entry in the BioNLP shared task 2016. In Proceedings of the 4th BioNLP Shared Task Workshop, Berlin, Germany, 13 August 2016; pp. 73–81. [Google Scholar]
  47. Wang, L.; Cao, Z.; De Melo, G.; Liu, Z. Relation classification via multi-level attention cnns. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; pp. 1298–1307. [Google Scholar]
  48. Li, H.; Zhang, J.; Wang, J.; Lin, H.; Yang, Z. DUTIR in BioNLP-ST 2016: Utilizing convolutional network and distributed representation to extract complicate relations. In Proceedings of the 4th BioNLP Shared Task Workshop, Berlin, Germany, 13 August 2016; pp. 93–100. [Google Scholar]
  49. Zhang, T.; Leng, J.; Liu, Y. Deep learning for drug–drug interaction extraction from the literature: A review. Briefings Bioinform. 2020, 21, 1609–1627. [Google Scholar] [CrossRef]
  50. Gu, Y.; Tinn, R.; Cheng, H.; Lucas, M.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; Poon, H. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. HEALTH 2021, 3, 1–23. [Google Scholar] [CrossRef]
  51. Naseem, U.; Dunn, A.G.; Khushi, M.; Kim, J. Benchmarking for biomedical natural language processing tasks with a domain specific albert. BMC Bioinform. 2022, 23, 144. [Google Scholar] [CrossRef] [PubMed]
  52. Luo, L.; Lai, P.T.; Wei, C.H.; Arighi, C.N.; Lu, Z. BioRED: A rich biomedical relation extraction dataset. Briefings Bioinform. 2022, 23, bbac282. [Google Scholar] [CrossRef]
  53. Milošević, N.; Thielemann, W. Comparison of biomedical relationship extraction methods and models for knowledge graph creation. J. Web Semant. 2023, 75, 100756. [Google Scholar] [CrossRef]
  54. Alvaro, N.; Miyao, Y.; Collier, N. TwiMed: Twitter and PubMed comparable corpus of drugs, diseases, symptoms, and their relations. JMIR Public Health Surveill. 2017, 3, e24. [Google Scholar] [CrossRef] [Green Version]
  55. Wührl, A.; Klinger, R. Recovering Patient Journeys: A Corpus of Biomedical Entities and Relations on Twitter (BEAR). arXiv 2022, arXiv:2204.09952. [Google Scholar]
  56. Zhang, T.; Lin, H.; Ren, Y.; Yang, L.; Xu, B.; Yang, Z.; Wang, J.; Zhang, Y. Adverse drug reaction detection via a multihop self-attention mechanism. BMC Bioinform. 2019, 20, 479. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  57. Sahu, S.K.; Anand, A. Drug-drug interaction extraction from biomedical texts using long short-term memory network. J. Biomed. Inform. 2018, 86, 15–24. [Google Scholar] [CrossRef] [PubMed]
  58. Liu, S.; Tang, B.; Chen, Q.; Wang, X. Drug-drug interaction extraction via convolutional neural networks. Comput. Math. Methods Med. 2016, 2016, 6918381. [Google Scholar] [CrossRef] [Green Version]
  59. Quan, C.; Hua, L.; Sun, X.; Bai, W. Multichannel convolutional neural network for biological relation extraction. BioMed Res. Int. 2016, 2016, 1850404. [Google Scholar] [CrossRef] [Green Version]
  60. Li, F.; Zhang, M.; Fu, G.; Ji, D. A neural joint model for entity and relation extraction from biomedical text. BMC Bioinform. 2017, 18, 198. [Google Scholar] [CrossRef] [Green Version]
  61. Henry, S.; Buchan, K.; Filannino, M.; Stubbs, A.; Uzuner, O. 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records. J. Am. Med. Inform. Assoc. 2019, 27, 3–12. [Google Scholar] [CrossRef]
  62. Fang, X.; Song, Y.; Maeda, A. Joint Extraction of Clinical Entities and Relations Using Multi-head Selection Method. In Proceedings of the 2021 International Conference on Asian Language Processing (IALP), Singapore, 11–13 December 2021; pp. 99–104. [Google Scholar]
  63. Santosh, T.; Chakraborty, P.; Dutta, S.; Sanyal, D.K.; Das, P.P. Joint Entity and Relation Extraction from Scientific Documents: Role of Linguistic Information and Entity Types. In Proceedings of the Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents, Virtual, 30 September 2021. [Google Scholar]
  64. Zaikis, D.; Vlahavas, I. TP-DDI: Transformer-based pipeline for the extraction of Drug-Drug Interactions. Artif. Intell. Med. 2021, 119, 102153. [Google Scholar] [CrossRef]
  65. Fatehifar, M.; Karshenas, H. Drug-drug interaction extraction using a position and similarity fusion-based attention mechanism. J. Biomed. Inform. 2021, 115, 103707. [Google Scholar] [CrossRef]
  66. Wang, D.; Fan, H.; Liu, J. Drug-Drug Interaction Extraction via Attentive Capsule Network with an Improved Sliding-Margin Loss. In Proceedings of the International Conference on Database Systems for Advanced Applications, Taipei, Taiwan, 11–14 April 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 612–619. [Google Scholar]
  67. Xu, J.; Lee, H.J.; Ji, Z.; Wang, J.; Wei, Q.; Xu, H. UTH_CCB System for Adverse Drug Reaction Extraction from Drug Labels at TAC-ADR 2017. In Proceedings of the Text Analysis Conference (TAC), Gaithersburg, MA, USA, 13–14 November 2017. [Google Scholar]
  68. Li, F.; Zhang, Y.; Zhang, M.; Ji, D. Joint Models for Extracting Adverse Drug Events from Biomedical Text. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI), New York, NY, USA, 9–16 July 2016; Volume 2016, pp. 2838–2844. [Google Scholar]
  69. Sboev, A.; Sboeva, S.; Moloshnikov, I.; Gryaznov, A.; Rybka, R.; Naumov, A.; Selivanov, A.; Rylkov, G.; Ilyin, V. Analysis of the Full-Size Russian Corpus of Internet Drug Reviews with Complex NER Labeling Using Deep Learning Neural Networks and Language Models. Appl. Sci. 2022, 12, 491. [Google Scholar] [CrossRef]
  70. Herrero-Zazo, M.; Segura-Bedmar, I.; Martínez, P.; Declerck, T. The DDI corpus: An annotated corpus with pharmacological substances and drug–drug interactions. J. Biomed. Inform. 2013, 46, 914–920. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  71. Wishart, D.S.; Knox, C.; Guo, A.C.; Shrivastava, S.; Hassanali, M.; Stothard, P.; Chang, Z.; Woolsey, J. DrugBank: A comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 2006, 34, D668–D672. [Google Scholar] [CrossRef] [PubMed]
  72. Sboev, A.; Selivanov, A.; Moloshnikov, I.; Rybka, R.; Gryaznov, A.; Sboeva, S.; Rylkov, G. Extraction of the Relations among Significant Pharmacological Entities in Russian-Language Reviews of Internet Users on Medications. Big Data Cogn. Comput. 2022, 6, 10. [Google Scholar] [CrossRef]
  73. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised cross-lingual representation learning at scale. arXiv 2019, arXiv:1911.02116. [Google Scholar]
  74. Zhu, Y.; Kiros, R.; Zemel, R.; Salakhutdinov, R.; Urtasun, R.; Torralba, A.; Fidler, S. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
  75. Hamborg, F.; Meuschke, N.; Breitinger, C.; Gipp, B. news-please: A generic news crawler and extractor. In Proceedings of the 15th International Symposium of Information Science (ISI 2017), Berlin, Germany, 13–15 March 2017; pp. 218–223. [Google Scholar]
  76. Tutubalina, E.; Alimova, I.; Miftahutdinov, Z.; Sakhovskiy, A.; Malykh, V.; Nikolenko, S. The Russian Drug Reaction Corpus and neural models for drug reactions and effectiveness detection in user reviews. Bioinformatics 2020, 37, 243–249. [Google Scholar] [CrossRef]
  77. Kuratov, Y.; Arkhipov, M. Adaptation of deep bidirectional multilingual transformers for Russian language. In Proceedings of the Komp’juternaja Lingvistika i Intellektual’nye Tehnologii, Moscow, Russia, 29 May–1 June 2019; pp. 333–339. [Google Scholar]
  78. Yalunin, A.; Nesterov, A.; Umerenkov, D. RuBioRoBERTa: A pre-trained biomedical language model for Russian language biomedical text mining. arXiv 2022, arXiv:2204.03951. [Google Scholar]
  79. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  80. Yasunaga, M.; Leskovec, J.; Liang, P. LinkBERT: Pretraining Language Models with Document Links. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 8003–8016. [Google Scholar]
  81. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  82. Alrowili, S.; Shanker, V. BioM-Transformers: Building Large Biomedical Language Models with BERT, ALBERT and ELECTRA. In Proceedings of the 20th Workshop on Biomedical Language Processing, Online, 11 June 20221; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 221–227. [Google Scholar]
  83. Selivanov, A.; Gryaznov, A.; Rybka, R.; Sboev, A.; Sboeva, S.; Klyueva, Y. Relation Extraction from Texts Containing Pharmacologically Significant Information on base of Multilingual Language Models [in press]. In Proceedings of the 6th International Workshop on Deep Learning in Computational Physics (DLCP-2022), Dubna, Russia, 6–8 July 2022; Sissa Medialab Srl.: Trieste, Italy, 2022. [Google Scholar]
  84. Segura-Bedmar, I.; Martínez, P.; Herrero-Zazo, M. SemEval-2013 Task 9: Extraction of Drug-Drug Interactions from Biomedical Texts (DDIExtraction 2013). In Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, GA, USA, 14–15 June 2013; Association for Computational Linguistics: Stroudsburg, PA, USA, 2013; pp. 341–350. [Google Scholar]
  85. Lai, T.; Ji, H.; Zhai, C.X.; Tran, Q.H. Joint biomedical entity and relation extraction with knowledge-enhanced collective inference. In Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL-IJCNLP 2021, Bangkok, Thailand, 1 August 2021; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2021; pp. 6248–6260. [Google Scholar]
  86. Luo, L.; Yang, Z.; Cao, M.; Wang, L.; Zhang, Y.; Lin, H. A neural network-based joint learning approach for biomedical entity and relation extraction from biomedical literature. J. Biomed. Inform. 2020, 103, 103384. [Google Scholar] [CrossRef]
Figure 1. Examples of complex cases in annotation. (a) “Then something or someone advised me to still put at least Pentaxim, which seems to be an inactivated vaccine and a minimum of consequences.”; (b) “But one fact—in the side effects it is written that a fatal outcome is possible”; (c) “For such a hypertensive patient with experience like me, Barboval advised my attending physician to take it at night. Insomnia began to torment me recently”; (d) “By simply applying a sponge to the wound, the blood began to be absorbed, and after a few minutes the bleeding stopped as before our eyes”.
Figure 2. An example of an annotated review. The numbers in the left part of the annotations indicate the contexts to which the entities belong. In this particular example, there are 9 positive and 6 negative Drugname–SourceInfodrug relations, 2 positive and 1 negative Diseasename–Indications relations, 3 positive Drugname–Diseasename relations, and 3 negative ADR–Drugname relations.
Figure 3. An example of predictions of both approaches with XLMR-sag on a translated part of text #517 from the RDRS dataset (models are fine-tuned on the extended version of the dataset). Blue boxes correspond to the sequential approach, green to the joint approach, and yellow to both. Ex. 1: both approaches correctly predict the adverse-effect relation between “Grammidin Neo pills with anesthetic” and “sore throat”. Ex. 2: the joint model correctly predicts the information source, leading to a correct relation prediction, whereas the sequential approach does not predict the SourceInfoDrug entity “after watching the advertisements”. Ex. 3: the joint model includes extra words in the Indication entity (“throat still hurts” instead of “throat …hurts”), while the sequential model predicts two entities in place of one (“throat” and “hurts” separately), resulting in two incorrect predicted relations. Ex. 4: the sequential model misreads the context, producing many false-positive relations: “catch a cold” describes a common situation rather than a specific disease of the review author.
Table 1. Examples of entity relationships based on context annotation.

| Named Entity (Contexts) | Named Entity (Contexts) | Relation Type |
| --- | --- | --- |
| Alcon Torbex (1,3) | doctor (1) | Drugname-SourceInfodrug_1 |
| Torbex (1,3) | doctor (1) | Drugname-SourceInfodrug_1 |
| Torbex (1,3) | doctor (1) | Drugname-SourceInfodrug_1 |
| Alcon Torbex (1,3) | wrote out (1) | Drugname-SourceInfodrug_1 |
| Torbex (1,3) | wrote out (1) | Drugname-SourceInfodrug_1 |
| Torbex (1,3) | wrote out (1) | Drugname-SourceInfodrug_1 |
| Alcon Torbex (1,3) | maternity hospital (3) | Drugname-SourceInfodrug_1 |
| Torbex (1,3) | maternity hospital (3) | Drugname-SourceInfodrug_1 |
| Torbex (1,3) | maternity hospital (3) | Drugname-SourceInfodrug_1 |
| conjuctivitis (1,2) | my eyes hurt (1,2) | Diseasename-Indications_1 |
| conjuctivitis (1,2) | couldn’t open them (1,2) | Diseasename-Indications_1 |
| Alcon Torbex (1,3) | conjuctivitis (1,2) | Drugname-Diseasename_1 |
| Torbex (1,3) | conjuctivitis (1,2) | Drugname-Diseasename_1 |
| Torbex (1,3) | conjuctivitis (1,2) | Drugname-Diseasename_1 |
| red spots appeared under my eyes (2) | Alcon Torbex (1,3) | ADR-Drugname_0 |
| red spots appeared under my eyes (2) | Torbex (1,3) | ADR-Drugname_0 |
| red spots appeared under my eyes (2) | Torbex (1,3) | ADR-Drugname_0 |
| Alcon Torbex (1,3) | ophtalmologist (2) | Drugname-SourceInfodrug_0 |
| Torbex (1,3) | ophtalmologist (2) | Drugname-SourceInfodrug_0 |
| Torbex (1,3) | ophtalmologist (2) | Drugname-SourceInfodrug_0 |
| Alcon Torbex (1,3) | wrote out (2) | Drugname-SourceInfodrug_0 |
| Torbex (1,3) | wrote out (2) | Drugname-SourceInfodrug_0 |
| Torbex (1,3) | wrote out (2) | Drugname-SourceInfodrug_0 |
| conjuctivitis (1,2) | eyes start to turn sour (3) | Diseasename-Indications_0 |
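
The relation labels in Table 1 follow directly from the context annotation: a pair of entities receives a positive label (suffix _1) when their context sets share at least one context, and a negative label (suffix _0) otherwise. A minimal sketch of this rule, using entities and contexts from the table above:

```python
# A sketch of deriving relation labels from the context annotation in
# Table 1: a pair is positive iff the two entities share a context.
entities = {
    "Alcon Torbex": {1, 3},    # Drugname
    "doctor": {1},             # SourceInfodrug
    "ophtalmologist": {2},     # SourceInfodrug (spelling as in the review)
}

def relation_label(contexts_a: set, contexts_b: set) -> int:
    """Return 1 if the two entities share at least one context, else 0."""
    return int(bool(contexts_a & contexts_b))

print(relation_label(entities["Alcon Torbex"], entities["doctor"]))          # 1
print(relation_label(entities["Alcon Torbex"], entities["ophtalmologist"]))  # 0
```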
Table 2. Statistics on the main and extended corpora. “Total in main” is the sum over the five folds, excluding the extended part of the RDRS dataset. For the relation rows, the number of related entity pairs is given, with the number of unrelated pairs in brackets.

| Value | Total in Main | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Extension |
| --- | --- | --- | --- | --- | --- | --- | --- |
| text number | 2798 | 559 | 559 | 560 | 560 | 560 | 1023 |
| avg text length (in words) | 157 | 157 | 159 | 156 | 156 | 156 | 176 |
| ADR number | 1809 | 380 | 390 | 329 | 382 | 328 | 3241 |
| Drugname number | 8467 | 1695 | 1664 | 1736 | 1686 | 1686 | 3345 |
| SourceInfodrug number | 2595 | 530 | 523 | 508 | 519 | 515 | 1167 |
| Diseasename number | 4026 | 801 | 769 | 760 | 842 | 854 | 908 |
| Indication number | 4843 | 1014 | 982 | 965 | 936 | 946 | 2613 |
| ADR_Drugname relations number | 4274 (910) | 844 (166) | 832 (168) | 812 (210) | 1004 (212) | 782 (154) | 8443 (2288) |
| Drugname_Disease relations number | 11135 (2108) | 2168 (383) | 2107 (361) | 2138 (429) | 2311 (541) | 2411 (394) | 3012 (510) |
| Drugname_Info relations number | 7056 (1279) | 1481 (297) | 1404 (229) | 1381 (316) | 1382 (225) | 1408 (212) | 3251 (927) |
| Disease_Indication relations number | 7093 (742) | 1469 (144) | 1223 (177) | 1443 (126) | 1446 (136) | 1512 (159) | 2117 (572) |
Table 3. ADE dataset information.

| Number of | Total | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Fold 6 | Fold 7 | Fold 8 | Fold 9 | Fold 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sentences | 4272 | 427 | 427 | 427 | 427 | 427 | 427 | 427 | 427 | 427 | 429 |
| entities | 10,839 | 1070 | 1126 | 1091 | 1068 | 1054 | 1126 | 1071 | 1085 | 1079 | 1069 |
| AdE | 5776 | 576 | 616 | 574 | 564 | 561 | 607 | 552 | 577 | 582 | 567 |
| Drug | 5063 | 494 | 510 | 517 | 504 | 493 | 519 | 519 | 508 | 497 | 502 |
| Drug-AdE | 6821 | 666 | 724 | 688 | 657 | 648 | 732 | 666 | 704 | 688 | 648 |
Table 4. DDI dataset information (BLURB version).

| Value | Total | Train | Valid | Test |
| --- | --- | --- | --- | --- |
| sentence number | 8920 | 6970 | 642 | 1308 |
| avg text length (in words) | 20 | 22 | 24 | 14 |
| Total entity number | 19,179 | 14,677 | 1438 | 3064 |
| Drug entity number | 12,123 | 9364 | 883 | 1876 |
| Group entity number | 4373 | 3375 | 330 | 668 |
| Brand entity number | 1906 | 1435 | 102 | 369 |
| Drug_n entity number | 777 | 503 | 123 | 151 |
| Total relation number | 11,811 | 3941 | 2109 | 5761 |
| Mechanism relations number | 1755 | 1319 | 134 | 302 |
| Effect relations number | 2141 | 1608 | 173 | 360 |
| Int relations number | 295 | 188 | 11 | 96 |
| Advise relations number | 1134 | 826 | 87 | 221 |
| Generated_false relations number | 28,568 | 22,082 | 1704 | 4782 |
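
The “Generated_false” row counts negative relation candidates. A common construction for DDI-style corpora, and presumably the one reflected here, is to enumerate all pairs of entity mentions that co-occur in a sentence and treat every pair not annotated as interacting as a negative example. A minimal sketch under that assumption, with made-up drug names for illustration:

```python
from itertools import combinations

# A sketch of generating negative relation candidates for one sentence:
# every unordered pair of entity mentions that is not annotated as a
# positive interaction becomes a negative ("generated false") candidate.
entities = ["aspirin", "warfarin", "ibuprofen"]        # mentions in one sentence
positive_pairs = {frozenset(("aspirin", "warfarin"))}  # annotated interactions

all_pairs = {frozenset(pair) for pair in combinations(entities, 2)}
negative_pairs = all_pairs - positive_pairs
print(sorted(tuple(sorted(pair)) for pair in negative_pairs))
# [('aspirin', 'ibuprofen'), ('ibuprofen', 'warfarin')]
```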
Table 5. The differences between the sequential and joint approaches.

| Difference Feature | Sequential Approach | Joint Approach |
| --- | --- | --- |
| NER solver | Token classification | Span classification |
| Overlapping entities | Yes | No |
| Discontinuous entities | Yes | No |
| Number of models | Model for each step | One model |
| Loss function | NER loss, RE loss | Joint NER-RE loss |
| Training process | One for each step | One for joint stage |
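
The “span classification” NER solver of the joint approach can be illustrated as follows: instead of assigning a tag to every token, all candidate spans up to a maximal width are enumerated and each span is classified as an entity type or as “no entity”. Below is a minimal PyTorch sketch of the span-enumeration step only; the layer sizes and the max-pooling choice are illustrative, not the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

# A minimal sketch of span-based NER: every span up to `max_width`
# tokens is pooled into a single vector and classified into one of the
# entity types or "none". Sizes here are illustrative.
hidden_size, num_types, max_width = 768, 6, 8   # 5 entity types + "none"
token_embeddings = torch.randn(20, hidden_size)  # contextual embeddings, 20 tokens

classifier = nn.Linear(hidden_size, num_types)

spans, logits = [], []
for start in range(token_embeddings.size(0)):
    for width in range(1, max_width + 1):
        end = start + width
        if end > token_embeddings.size(0):
            break
        span_repr = token_embeddings[start:end].max(dim=0).values  # max-pooling
        spans.append((start, end))
        logits.append(classifier(span_repr))

predictions = torch.stack(logits).argmax(dim=-1)  # one label per candidate span
print(len(spans), predictions.shape)
```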
Table 6. Layer sizes and numbers of parameters of the language models used. H # denotes the number of hidden layers, HLS the hidden layer size, ILS the intermediate layer size between transformer blocks, AH # the number of attention heads in a transformer block, vocab the size of the language model vocabulary, and params the approximate number of parameters (weights) of the model with unique values. * BioALBERT has a lower number of unique parameters because hidden layers share the same weights in the ALBERT architecture; it has more neurons, and therefore more weights, but many of them share the same values.

| Language Model | H # | HLS | ILS | AH # | Vocab | Params |
| --- | --- | --- | --- | --- | --- | --- |
| RuBERT | 12 | 768 | 3072 | 12 | 119,547 | 178 M |
| RuDr-BERT | 12 | 768 | 3072 | 12 | 119,547 | 178 M |
| EnRuDr-BERT | 12 | 768 | 3072 | 12 | 119,547 | 178 M |
| BioALBERT * | 12 | 4096 | 16,384 | 64 | 30,000 | 222 M * |
| BioLinkBERT | 24 | 1024 | 4096 | 16 | 28,895 | 333 M |
| RuBioRoBERTa | 24 | 1024 | 4096 | 16 | 50,265 | 355 M |
| XLMR | 24 | 1024 | 4096 | 16 | 250,002 | 559 M |
| XLMR-sag | 24 | 1024 | 4096 | 16 | 250,002 | 559 M |
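
The architecture columns of Table 6 can be checked against the model configurations published on the Hugging Face hub. A short sketch for XLM-RoBERTa-large using the `transformers` library:

```python
from transformers import AutoConfig, AutoModel

# Verifying the layer sizes in Table 6 for XLM-RoBERTa-large from its
# published configuration on the Hugging Face hub.
config = AutoConfig.from_pretrained("xlm-roberta-large")
print(config.num_hidden_layers,    # H #   -> 24
      config.hidden_size,          # HLS   -> 1024
      config.intermediate_size,    # ILS   -> 4096
      config.num_attention_heads,  # AH #  -> 16
      config.vocab_size)           # vocab -> 250002

# Counting the parameters of the encoder (approximately the 559 M of Table 6).
model = AutoModel.from_pretrained("xlm-roberta-large")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f} M parameters")
```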
Table 7. Accuracy on the RDRS dataset (joint approach). ADR-Drug—ADR-Drugname, Drug-Dis—Drugname-Diseasename, Drug-Info—Drugname-SourceInfoDrug, Dis-Ind—Disease-Indication.

| Corpus | Model | ADR-Drug | Drug-Dis | Drug-Info | Dis-Ind | F1-macro |
| --- | --- | --- | --- | --- | --- | --- |
| RDRS-2800 | RuBERT | 41.6 | 63.1 | 45.3 | 34.2 | 46.0 |
| RDRS-2800 | RuDr-BERT | 42.2 | 64.0 | 44.5 | 33.9 | 46.1 |
| RDRS-2800 | EnRuDr-BERT | 41.7 | 62.7 | 44.1 | 33.0 | 45.4 |
| RDRS-2800 | RuBioRoBERTa | 41.9 | 63.1 | 43.5 | 33.5 | 45.5 |
| RDRS-2800 | XLMR | 51.2 | 69.3 | 49.2 | 38.6 | 52.1 |
| RDRS-2800 | XLMR-sag | 51.0 | 68.3 | 49.1 | 38.9 | 51.8 |
| RDRS-3800 | RuBERT | 44.0 | 64.7 | 44.7 | 34.0 | 46.8 |
| RDRS-3800 | RuDr-BERT | 45.2 | 64.2 | 44.6 | 34.0 | 47.0 |
| RDRS-3800 | EnRuDr-BERT | 46.2 | 64.1 | 45.9 | 35.1 | 47.8 |
| RDRS-3800 | RuBioRoBERTa | 41.2 | 62.1 | 43.3 | 33.8 | 45.1 |
| RDRS-3800 | XLMR | 52.5 | 69.6 | 49.3 | 39.0 | 52.6 |
| RDRS-3800 | XLMR-sag | 55.6 | 69.8 | 49.6 | 39.1 | 53.5 |
Table 8. Accuracy on the extended version of the RDRS dataset using joint (J.) and sequential (S.) approaches (A.). XLMR—XLM-RoBERTa-large, XLMR-sag—XLM-RoBERTa-sag, ADR-Drug—ADR-Drugname, Drug-Dis—Drugname-Diseasename, Drug-Info—Drugname-SourceInfoDrug, Dis-Ind—Disease-Indication.

| Corpus | A. | Model | ADR-Drug | Drug-Dis | Drug-Info | Dis-Ind | F1-macro |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RDRS-2800 | J. | XLMR | 51.2 | 69.4 | 49.2 | 38.6 | 52.1 |
| RDRS-2800 | J. | XLMR-sag | 51.1 | 68.3 | 49.0 | 38.9 | 51.8 |
| RDRS-2800 | S. | XLMR | 46.1 | 69.2 | 45.1 | 32.2 | 48.1 |
| RDRS-2800 | S. | XLMR-sag | 49.4 | 70.4 | 48.3 | 36.7 | 51.2 |
| RDRS-3800 | J. | XLMR | 52.5 | 69.6 | 49.3 | 39.0 | 52.6 |
| RDRS-3800 | J. | XLMR-sag | 55.5 | 69.8 | 49.6 | 39.1 | 53.5 |
| RDRS-3800 | S. | XLMR | 54.7 | 71.1 | 49.8 | 34.6 | 52.6 |
| RDRS-3800 | S. | XLMR-sag | 55.7 | 71.4 | 49.7 | 35.8 | 53.2 |
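
In Tables 7 and 8, F1-macro is the unweighted mean of the four per-relation-type F1 scores. For example, for the joint XLMR-sag run on RDRS-3800 in Table 8:

```python
# The macro-averaging behind the F1-macro columns: the unweighted mean
# of the per-relation-type F1 scores (values from the joint XLMR-sag
# row on RDRS-3800 in Table 8).
per_type_f1 = {
    "ADR-Drugname": 55.5,
    "Drugname-Diseasename": 69.8,
    "Drugname-SourceInfoDrug": 49.6,
    "Disease-Indication": 39.1,
}
f1_macro = sum(per_type_f1.values()) / len(per_type_f1)
print(f"{f1_macro:.1f}")  # 53.5
```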
Table 9. Named-entity recognition accuracy on the RDRS dataset using joint (J.) and sequential (S.) approaches (A.). XLMR—XLM-RoBERTa-large, XLMR-sag—XLM-RoBERTa-sag.

| Corpus | A. | Model | ADR | Drug | Disease | Info | Indication | F1-macro |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RDRS-2800 | J. | XLMR | 64.8 | 95.7 | 89.4 | 62.5 | 72.9 | 77.1 |
| RDRS-2800 | J. | XLMR-sag | 63.8 | 96.0 | 89.7 | 63.3 | 73.2 | 77.2 |
| RDRS-2800 | S. | XLMR | 49.6 | 95.1 | 87.7 | 55.6 | 64.7 | 70.5 |
| RDRS-2800 | S. | XLMR-sag | 54.7 | 95.3 | 88.3 | 60.0 | 67.2 | 73.1 |
| RDRS-3800 | J. | XLMR | 64.4 | 96.0 | 89.5 | 63.1 | 72.5 | 77.1 |
| RDRS-3800 | J. | XLMR-sag | 65.4 | 96.3 | 89.9 | 63.5 | 72.8 | 77.6 |
| RDRS-3800 | S. | XLMR | 60.1 | 95.7 | 89.2 | 60.8 | 68.9 | 74.9 |
| RDRS-3800 | S. | XLMR-sag | 60.1 | 95.7 | 89.0 | 60.3 | 69.0 | 74.8 |
Table 10. Accuracy of the NER and RE tasks on the ADE dataset (joint approach).

| Source | Model | NER + RE F1-macro | NER F1-macro |
| --- | --- | --- | --- |
| ours | Joint (BioLinkBERT) | 80.5 | 90.4 |
| ours | Joint (XLMR-sag) | 80.6 | 90.4 |
| ours | Joint (XLMR) | 81.9 | 91.0 |
| ours | Joint (BioALBERT) | 84.2 | 92.1 |
| [63] | SpERT (BioBERT) | 82.0 | 91.2 |
| [85] | KECI | 81.7 | 90.7 |
| [8] | SpERT (BERT) | 78.9 | 89.3 |
| [60] | Stacked biLSTM | 71.4 | 84.6 |
Table 11. Accuracy of the NER and RE tasks on the DDI dataset (joint approach).

| Source | Model | NER + RE F1-micro | NER + RE F1-macro | NER F1-macro |
| --- | --- | --- | --- | --- |
| ours | Joint (BioLinkBERT) | 74.2 | 59.3 | 75.9 |
| ours | Joint (XLMR) | 73.4 | 63.5 | 78.0 |
| ours | Joint (XLMR-sag) | 77.3 | 68.4 | 78.7 |
| ours | Joint (BioALBERT) | 79.9 | 73.7 | 81.8 |
| [64] | TP-DDI | 82.4 | - | - |
| [86] | Att-BiLSTM-CRF | 75.1 | - | - |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
