Big Data and Cognitive Computing
  • Article
  • Open Access

17 January 2022

Extraction of the Relations among Significant Pharmacological Entities in Russian-Language Reviews of Internet Users on Medications

1 National Research Centre “Kurchatov Institute”, 123182 Moscow, Russia
2 Moscow Engineering Physics Institute, National Research Nuclear University, 115409 Moscow, Russia
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Knowledge Modelling and Learning through Cognitive Networks

Abstract

Nowadays, the analysis of digital media aimed at predicting society’s reaction to particular events and processes is a task of great significance. Internet sources contain a large amount of meaningful information for a set of domains, such as marketing, author profiling, social situation analysis, healthcare, etc. In the case of healthcare, this information is useful for pharmacovigilance purposes, including the re-profiling of medications. The analysis of the mentioned sources requires the development of automatic natural language processing methods. These methods, in turn, require text datasets with complex annotation including information about named entities and the relations between them. As the analysis of the relevant literature shows, there is a scarcity of datasets in the Russian language with annotated entity relations, and none have existed so far in the medical domain. This paper presents the first Russian-language textual corpus in the medical domain where entities have labels of different contexts within a single text, so that related entities share a common context; the corpus is therefore suitable for the relation extraction task. Our second contribution is a method for the automated extraction of entity relations in Russian-language texts using the XLM-RoBERTa language model preliminarily trained on Russian drug review texts. A comparison with other machine learning methods is performed to estimate the efficiency of the proposed method. The method yields state-of-the-art accuracy in extracting the following relationship types: ADR–Drugname, Drugname–Diseasename, Drugname–SourceInfoDrug, Diseasename–Indication. As shown on the presented subcorpus of the Russian Drug Review Corpus, the developed method achieves a mean F1-score of 80.4% (estimated with cross-validation, averaged over the four relationship types). This result is 3.6% higher than that of the existing language model RuBERT, and 21.77% higher than that of basic ML classifiers.

1. Introduction

The developing ecosystem of social networks and other specialized Internet platforms expands the possibility of discussing a broad set of topics in textual format. These texts often contain people’s publicly available opinions on various subjects. One topic of special interest is Internet reviews on medications, including information about their positive and adverse effects, qualities, manufacturers, administration regimen, etc. Such information could be useful for comprehensive analysis for the purposes of pharmacovigilance [] and potential medicine re-profiling.
Analysing such a large amount of information is a time-consuming task that calls for methods for the automated extraction of pharmacologically-meaningful data. In turn, these methods require textual corpora with annotation of pharmacological entities and their relations.
There is a wide variety of English-language datasets in the literature, for example, the Drug–Drug Interaction (DDI) and Adverse Drug Event (ADE) corpora. These corpora contain pharmaceutically relevant entities of different types as well as relationships between them. A more detailed analysis of the corpora is presented in Section 2. However, there is currently only one large domain-oriented dataset in the Russian language: the Russian Drug Review Corpus of Internet User Reviews with Complex NER Labeling (RDRS), which was presented by our group [,]. Here, we present (in Section 3.1) an extension of this corpus that includes annotation of the relationships among the named entities that are most relevant for potential studies of drug efficiency.
The automation of the process of extracting meaningful information from a review written in a natural language requires solving the following tasks: text segmentation, Named Entity Recognition (NER), Relation Extraction (RE), structuring of the extracted information, and evaluation of the results. In this paper, we focus on the relation extraction task, which is formulated in natural language processing as follows: given a text and two entities from it, determine whether there is a relation of a certain type between the entities. For example, in the text “Antiviral syrup for children Orvirem—we have an allergy to it!” with the entities “Orvirem” and “allergy”, the task is to determine that the allergy is mentioned as an adverse effect of the “Orvirem” medication.
The relation extraction task can be solved by two approaches: the sequential (cascade) approach of solving the named entity recognition and relation extraction tasks separately, or the combined approach of solving these tasks simultaneously (called “joint” or “end-to-end” in the literature). The sequential solution allows estimating the accuracy of solving each task separately, thus leading to a more thorough analysis of the task complexity; therefore, the scope of our research is to analyze the relation extraction model within the sequential approach, applied to entities already extracted.
Our review (see Section 2) showed that the most promising technology that can be utilized to solve the relation extraction task is deep learning. This paper uses a model based on the XLM-RoBERTa language model, pre-trained on a huge unlabelled corpus of drug reviews. Section 3 contains the details of the model configuration and setting.
Based on this model, a set of computational experiments is performed (Section 5) on different parts of the RDRS corpus. In Section 5.1, the optimal model parameters and text representation are obtained using a part of the corpus that includes texts with ADR–Drugname relations. Section 5.2 presents evaluations on a subset of the corpus containing reviews with multiple contexts. This experiment is aimed at obtaining state-of-the-art results for the task of relation extraction for the following four relation types: ADR–Drugname, Drugname–Diseasename, Drugname–SourceInfoDrug, Diseasename–Indication. The results of the proposed model are compared with the results of the existing language model RuBERT, as well as a set of baseline methods: a multinomial naive Bayes classifier, a linear support vector machine, and dummy classifiers.
The main contributions of our work include:
  • A relation extraction method is proposed, in which the task of determining the presence of a relation is formulated using multi-context annotation: entities belonging to the same context are considered to be related. The method is based on a language model fine-tuned to classify entity pairs by the presence of relations;
  • Several variations of the text representation used to present the entities under consideration to the language model are compared, and the optimal representation is shown to be the one that includes the text of target entities along with the whole review text, concatenated with special tokens;
  • The method based on a language model trained on a large corpus of unlabeled Russian drug review texts and fine-tuned on an annotated corpus of Russian drug reviews is shown to be applicable to the task of determining the relations among pharmaceutically-relevant entities of the newly-created corpus. Accuracy estimations are obtained for this task for the Russian language;
  • The same proposed model, pre-trained on Russian drug reviews, is shown to achieve relation extraction results comparable to the state of the art on the DDI corpus.

3. Materials and Methods

3.1. Datasets

This paper uses the Russian Drug Review Corpus (RDRS) [], which contains 2800 texts of drug reviews written by Internet users. The corpus contains markup for 18 types of named entities, which can be divided into 3 groups:
  • Medication—this group includes everything related to mentions of drugs and drug manufacturers, including: Drugname, Drugclass, Drugform, Route (how to use the drug), Dosage, SourceInfoDrug (the source of the consumer’s information about the drug), etc.;
  • Disease—this group contains entities related to the diseases or reasons for using the drug (disease name, indications, or symptoms), as well as mentions of the effects achieved (NegatedADE—the drug was inefficient, Worse—some deterioration was observed, BNE-POS—the condition improved), etc.;
  • ADR—mentions of occurring adverse reactions.
In a subset of the corpus containing 1590 review texts, entities were marked up into “lines of meaning”—“contexts”, so that each context contains entities that describe the usage of some medication by one person for the treatment of one condition. Different contexts arise in a text, in particular, when describing the use of multiple drugs in the treatment, or different effects following the use of a single drug for different conditions, or when the review describes the use of a drug by different people. In terms of the relation extraction problem, entities that occur in the same context are related, while entities from different contexts are considered unrelated.
An example of the context annotation is shown in Figure 1. The main (1st) context of the review concerns the drug “orvirem”, which caused an allergy. This context includes the following mentions (denoted in the figure with the number 1 above them): “antiviral” (Drugclass), “syrup” (Drugform), “orvirem” (Drugname), multiple mentions of “allergy” (ADR), “red spots” (ADR), “swelling on the face” (ADR), “1 day” (Duration). There are other contexts in the review:
  • 2nd context: “allergy” (Diseasename), “red spots” (Indication), “zyrtek” (Drugname), “the situation did not improve” (NegatedADE), “it seems to have gotten even worse” (Worse).
  • 3rd context: “allergy” (Diseasename), “red spots” (Indication), “doctor” (SourceInfoDrug), “On her recommendation” (SourceInfoDrug), “smecta” (Drugname), “the situation did not improve” (NegatedADE), “it seems to have gotten even worse” (Worse).
  • 4th context: “allergy” (Diseasename), “red spots” (Indication), “doctor” (SourceInfoDrug), “On her recommendation” (SourceInfoDrug), “suprastin” (Drugname), “the situation did not improve” (NegatedADE), “it seems to have gotten even worse” (Worse), “Injected” (Route), “The redness seems to pass” (BNE-POS), “swelling on the face still remains” (NegatedADE).
  • 5th context: “allergy” (Diseasename), “red spots” (Indication), “doctor” (SourceInfoDrug), “prednisone” (Drugname), “Injected” (Route), “The redness seems to pass” (BNE-POS), “swelling on the face still remains” (NegatedADE).
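The context annotation maps directly onto relation labels. A minimal sketch of this mapping follows (this is not the authors’ released code; the entity dictionary structure and field names are our assumptions): pairs of entities of two given types are labeled as related when they share at least one context number.

```python
from itertools import product

def make_pairs(entities, type_a, type_b):
    """Build labeled entity pairs: a shared context number means 'related'."""
    first = [e for e in entities if e["type"] == type_a]
    second = [e for e in entities if e["type"] == type_b]
    pairs = []
    for a, b in product(first, second):
        # Entities occurring in the same context are related (label 1),
        # entities from different contexts are unrelated (label 0).
        label = 1 if a["contexts"] & b["contexts"] else 0
        pairs.append((a["text"], b["text"], label))
    return pairs

entities = [
    {"text": "orvirem", "type": "Drugname", "contexts": {1}},
    {"text": "zyrtek", "type": "Drugname", "contexts": {2}},
    {"text": "allergy", "type": "ADR", "contexts": {1}},
]
print(make_pairs(entities, "ADR", "Drugname"))
# -> [('allergy', 'orvirem', 1), ('allergy', 'zyrtek', 0)]
```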
In Table 1, Table 2, Table 3 and Table 4 the quantitative characteristics of the corpus with contextual markup are presented.
Table 1. The number of texts that contain the corresponding number of contexts.
Table 2. Average lengths of contexts in the corpus.
Table 3. Statistics on the part of RDRS dataset that is comprised of ADR–Drugname relations.
Table 4. Statistics on the types of relations in the RDRS corpus with 908 multi-context reviews.
Figure 1. Example of an annotated review. The labels contain, from left to right: context number, entity type, attribute within the entity type.
In this paper, the following pairs of entities are chosen as the most interesting to analyze from the practical point of view:
  • ADR–Drugname—the relationship between the drug and its side effects;
  • Drugname–SourceInfoDrug—the relationship between the medication and the source of information about it (e.g., “was advised at the pharmacy”, “the doctor recommended it”);
  • Drugname–Diseasename—the relationship between the drug and the disease;
  • Diseasename–Indication—the connection between the illness and its symptoms (e.g., “cough”, “fever 39 degrees”).
Two subsets of the original corpus have been compiled for the experiments:
  • The first one includes 628 texts containing ADR and Drugname entity pairs. The experiments on this part are aimed at selecting the most effective combinations of the input feature representations and hyper-parameters of the methods used. The texts of the RDRS corpus that contain ADR and Drugname entities were divided into training and test parts, the composition of which is presented in Table 3.
  • The second part includes texts that contain multiple contexts. The total number of such texts is 908. Statistics on the types of relationships are presented in Table 4. This corpus is used to establish the current level of accuracy in determining the relationships between pharmacologically-significant entities in Russian-language review texts.
Experiments with these subsets are described further in Section 4.

3.2. Methods

3.2.1. Deep Learning Methods

Language Models

In this work, the XLM-RoBERTa-sag model [] was used. The original XLM-RoBERTa [] is a multilingual language model based on the transformer [] architecture, consisting of multi-head attention layers that create vector representations of the input data parts (words, in the case of NLP) encoding information about their context. XLM-RoBERTa is trained on a large multilingual corpus from the CommonCrawl project that contains 2.5 TB of texts. XLM-RoBERTa-sag is the result of additional training of XLM-RoBERTa on a dataset of unlabeled Internet texts about medicines (~1.65 M texts).
During the adjustment experiments, we used two versions of the model:
  • XLM-RoBERTa-base-sag—12 Transformer blocks, 768 hidden neurons, 8 attention heads, 125 million parameters, 2 epochs of additional training on Russian texts about medications;
  • XLM-RoBERTa-large-sag—24 Transformer blocks, 1024 hidden neurons, 16 attention heads, 355 million parameters, 1 epoch of additional training on Russian texts about medications.
Text preprocessing includes splitting the text into words or word parts (“tokens”). For XLM-RoBERTa-sag, as well as for the original XLM-RoBERTa, such splitting is performed using the SentencePiece tokenizer [].
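For illustration, the sub-word splitting can be reproduced with the publicly available XLM-RoBERTa tokenizer from the HuggingFace transformers library (as stated above, the -sag models use the same SentencePiece tokenizer as the original XLM-RoBERTa; the checkpoint name below is the public base model, not the fine-tuned one):

```python
from transformers import AutoTokenizer

# Load the public XLM-RoBERTa tokenizer (SentencePiece under the hood).
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Frequent words stay whole; rare words are split into sub-word pieces,
# each word-initial piece prefixed with the '▁' marker.
print(tokenizer.tokenize("Antiviral syrup for children Orvirem"))
```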

Input Text Pre-Processing

To solve the classification task, transformer-based language models use a special token, [CLS], added to the input sequence. During training, the loss function is aimed at class prediction based on the vector of the [CLS] token. In this way, the model learns to create a vector representation of the [CLS] token that accumulates information about the text as a whole and is informative in terms of the task being solved.
In the approach proposed in this work, the classification is performed on the basis of the information about a pair of entities for which the existence of a relationship is determined, and the text that mentions this pair. Figure 2 shows the conceptual scheme of our approach to solving the relation extraction task using a language model.
Figure 2. Conceptual scheme of our approach to relation extraction based on a language model.
To provide the language model with information about which entities are of interest, several text representation variants are considered in our experiments:
  • The whole text—the tokenized input text that the language model receives at its input is the whole drug review text, in which target entities are highlighted using special start and end tokens, e.g., [T_ADR] and [\T_ADR] for an entity of type ADR:
    <<[CLS]Antiviral syrup for children [T_DRUG]“Orvirem”[\T_DRUG] - We have an allergy to it! We have a severe allergy after the first day of taking it. Moreover, the boy (3.5 years old) had no allergies to any drugs before. In the morning he woke up covered in [T_ADR]red spots[\T_ADR]. I immediately gave him zyrtek…>>
  • The text of target entities only—only the mentions of the target entities are used as the input text;
  • The text of the target entities and the text between the mentions of the target entities;
  • The text of the target entities concatenated with the whole text:
    [CLS]<<text of first target entity>>[SEP]<<text of second target entity>>[TXTSEP]<<whole text of the drug review>>.
Here, the token [SEP] is placed between the two target entities, and the token [TXTSEP] separates the pair of entities from the whole text.
Potentially, this way of organizing the input data makes it possible to build a more informative vector representation due to the Attention mechanism inside the Transformer layers, and facilitates solving the problem in a classification formulation. The effectiveness of such a text representation was demonstrated previously [].
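A minimal sketch of this input construction follows (the helper name is ours; treating [TXTSEP] as a plain string added to the tokenizer vocabulary is our assumption, only [SEP] is a standard BERT-style token):

```python
def build_input(entity_a: str, entity_b: str, review: str) -> str:
    """Concatenate both target entity texts with the whole review text,
    using the separator tokens described above."""
    return f"{entity_a} [SEP] {entity_b} [TXTSEP] {review}"

sample = build_input(
    "Orvirem",
    "red spots",
    "Antiviral syrup for children Orvirem - we have an allergy to it! ...",
)
# The tokenizer then prepends the model's classification token (the [CLS]
# analogue), whose output vector is used for the binary relation decision.
```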
As mentioned before, there are many degrees of freedom in such models that require consideration in order to achieve higher accuracy. Within the scope of the current research, the following options have been analyzed:
  • Maximum input sequence length (in tokens);
  • Learning rate;
  • Batch size;
  • Maximum learning epoch number;
  • Learning rate decay technique [];
  • Early stopping technique [].

3.2.2. Other Machine Learning Methods

Basic machine learning methods perform quite effectively in many applications [,,]. These methods are highly efficient in terms of computational complexity; due to this, it is possible to search an extensive space of hyperparameters for the optimal set and to test hypotheses relatively quickly.
The first goal of using basic machine learning methods was to obtain a reasonable baseline for the relation extraction task in the pharmacological domain in the Russian language, exceeding the random-guess results of the “Dummy” models, for the purpose of comparison with the deep learning models described in the previous section.
As the textual data representation for the baseline methods, a concatenation of frequency features (term frequency-inverse document frequency, TF-IDF) of the character n-grams of the target entities was used. The n-gram size n and the frequency filter of the TF-IDF method were treated as hyperparameters to tune during the experiments (a sketch of this setup follows the list of methods below).
The second goal of using basic machine learning methods was to check if the information about the entities’ text is sufficient to achieve competitive accuracy for the task.
The following methods were considered in the experiments as basic machine learning methods:
  • Logistic regression []—a basic linear model for text classification using a logistic function to estimate the probability of an example to belong to a certain class;
  • Support vector machine []—a linear model based on building a hyperplane that maximizes the margin between two classes;
  • Multinomial Naive Bayes model []—a popular baseline solution for such text analysis tasks as spam filtering or text classification. It performs text classification based on the occurrence probabilities of words or n-grams;
  • Gradient Boosting []—a strong decision-tree-based ensemble model, which iteratively “boosts” the result by building each next tree to classify the examples that the previous trees failed to classify correctly.
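As a reasonable reconstruction of this baseline setup (the character n-gram range of 3–8 is taken from Section 5.2; everything else, including the toy data, is illustrative), the TF-IDF features of the two target entity texts can be concatenated and fed to a linear SVM:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# One TF-IDF vectorizer per entity slot, over character n-grams.
vec_a = TfidfVectorizer(analyzer="char", ngram_range=(3, 8))
vec_b = TfidfVectorizer(analyzer="char", ngram_range=(3, 8))

ents_a = ["allergy", "red spots"]  # first entity of each pair
ents_b = ["orvirem", "zyrtek"]     # second entity of each pair
labels = [1, 0]                    # relation present / absent

# Concatenate the two sparse feature blocks and fit a linear SVM.
X = hstack([vec_a.fit_transform(ents_a), vec_b.fit_transform(ents_b)])
clf = LinearSVC().fit(X, labels)
```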
In addition, for comparison, the RuBERT [] language model was considered, which is a BERT [] model with 12 layers, 768 hidden neurons each, 12 attention heads, 180 M parameters. RuBERT was trained on the Russian part of Wikipedia and news data. When solving the problem, the language model is used to form a vector representation of the text, which is fed into the linear layer. The output activities of the linear layer are used to determine if there is a relationship between the pair of entities fed to the input.

3.2.3. Dummy Models

“Dummy” models were considered to be the low-level baseline. Such models generate labels randomly or according to some simple principle. The following methods were checked as methods for “dummy” classification:
  • most frequent class labeling—every pair of entities is assigned to the most frequent class in the dataset (in the case of extraction of ADR–Drugname relations in the RDRS dataset, this classifier considers every pair to have a relation);
  • uniform random labeling—labels are predicted randomly according to a uniform probability distribution, without taking into account any characteristics of the input dataset;
  • stratified random labeling—labels are predicted randomly but from the distribution corresponding to that of the input data: the probability of an input example to belong to a class is proportional to the portion of examples of such class in the dataset.
The accuracy of the “dummy” methods based on the random label generation was averaged over 100 launches in order to operate with more stable results and to prevent possible occurrence of random outliers.
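These three strategies correspond directly to scikit-learn’s DummyClassifier; a sketch of the 100-run averaging described above (the toy data is illustrative):

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

def averaged_dummy_f1(strategy, X, y, n_runs=100):
    """Average the f1-macro of a dummy strategy over repeated seeds."""
    scores = []
    for seed in range(n_runs):
        clf = DummyClassifier(strategy=strategy, random_state=seed).fit(X, y)
        scores.append(f1_score(y, clf.predict(X), average="macro", zero_division=0))
    return sum(scores) / len(scores)

X = [[0]] * 6            # features are ignored by dummy strategies
y = [1, 1, 1, 1, 0, 0]   # imbalanced toy labels
for s in ("most_frequent", "uniform", "stratified"):
    print(s, round(averaged_dummy_f1(s, X, y), 3))
```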

4. Experiments

4.1. Accuracy Metric

The performance of a model on the relation extraction task is estimated by the f1-macro metric, in which the f1 score is calculated separately for each class:
$$\mathrm{f1\text{-}score} = \frac{2 \cdot P \cdot R}{P + R}, \qquad P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}.$$
Here P is precision: the proportion of correctly predicted objects of the class A under consideration to the number of objects that the model assigned to class A; R is recall: the proportion of correctly predicted items of class A to the real number of items in class A; TP is the number of true positive instances: the number of relations of class A correctly identified by the model; FP is the number of false positive examples: the number of relations assigned to class A while actually having a different class; FN is the number of false negatives: the number of relations that actually have class A while being incorrectly assigned to a different class by the model.
The overall performance of the model is estimated by averaging the f1-score over the two classes. This method of averaging allows for the uneven numbers of relations in the different classes.
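With scikit-learn, this metric reduces to a single call; a toy example:

```python
from sklearn.metrics import f1_score

y_true = [1, 1, 0, 0, 1]  # 1: relation present, 0: no relation
y_pred = [1, 0, 0, 1, 1]

# 'macro' averages the per-class f1-scores with equal weight,
# which accounts for the uneven class sizes described above.
print(f1_score(y_true, y_pred, average="macro"))
```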

4.2. Selection of the Model Features and Hyperparameters

In these experiments we use a subset of RDRS that contains texts with the ADR and Drugname entities only. The following experimental setup is used:
  • Fixed stratified split into training (80%) and testing (20%) sets; to avoid overfitting, all entity pairs from a given review go either to the training set or to the testing set, so that no review is split between the sets (see the sketch after this list);
  • Hyperparameters of the language model’s fine-tuning process are searched manually so as to maximize the accuracy (by the f1-macro metric) on the validation part of the training set, without taking into account the testing set;
  • The language model involves early stopping and learning rate decay (experiments show the positive effect of these techniques on the model accuracy).
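A sketch of the review-level split mentioned in the first item above: scikit-learn’s GroupShuffleSplit keeps all entity pairs of one review on the same side of the split by treating review ids as groups. Unlike the fixed split used in this work, this variant does not additionally stratify by class, so it is only an approximation:

```python
from sklearn.model_selection import GroupShuffleSplit

pairs = ["pair1", "pair2", "pair3", "pair4"]  # entity-pair examples
labels = [1, 0, 1, 1]
review_ids = [0, 0, 1, 1]  # pairs from the same review share an id

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(pairs, labels, groups=review_ids))
# No review id appears in both train_idx and test_idx.
```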
The experiments on language models have been carried out using a computing cluster node with the following configuration: CPU Intel® Xeon™ E5-2650v2 (2.6 GHz) × 8, 128 Gb RAM, NVIDIA Tesla V100 (16 Gb).

4.3. An Estimation of Efficiency of Selected Methods

After finding the optimal model parameters, the efficiency of the methods has been assessed on a part of the RDRS containing review texts with multiple contexts. Accuracy is measured by the f1-score metric with cross-validation over 5 splits: the data is divided into 5 equal parts, and at each iteration of the cross-validation 80% of the texts are used for fine-tuning the model and 20% for testing.
For a more complete assessment, we compare the proposed method to other machine learning methods different in terms of complexity and type, as well as to a “Dummy” classifier based on the probability distribution of positive and negative examples of the pairs of entities in question (Stratified random labeling).
“Dummy” models and basic machine learning method experiments have been carried out on a local machine with the following configuration: CPU Intel® Core™ i5-7400 @ 3.00 GHz × 4, 16 Gb RAM. The experiments with language models were performed on the same equipment as the experiments in the previous section.
The programming language python 3.8 and software libraries numpy [], sklearn [], pytorch [] and simpletransformers [] were used for software implementation of the described method. As part of a series of experiments, the parameters of the python random number generator, as well as the random number generators of numpy, sklearn, and pytorch libraries were fixed to ensure repeatability of the experiments.
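Seed fixing of this kind can be sketched as follows (the seed value is illustrative, not necessarily the one used in this work; scikit-learn draws from numpy’s generator, so it needs no separate call):

```python
import random
import numpy as np
import torch

SEED = 42  # illustrative value
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)  # no-op on CPU-only machines
```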

5. Results

5.1. Comparison of the Model Features and Hyperparameters

This section compares the results of experiments on the identification of entity relations using XLM-RoBERTa-large-sag and XLM-RoBERTa-base-sag with the different input representations described in Section 3.2.1.
The comparison shows that, in order to achieve high accuracy and outperform basic machine learning methods, the language model should receive both the target entities separated from the text and the entire text. The f1-macro achieved for ADR–Drugname relations from the RDRS dataset is 95% (see Table 5). This estimation is 41% higher than random class prediction, and 20% higher than basic machine learning models, even if the hyperparameters of the latter are tuned on the test set.
Table 5. Accuracies (by the f1-macro metric) of XLM-RoBERTa-base-sag (denoted “LM-Base”) and XLM-RoBERTa-large-sag (“LM-Large”) language models with different methods of text representation.
The optimal hyperparameter values found for XLM-RoBERTa-base-sag are:
  • maximum input length—512;
  • early stopping—active;
  • learning rate—0.00005;
  • batch size—32;
  • maximum epochs—10;
  • learning rate decay—active;
The resulting hyperparameter values for XLM-RoBERTa-large-sag are:
  • maximum input length—512;
  • early stopping—active;
  • learning rate—0.00001;
  • batch size—8 (there was not enough memory for a bigger batch size with XLM-RoBERTa-large);
  • maximum epochs—10;
  • learning rate decay—active;
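For reference, a hedged sketch of the fine-tuning setup with the simpletransformers library used in this work, filled with the hyperparameters listed above for XLM-RoBERTa-base-sag (the checkpoint identifier is an assumption; the actual trained models are published on the authors’ HuggingFace page, see the Data Availability Statement):

```python
from simpletransformers.classification import ClassificationModel

model = ClassificationModel(
    "xlmroberta",
    "sagteam/xlm-roberta-base-sag",  # assumed checkpoint id, see Data Availability
    use_cuda=False,                  # set True on a GPU node
    args={
        "max_seq_length": 512,
        "learning_rate": 5e-5,
        "train_batch_size": 32,
        "num_train_epochs": 10,
        "use_early_stopping": True,
        "scheduler": "linear_schedule_with_warmup",  # learning rate decay
    },
)
# model.train_model(train_df)  # train_df: a DataFrame with columns [text, labels]
```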

5.2. Estimation of the Relation Extraction Efficiency

As a result of the experiments conducted on the 908 reviews from the RDRS corpus that have multi-context annotation, accuracy has been estimated for the task of determining the relationships between pharmacologically-significant entities using the method developed on the basis of the XLM-RoBERTa language model. The accuracy of the proposed method in comparison with the baseline classifiers is given in Table 6. Accuracy is measured by the f1-score metric averaged over five cross-validation splits and is presented separately for the positive (relation present) class and the negative (no relation) class. The results for the baseline machine learning methods are obtained with the input represented by target entity pairs encoded with TF-IDF of character n-grams of 3–8 characters.
Table 6. Accuracy of predicting relations (pos) and absence of relations (neg) between entity pairs of different types in multi-context reviews from the RDRS dataset.
As follows from this table, the proposed model determines the four relations under consideration with the following accuracy (according to the f1-score metric for the positive class): between adverse drug reactions and drugs (ADR–Drugname), 92.7%; between drugs and diseases (Drugname–Diseasename), 89.9%; between a drug and its source of information (Drugname–SourceInfoDrug), 92.9%; between diseases and symptoms (Diseasename–Indication), 87.1%. This is 43.5%, 40%, 41.5%, and 38.2% higher than the accuracy of the dummy classifier, and 3.9%, 3.8%, 3.5%, and 2.1% higher than that of RuBERT, respectively. At the same time, for the class without a relation between entities (negative class), the accuracy is more volatile, taking the values of 91.1%, 76.2%, 82.7%, and 31%. However, these values exceed the dummy classifier accuracy by 59.3%, 42.9%, 49.8%, and 9%, and RuBERT by 14.9%, 10.0%, 20.1%, and 3.3%, respectively. On average, the developed model outperforms RuBERT by 3.6%, achieving an f1-score of 80.4%.

5.3. Applying the Proposed Approach to the DDI Dataset

In order to test the applicability of the proposed model to texts in another language, we evaluated our model on the well-known Drug–Drug Interaction (DDI) dataset [], used in the SemEval-2013 challenge as a dataset for biomedical relation extraction.
The DDI dataset is a manually annotated corpus consisting of 792 texts selected from the DrugBank database and 233 Medline abstracts. The dataset has been annotated with a total of 18,502 pharmacological substances and 5028 relations. The dataset includes named entities of the following types:
  • Drug—used to annotate those human medicines known by a generic name;
  • Brand—drugs described by a trade or brand name;
  • Group—drug interaction descriptions often mention groups of drugs, which are annotated with the “group” entity type;
  • Drug_n—active substances that were not approved for human use, such as toxins or pesticides.
The relations annotated in the dataset are four types of drug-drug interactions (DDIs):
  • Mechanism—this type is used to annotate DDIs that are described by their pharmacokinetic mechanism;
  • Effect—this type is used to annotate DDIs describing an effect or a pharmacodynamic mechanism;
  • Advice—this type is used when a recommendation or advice regarding a drug interaction is given;
  • Int—this type is used when a DDI appears in the text without any additional information.
When applying the proposed model to the DDI dataset, the model has been fine-tuned to DDI, but pre-training on Russian texts has remained the same. For the fine-tuning and testing on DDI, the data split is the same as in the BLURB project []—624/90/191 documents for train/validation/test sets respectively.
Experiments with the text representation “target entities and the whole text” (described in Section 3.2.1) yield a micro-averaged f1-score of 71.2%. We have therefore modified the input text representation for higher accuracy on the DDI dataset. Inspired by the entity screening technique from the literature [], we have employed both highlighting the target entity mentions with tags and concatenating the target entity mentions with the whole input text. For example, the text: “Cytochrome P-450 inducers, such as phenytoin, carbamazepine and phenobarbital, induce clonazepam metabolism, causing an approximately 30% decrease in plasma clonazepam levels.” was represented as: “phenytoin [SEP] carbamazepine [TXTSEP] Cytochrome P-450 inducers, such as [T]phenytoin[/T], [T]carbamazepine[/T] and phenobarbital, induce clonazepam metabolism, causing an approximately 30% decrease in plasma clonazepam levels.”
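A minimal sketch of this modified DDI representation (the helper name is ours; the naive replace tags only the first occurrence of each mention and ignores sub-string collisions):

```python
def build_ddi_input(ent_a: str, ent_b: str, sentence: str) -> str:
    """Tag both target drug mentions in the text and prepend the pair."""
    for ent in (ent_a, ent_b):
        sentence = sentence.replace(ent, f"[T]{ent}[/T]", 1)
    return f"{ent_a} [SEP] {ent_b} [TXTSEP] {sentence}"

print(build_ddi_input(
    "phenytoin", "carbamazepine",
    "Cytochrome P-450 inducers, such as phenytoin, carbamazepine and "
    "phenobarbital, induce clonazepam metabolism.",
))
```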
The resulting accuracy by the f1-metric, micro-averaged over the four relation classes, is 81.46%, which is comparable to the accuracy achieved by other language-model-based approaches [,,] for determining relations between entities extracted from this dataset, the state of the art being 84.05% [].

6. Discussion

Table 6 shows that accuracy varies notably across relation types. Preliminary analysis of the relations of different types shows that the Diseasename–Indication relation has the following distinctive features: a low number of negative class samples (pairs of entities of the desired type that have no relation); a high fraction of unique pairs of entities (approx. 65%); and a high fraction (approx. 35%) of unique relations that are represented with different classes in different texts (the same entity pair that has a relation in one text may have no relation in another text). All these factors—the unevenness of classes and the ambiguity of the relation existence between mentions of the same entities in different texts—make the classification task more difficult for the machine learning model. As a solution, we consider conducting further research on the data structure and the classification results, as well as extending our dataset with more relations of the types that have lower representation in the corpus.
Overall, the developed approach shows the highest accuracy out of the group of methods considered: the language model RuBERT, trained on the Russian Wikipedia and news, classic machine learning algorithms (LinearSVM and Multinomial Naive Bayes), and the baseline “dummy” method of stratified random labeling.
Though accuracy is the key performance value of the machine learning models, another important metric is their computational complexity. XLM-RoBERTa is the most resource-intensive model among those considered—it has approximately 550 million parameters, while RuBERT has approximately 110 million parameters. A limitation of XLM-RoBERTa-large is that it requires a GPU to work efficiently.
Another limitation of the transformer language models related to the computational complexity is a limit on the input sequence length—the input of the base BERT model cannot have a size larger than 512 tokens, while RoBERTa-large has this limitation set to 1024 tokens. In the case of longer texts, special approaches are needed in order to work efficiently and to use information about the whole text.
It is worth mentioning that this work considers the relation extraction task based on the ground truth named entity annotation, therefore, further research is required to determine the method’s efficiency when the named entities are predicted by another model.

7. Conclusions

The research conducted shows the strong dependency of the accuracy of entity relation identification on the structure of the input text representation when using pre-trained language models based on the Transformer topology. The highest accuracy is obtained with our proposed model XLM-RoBERTa-large-sag with texts represented in the following form: the text of the first entity of the potential relation, followed by the text of the second entity of the potential relation and the whole input text. The information contained in the text between the target entities proved to be insufficient to achieve the same accuracy with the same model; presenting the entire review text to the language model is thus necessary.
The average f1-score obtained over 4 relation types is 80.4%. At the same time, the RuBERT model yields a 3.6% lower f1-score, Linear SVM—21.77% lower, baseline stratified random labeling method—30.4% lower.
On the DDI dataset the same model achieves 81.46%, which is comparable to the state-of-the-art 84.05% obtained on that dataset by other language models trained on pre-defined NER annotation.
Another important observation is the volatility of the accuracy across the relation types, which could be explained by the imbalance in the number of relations of different types, in the number of unique entity mentions, and in the distribution of the relations of a particular type between the training and test subsets. This issue could be addressed by enlarging the corresponding parts of the context-labeled dataset and by balancing the numbers mentioned above.
Overall, the results obtained provide the state-of-the-art accuracy level for the task of pharmacological entity relation identification in Russian-language reviews and could be positioned as a basis for the future tasks of automated analysis of medical reviews.

Author Contributions

Conceptualization, A.S. (Alexander Sboev) and R.R.; methodology, A.S. (Alexander Sboev) and I.M.; software, A.S. (Anton Selivanov), G.R., I.M. and A.G.; validation, G.R. and A.S. (Anton Selivanov); investigation, A.S. (Alexander Sboev), A.S. (Anton Selivanov), G.R. and I.M.; resources, A.S. (Alexander Sboev) and R.R.; data curation, A.S. (Alexander Sboev), S.S. and A.G.; writing—original draft preparation, R.R., A.S. (Anton Selivanov) and A.S. (Alexander Sboev); writing—review and editing, A.S. (Alexander Sboev), R.R. and A.S. (Anton Selivanov); visualization, A.G. and G.R.; supervision, A.S. (Alexander Sboev); project administration, R.R.; funding acquisition, A.S. (Alexander Sboev). All authors have read and agreed to the published version of the manuscript.

Funding

This work has been supported by the Russian Science Foundation grant No. 20-11-20246.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data can be obtained through sending a request from the website of our project: https://sagteam.ru/en/med-corpus/ (accessed on 30 October 2021). Trained models are presented on the page of our team on the huggingface repository: https://huggingface.co/sagteam (accessed on 30 October 2021). The code is available at https://github.com/sag111/Relation_Extraction (accessed on 30 October 2021).

Acknowledgments

This work has been carried out using computing resources of the federal collective usage center Complex for Simulation and Data Processing for Mega-science Facilities at NRC “Kurchatov Institute”, http://ckp.nrcki.ru/ (accessed on 30 October 2021).

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

  1. Segura-Bedmar, I.; Martínez, P. Pharmacovigilance through the development of text mining and natural language processing techniques. J. Biomed. Inform. 2015, 58, 288–291. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  2. Sboev, A.; Sboeva, S.; Gryaznov, A.; Evteeva, A.; Rybka, R.; Silin, M. A neural network algorithm for extracting pharmacological information from russian-language internet reviews on drugs. J. Phys. Conf. Ser. 2020, 1686, 012037. [Google Scholar] [CrossRef]
  3. Sboev, A.; Sboeva, S.; Moloshnikov, I.; Gryaznov, A.; Rybka, R.; Naumov, A.; Selivanov, A.; Rylkov, G.; Ilyin, V. An analysis of full-size Russian complexly NER labelled corpus of Internet user reviews on the drugs based on deep learning and language neural nets. arXiv 2021, arXiv:cs.CL/2105.00059. [Google Scholar]
  4. Oliveira, A.; Braga, H. Artificial Intelligence: Learning and Limitations. Wseas Trans. Adv. Eng. Educ. 2020, 17, 80–86. [Google Scholar] [CrossRef]
  5. Al-Haija, Q.A.; Jebril, N. A Systemic Study of Pattern Recognition System Using Feedback Neural Networks. Wseas Trans. Comput. 2020, 19, 115–121. [Google Scholar] [CrossRef]
  6. Ganesh, P.; Rawal, B.; Peter, A.; Giri, A. POS-Tagging based Neural Machine Translation System for European Languages using Transformers. Wseas Trans. Inf. Sci. Appl. 2021, 18, 26–33. [Google Scholar] [CrossRef]
  7. Xu, H.; Van Durme, B.; Murray, K. BERT, mBERT, or BiBERT? A Study on Contextualized Embeddings for Neural Machine Translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 6663–6675. [Google Scholar]
  8. Ge, Z.; Sun, Y.; Smith, M. Authorship attribution using a neural network language model. In Proceedings of the AAAI Conference on Artificial Intelligence, Burlingame, CA, USA, 8–12 October 2016; Volume 30. [Google Scholar]
  9. Peters, M.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; Volume 1. [Google Scholar]
  10. Luong, M.T.; Pham, H.; Manning, C.D. Effective approaches to attention-based neural machine translation. arXiv 2015, arXiv:1508.04025. [Google Scholar]
  11. Portelli, B.; Passabi, D.; Serra, G.; Santus, E.; Chersoni, E. Improving Adverse Drug Event Extraction with SpanBERT on Different Text Typologies. In Proceedings of the 5th International Workshop on Health Intelligence (W3PHIAI-21), Palo Alto, CA, USA, 8–9 February 2021. [Google Scholar]
  12. Yan, H.; Gui, T.; Dai, J.; Guo, Q.; Zhang, Z.; Qiu, X. A Unified Generative Framework for Various NER Subtasks. arXiv 2021, arXiv:2106.01223. [Google Scholar]
  13. Ge, S.; Wu, F.; Wu, C.; Qi, T.; Huang, Y.; Xie, X. FedNER: Privacy-Preserving Medical Named Entity Recognition with Federated Learning. Available online: https://arxiv.org/abs/2003.09288 (accessed on 30 October 2021).
  14. Wu, S.; He, Y. Enriching pre-trained language model with entity information for relation classification. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; pp. 2361–2364. [Google Scholar]
  15. Giorgi, J.; Wang, X.; Sahar, N.; Shin, W.Y.; Bader, G.D.; Wang, B. End-to-end named entity recognition and relation extraction using pre-trained language models. arXiv 2019, arXiv:1912.13415. [Google Scholar]
  16. Eberts, M.; Ulges, A. Span-Based Joint Entity and Relation Extraction with Transformer Pre-Training. In ECAI 2020; IOS Press: Amsterdam, Netherlands, 2020; pp. 2006–2013. [Google Scholar]
  17. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2019, 36, 1234–1240. [Google Scholar] [CrossRef]
  18. Gu, Y.; Tinn, R.; Cheng, H.; Lucas, M.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; Poon, H. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. arXiv 2020, arXiv:2007.15779. [Google Scholar] [CrossRef]
  19. Gordeev, D.; Davletov, A.; Rey, A.; Akzhigitova, G.; Geymbukh, G. Relation extraction dataset for the russian language. In Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialog” [Komp’iuternaia Lingvistika i Intellektual’nye Tehnologii: Trudy Mezhdunarodnoj Konferentsii “Dialog”]; Russian State University For The Humanities: Moscow, Russia, 2020. [Google Scholar]
  20. Naseem, U.; Dunn, A.G.; Khushi, M.; Kim, J. Benchmarking for biomedical natural language processing tasks with a domain specific albert. arXiv 2021, arXiv:2107.04374. [Google Scholar]
  21. Ju, M.; Nguyen, N.T.; Miwa, M.; Ananiadou, S. An ensemble of neural models for nested adverse drug events and medication extraction with subwords. J. Am. Med. Inform. Assoc. 2020, 27, 22–30. [Google Scholar] [CrossRef]
  22. Joshi, M.; Chen, D.; Liu, Y.; Weld, D.S.; Zettlemoyer, L.; Levy, O. Spanbert: Improving pre-training by representing and predicting spans. Trans. Assoc. Comput. Linguist. 2020, 8, 64–77. [Google Scholar] [CrossRef]
  23. Wang, J.; Lu, W. Two Are Better than One: Joint Entity and Relation Extraction with Table-Sequence Encoders. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 1706–1721. [Google Scholar]
  24. Patrick, J.; Li, M. High accuracy information extraction of medication information from clinical notes: 2009 i2b2 medication extraction challenge. J. Am. Med. Inform. Assoc. 2010, 17, 524–527. [Google Scholar] [CrossRef] [Green Version]
  25. Anick, P.; Hong, P.; Xue, N.; Anick, D. I2B2 2010 challenge: Machine learning for information extraction from patient records. In Proceedings of the 2010 i2b2/VA Workshop on Challenges in Natural Language Processing for Clinical Data, Boston, MA, USA, 12 November 2010. [Google Scholar]
  26. Henry, S.; Buchan, K.; Filannino, M.; Stubbs, A.; Uzuner, O. 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records. J. Am. Med. Inform. Assoc. 2019, 27, 3–12. [Google Scholar] [CrossRef]
  27. Herrero-Zazo, M.; Segura-Bedmar, I.; Martínez, P.; Declerck, T. The DDI corpus: An annotated corpus with pharmacological substances and drug–drug interactions. J. Biomed. Inform. 2013, 46, 914–920. [Google Scholar] [CrossRef] [Green Version]
  28. Asada, M.; Miwa, M.; Sasaki, Y. Using Drug Descriptions and Molecular Structures for Drug-Drug Interaction Extraction from Literature. Bioinformatics 2020, 37, 1739–1746. [Google Scholar] [CrossRef]
  29. Beltagy, I.; Lo, K.; Cohan, A. SciBERT: Pretrained Language Model for Scientific Text. arXiv 2019, arXiv:1903.10676. [Google Scholar]
  30. Gurulingappa, H.; Rajput, A.M.; Roberts, A.; Fluck, J.; Hofmann-Apitius, M.; Toldo, L. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. J. Biomed. Inform. 2012, 45, 885–892. [Google Scholar] [CrossRef]
  31. Bruches, E.; Pauls, A.; Batura, T.; Isachenko, V. Entity Recognition and Relation Extraction from Scientific and Technical Texts in Russian. In Proceedings of the 2020 Science and Artificial Intelligence Conference (SAI Ence), Novosibirsk, Russia, 14–15 November 2020; pp. 41–45. [Google Scholar]
  32. Ivanin, V.; Artemova, E.; Batura, T.; Ivanov, V.; Sarkisyan, V.; Tutubalina, E.; Smurov, I. Rurebus-2020 shared task: Russian relation extraction for business. In Computational Linguistics and Intellectual Technologies; Russian State University for the Humanities: Moscow, Russia, 2020; pp. 416–431. [Google Scholar]
  33. Bondarenko, I.; Berezin, S.; Pauls, A.; Batura, T.; Rubtsova, Y.; Tuchinov, B. Using Few-Shot Learning Techniques for Named Entity Recognition and Relation Extraction. In Proceedings of the 2020 Science and Artificial Intelligence Conference (SAI Ence), Novosibirsk, Russia, 14–15 November 2020; pp. 58–65. [Google Scholar]
  34. Loukachevitch, N.; Artemova, E.; Batura, T.; Braslavski, P.; Denisov, I.; Ivanov, V.; Manandhar, S.; Pugachev, A.; Tutubalina, E. NEREL: A Russian Dataset with Nested Named Entities and Relations. arXiv 2021, arXiv:2108.13112. [Google Scholar]
  35. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised cross-lingual representation learning at scale. arXiv 2019, arXiv:1911.02116. [Google Scholar]
  36. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.U.; Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30, pp. 5998–6008. [Google Scholar]
  37. Kudo, T.; Richardson, J. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv 2018, arXiv:1808.06226. [Google Scholar]
  38. Sboev, A.; Selivanov, A.; Rybka, R.; Moloshnikov, I.; Rylkov, G. Evaluation of Machine Learning Methods for Relation Extraction Between Drug Adverse Effects and Medications in Russian Texts of Internet User Reviews. Available online: https://pos.sissa.it/410/006/pdf (accessed on 30 October 2021).
  39. Smith, L.N. Cyclical learning rates for training neural networks. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017; pp. 464–472. [Google Scholar]
  40. Caruana, R.; Lawrence, S.; Giles, L. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. Adv. Neural Inf. Process. Syst. 2000, 13, 402–408. [Google Scholar]
  41. Sahoo, K.S.; Tripathy, B.K.; Naik, K.; Ramasubbareddy, S.; Balusamy, B.; Khari, M.; Burgos, D. An evolutionary SVM model for DDOS attack detection in software defined networks. IEEE Access 2020, 8, 132502–132513. [Google Scholar] [CrossRef]
  42. Chun, P.J.; Izumi, S.; Yamane, T. Automatic detection method of cracks from concrete surface imagery using two-step light gradient boosting machine. Comput.-Aided Civil Infrastruct. Eng. 2021, 36, 61–72. [Google Scholar] [CrossRef]
  43. Xu, F.; Pan, Z.; Xia, R. E-commerce product review sentiment classification based on a naïve Bayes continuous learning framework. Inf. Process. Manag. 2020, 57, 102221. [Google Scholar] [CrossRef]
  44. Hosmer, D.W., Jr.; Lemeshow, S.; Sturdivant, R.X. Applied Logistic Regression; John Wiley & Sons: Hoboken, NJ, USA, 2013; Volume 398. [Google Scholar]
  45. Suykens, J.A.; Vandewalle, J. Least squares support vector machine classifiers. Neural Process. Lett. 1999, 9, 293–300. [Google Scholar] [CrossRef]
  46. Rish, I. An empirical study of the naive Bayes classifier. In Proceedings of the IJCAI 2001 workshop on empirical methods in artificial intelligence, Seattle, WA, USA, 4 August 2001; Volume 3, pp. 41–46. [Google Scholar]
  47. Mason, L.; Baxter, J.; Bartlett, P.; Frean, M. Boosting algorithms as gradient descent in function space. In Proceedings of the NIPS, Denver, CO, USA, 29 November–4 December 1999; Volume 12, pp. 512–518. [Google Scholar]
  48. Kuratov, Y.; Arkhipov, M. Adaptation of deep bidirectional multilingual transformers for Russian language. In Komp’juternaja Lingvistika i Intellektual’nye Tehnologii; Russian State University For The Humanities: Moscow, Russia, 2019; pp. 333–339. [Google Scholar]
  49. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  50. Harris, C.R.; Millman, K.J.; van der Walt, S.J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N.J.; et al. Array programming with NumPy. Nature 2020, 585, 357–362. [Google Scholar] [CrossRef]
  51. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  52. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. In Proceedings of the 33rd Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32, pp. 8026–8037. [Google Scholar]
  53. Rajapakse, T.C. Simple Transformers. 2019. Available online: https://github.com/ThilinaRajapakse/simpletransformers (accessed on 30 October 2021).
  54. raj Kanakarajan, K.; Kundumani, B.; Sankarasubbu, M. BioELECTRA: Pretrained Biomedical text Encoder using Discriminators. In Proceedings of the 20th Workshop on Biomedical Language Processing, Online, 11 June 2021; pp. 143–154. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
