PharmKE: Knowledge Extraction Platform for Pharmaceutical Texts using Transfer Learning

The task of recognizing named entities in a given text has been a very dynamic field in recent years. This is due to advances in neural network architectures, the increase in computing power, and the availability of diverse labeled datasets, which together deliver pre-trained, highly accurate models. These models are generally focused on tagging common entities, but domain-specific use-cases require tagging custom entities which are not part of the pre-trained models. This can be solved either by fine-tuning the pre-trained models, or by training custom models. The main challenge lies in obtaining reliable labeled training and test datasets, since manual labeling is a highly tedious task. In this paper we present PharmKE, a text analysis platform focused on the pharmaceutical domain, which applies deep learning in several stages for a thorough semantic analysis of pharmaceutical articles. It performs text classification using state-of-the-art transfer learning models, and thoroughly integrates the results obtained through a proposed methodology. The methodology is used to create accurately labeled training and test datasets, which are then used to train models for custom entity labeling tasks centered on the pharmaceutical domain. The obtained results are compared to fine-tuned BERT and BioBERT models trained on the same dataset. Additionally, the PharmKE platform integrates the results obtained from named entity recognition tasks to resolve co-references of entities and analyze the semantic relations in every sentence, thus setting up a baseline for additional text analysis tasks, such as question answering and fact extraction. The recognized entities are also used to expand the knowledge graph generated by DBpedia Spotlight for a given pharmaceutical text.


Introduction
We are currently facing a situation where huge amounts of data are being generated continuously, in all aspects of our lives. The main sources of this data are online social media platforms and news portals. Given their volume, it is generally hard for an individual to keep track of all the information stored within the data. Historically, whenever people did not have the capacity to finish a given task, they tended to invent tools to help them. In this case, we want to use natural language processing (NLP) tools to perform intelligent knowledge extraction (KE) and use it to filter and receive only the news that is of interest to us.
In this paper, we are particularly interested in extracting named entities from the pharmaceutical domain, namely entities which represent Pharmaceutical Organizations and Drugs. This NLP task is referred to as named entity recognition (NER) [1, 2]. It aims to detect the entities of a given type in a text corpus. NER takes a central place in many NLP systems, as a baseline task for information extraction, question answering, and more.
Our interest in this topic stems from a problem we are facing in our LinkedDrugs dataset [3], where the collected drug products can have active ingredients (Drug entities) and manufacturers (Pharmaceutical Organization entities) written in a variety of ways, depending on the data source, country of registration, language, etc. Our initial work showed promising results [4], and we want to build on it. The ambiguity in entity naming in our drug products dataset makes the process of data analysis imprecise, thus using NER to normalize these name values for the active ingredients and manufacturers can significantly improve both the quality of the dataset, as well as the results from any analytical task on top of it.
The recent advances in neural network architectures improved NER accuracy, mainly by leveraging bidirectional long short-term memory (LSTM) networks [5, 6], convolutional networks [7], and lately, transformer architectures [8]. Many language processing libraries have been made available to the public throughout the years [9] from both academia and industry, equipped with highly accurate pre-trained models for extraction of common entity classes, such as Person, Date, Location, Organization, etc. However, as a given business might require detection of more specific entities in text, these models should either be fine-tuned, or trained anew with corresponding datasets for the desired entity types.
The main challenge resides in obtaining the large amount of labeled training data required to train a highly accurate model. Even though multiple manually labeled, highly accurate and generic datasets exist on the Web [10], their usage might not be feasible for the task at hand. Relevant data might either be unavailable on the Internet, or too costly to label manually.
As a solution to this problem, we propose a methodology that can be used to automatically create labeled datasets for custom entity types, showcased in texts from the pharmaceutical domain. In our case, this methodology is applied by tagging Pharmaceutical Organizations in pharmacy-related news. We show that it can be extended to tagging other custom entities in different texts in the pharmaceutical domain by tagging Drug entities as well, and assessing the obtained results. The main focus is the automatic application of common language processing tasks, such as tokenization, dealing with punctuation and stop words, lemmatization, as well as the possibility for application of custom, business-case-related text processing functions, like joining consecutive tokens to tag a multi-token entity, or performing text similarity computations.
The overall applicability and accuracy of this methodology is assessed by using two well-known language processing libraries, spaCy [11] and AllenNLP [12], which come with a pre-trained model based on convolutional layers with residual connections and a pre-trained model based on ELMo embeddings [13], respectively. The custom trained models which are able to tag the custom entity Pharmaceutical Organization achieve high tagging accuracy compared to the accuracy of the initial pre-trained models when tagging the more generic Organization entity over the same testing dataset. In addition, a model trained on the same dataset by fine-tuning the state-of-the-art BERT is used to gain a better insight into the results. Lastly, a fine-tuned BioBERT [14], a model based on the BERT architecture and pre-trained on biomedical text corpora, is also used to better assess the results. A thorough explanation of the methodology used to generate the labeled datasets is given in the following sections, followed by custom model training and accuracy assessment.
The extracted entities can help us filter the documents and news which mention them, but in the current era of data overflow, this is not enough. Therefore, we go one step further and integrate these results in a platform which then extracts and visualizes the knowledge related to these entities. This platform currently integrates state-of-the-art NLP models for co-reference resolution [13] and Semantic Role Labeling [15] in order to extract the context in which the entities of interest appear. The platform additionally offers convenient visualization of the obtained findings, which brings the relevant concepts closer to the people who use the platform. This knowledge extraction process is then finalized by generating a Knowledge Graph (KG) using the Resource Description Framework (RDF) [16], a graph-oriented knowledge representation of the entities and their relations. This provides two main advantages: the RDF graph-data model allows seamless integration of the results from multiple knowledge extraction processes over various news sources within the platform, and at the same time links the extracted entities to their counterparts within DBpedia [17] and the rest of the Linked Data on the Web [18]. This provides the users of the platform with uniform access to the entire knowledge extracted within the platform, as well as the relevant linked knowledge already present in the publicly available knowledge graphs.

Related Work
Named entity recognition (NER), as a key component in NLP systems for annotating entities with their corresponding classes, enriches the semantic context of the words by adding hierarchical identification. Currently, there is a lot of new work being done in this field, especially in the process of neural network optimization for label sequencing, which outperforms early NER systems based on domain dictionaries, lexicons, orthographic feature extraction and semantic rules. Starting with [19], neural network NER systems with minimal feature engineering have become popular, due to the performance they achieve. They do so by introducing unified, task-independent neural sequence labeling models, using convolutional neural networks (CNN) and n-dimensional representations of words.
Character-level models treat text as distributions over characters and are able to generate embeddings for any string of characters within any textual context. With this, they improve the generalization of the model on both frequent and unseen words, which makes them popular in the biomedical domain. A model based on stacked bidirectional long short-term memory (LSTM) is introduced in [20]. This model inputs characters and outputs tag probabilities for each character, achieving state-of-the-art NER performance in seven languages without using additional lexicons and hand-engineered features. In [21], the authors present a language model composed of a CNN and an LSTM, where they use characters as input to form a word representation for each token in the sentence, thus outperforming word/morpheme-level LSTM baselines.
In [22], the authors propose a Biomedical Named Entity Recognition (Bio-NER) method based on a deep neural network architecture, which leverages word representations pre-trained on unlabeled data collected from the PubMed database with a skip-gram language model. In [23], the authors utilized word embedding techniques to capture the semantics of the words in the sentence and built a generic model based on a long short-term memory network-conditional random field (LSTM-CRF), which outperforms state-of-the-art entity-specific NER tools.
Starting from 2018, Sequence-to-Sequence (Seq2Seq) architectures which work with text became a popular topic in NLP, due to their powerful ability to transform a given sequence of elements into another sequence, a concept which fits well in machine translation. Transformers are models which implement the Seq2Seq architecture by using an encoder-decoder structure.
One of the latest milestones in this development is the release of Google's BERT [8], which is based on a transformer architecture and integrates an attention mechanism [24]. It produces outstanding results on many NLP tasks, including NER, due to its ability to learn contextual relations between words (or sub-words) in a text, making it applicable in the biomedical and pharmaceutical domains. Hakala and Pyysalo [25] present an approach based on Conditional Random Fields (CRF) and multilingual BERT for biomedical named entity recognition on content in Spanish. In [26], the authors explore feature-based and fine-tuning training strategies for the BERT model for NER in Portuguese. Lamurias and Couto [27] present an approach based on a transformer architecture for question answering in the biomedical domain.
BioBERT [28] is a domain-specific language representation model. Using the BERT architecture, it is pre-trained on large general domain corpora (English books, Wikipedia, etc.) and on biomedical domain corpora (PubMed abstracts, PMC full-text articles). This language model provides improved results in various biomedical text mining tasks, including NER.
Transfer learning, as a machine learning method, provides the concept of re-usability in neural networks, where a model developed for one task can be reused as the starting point of the training process for another problem that has a significantly smaller training set. In recent years, transfer learning has been one of the most popular approaches in computer vision and NLP tasks, since it outperforms the state-of-the-art models in many use-cases, and does so by using smaller training sets for fine-tuning and far fewer computational resources.
Transfer learning has enabled an increase of the F1 score for co-reference resolution tasks over the past few years, allowing it to reach a satisfying average of 73%. This task is focused on clustering mentions within a text that refer to the same underlying real-world entities. Different approaches use biLSTM and attention mechanisms to compute span representations and then find co-reference chains through a softmax mention ranking model [29]. Adding ELMo and coarse-to-fine and second-order inference to this approach has resulted in a significant improvement of the F1 score, achieving the above mentioned average of 73%. This task is evaluated with the OntoNotes co-reference annotations from the CoNLL-2012 shared task [30], which involved predicting co-reference in English, Chinese, and Arabic, using the final version (5.0) of the OntoNotes corpus. It provides an accurate and integrated annotation of multiple levels of the shallow semantic structure in text in multiple languages.
On the other hand, applying transfer learning to the task of semantic role labeling shows that a simple BERT-based model can achieve state-of-the-art performance compared to the previous state-of-the-art neural models that incorporated lexical and syntactic features, such as part-of-speech tags and dependency trees [15]. The reason lies in the fact that semantic role labeling can be decomposed into four tasks: predicate detection, predicate sense disambiguation, argument identification, and argument classification, where the predicate disambiguation task is focused on identifying the correct meaning of a predicate in a given context, allowing it to be formulated as a sequence labeling task, where BERT really shines.
There are multiple ways to construct an RDF-based Knowledge Graph (KG), which generally depend on the source data. In our case, we work with extracted and labeled data, so we can utilize existing solutions which recognize and match the entities in our data with their corresponding versions in other publicly available KGs. One such tool is DBpedia Spotlight, an open source solution for automatic annotation of DBpedia entities in natural language text [31]. It provides phrase spotting and disambiguation, i.e. entity linking, for the provided input. Its disambiguation algorithm is based upon cosine similarities and a modification of TF-IDF weights. The main phrase spotting algorithm is exact string matching, which uses LingPipe's Aho-Corasick implementation.
There are many platforms, like AllenNLP [12] and spaCy [11], which aim to provide demo pages for NLP model testing, and code snippets for easier usage by machine learning experts. On the other hand, projects like Hugging Face's Transformers [32] and DeepPavlov AI [33] are libraries that significantly speed up prototyping and simplify the creation of new solutions based on the existing NLP models.
However, to the best of our knowledge, there is no complete solution for knowledge extraction in the pharmaceutical domain that is human-centric and enables visualization of the results in a human-understandable format. In this paper, we present a platform which aims to fill this gap.

PharmKE Knowledge Extraction Platform
This section describes our PharmKE platform [34, 35], which goes a step further in understanding pharmaceutical texts: on top of identifying Drugs and Pharmaceutical Organizations, it also extracts relations in the mentioned context and constructs a Knowledge Graph from them. The platform covers the entire process of understanding a document and its content, from its classification and filtering, i.e. whether it belongs to the pharmaceutical domain, all the way to visualization of the entities and their semantic relations, as shown in Fig. 1. Each of the steps is described in more detail within this section.
The PharmKE platform can be formally represented with the following functional expression: (1) The functional expression (1) shows that the platform is designed to combine the best of the available models in each of the steps, while also enabling us to fine-tune some of the models, as is the case with the fineTunedPharmaNER model, which is explained in more detail in Section 4.

Pharmaceutical Text Detection
At the beginning, the platform classifies whether a given text is from the pharmaceutical domain, and only the positively classified texts are accepted for further analysis. The classification model used in this step is a transfer-learned BERT model, fine-tuned with a corpus of 5,000 documents from the pharmaceutical domain as positive samples, and general news documents as negative samples. 70% of these documents are used for fine-tuning the BERT and XLNet models, and their precision, recall and F1 measure are evaluated with the remaining 30% of the documents. Table 1 shows the results obtained by the fine-tuned models.

Figure 1: Platform workflow, available via the public instance of the platform [34].

Pharmaceutical Named Entity Recognition
Each correctly classified pharmaceutical text is further analyzed by recognizing combined entities through the proposed models, as well as by using BioBERT for the detection of BC5CDR and BioNLP13CG tags [36], which include Disease, Chemical, Cell, Organ, Organism, Gene, etc. Additionally, we use a fine-tuned BioBERT model in order to detect Pharmaceutical Organizations and Drugs, entity classes that are not covered by the standard NER tasks. We explain the fine-tuning process in more detail in Section 4. Tag collisions when combining the results from both models are avoided by giving precedence to the tags recognized by our fine-tuned model over the tags recognized by BioBERT's model (Simple Chemical). All of the recognized entities are visualized in the sentence, along with their respective tags.
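The tag-precedence rule for combining the two models can be sketched as a simple span-merging function (an illustrative sketch, not the platform's actual data structures; spans are assumed to be character-offset tuples):

```python
def merge_entity_tags(fine_tuned_spans, biobert_spans):
    """Combine spans from two NER models, giving precedence to the
    fine-tuned model whenever spans overlap (e.g. a drug name that
    BioBERT tags as Simple Chemical).

    Each span is a (start, end, tag) tuple with an exclusive end offset.
    """
    merged = list(fine_tuned_spans)
    for span in biobert_spans:
        start, end, _ = span
        # Keep a BioBERT span only if it does not overlap any fine-tuned span.
        if all(end <= s or start >= e for s, e, _ in fine_tuned_spans):
            merged.append(span)
    return sorted(merged)
```

Overlapping BioBERT spans are simply dropped, so a token tagged as both Drug (by the fine-tuned model) and Simple Chemical (by BioBERT) keeps only the Drug tag.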

Co-reference Resolution and Semantic Role Labeling
The recognized entities serve as a baseline for finding all of their mentions in the entire text, by applying co-reference resolution in the background and replacing each mention ("it", "its", "his", etc.) with the respective entity. Libraries such as AllenNLP, StanfordNLP [37] and NeuralCoref provide implementations of algorithms for co-reference resolution, focused on the CoNLL-2012 shared task [30]. Our platform utilizes the NeuralCoref library for co-reference resolution due to its high accuracy, its ease of integration compared to StanfordNLP, and its capability to take into account user-specific information and the speakers in a conversation.
Once the mentions in the text are replaced with their respective entities, the final task includes labeling the semantic roles in each sentence. This is performed by using the BERT-based algorithm for semantic role labeling [15]. Then, the concrete arguments, like subject and object, as well as modifier arguments like temporal, location, instrument, etc., are visualized in a sequential manner for quick understanding.
The result is a modular platform for pharmaceutical text analysis, which uses existing state-of-the-art models for entity recognition, as well as fine-tuned models for recognizing custom entities like Pharmaceutical Organization and Drug. The modular design of the platform enables a combination of results from multiple models which recognize a vast range of entities. It also allows for semantic role labeling and visualization of each entity and its respective mentions in the text, by using state-of-the-art algorithms implemented by popular libraries. The entire analysis can be exported in JSON format, allowing it to be used for additional processing such as question answering, text summarization, fact extraction, etc.

Knowledge Graph Generation
As a final step, we annotate the entire text using the state-of-the-art knowledge extraction system DBpedia Spotlight [38]. The obtained results are then enriched with additional RDF facts which we construct from the identified Pharmaceutical Organization and Drug entities. This enriched knowledge graph is then available for further use within or outside the platform.

Entity Recognition for Pharmaceutical Organizations and Drugs
Our methodology starts with a text corpus from the pharmaceutical domain and a closed set of entities that belong to a given class. In our case, we are using entities that denote Pharmaceutical Organizations and Drugs. Using only these two prerequisites, we show that we can train models that can extract even unseen entities from the class of interest. Figure 2 visualizes the whole process. First, we start with the text corpus from the pharmaceutical domain that potentially contains the entities from the class of interest. This corpus consists of news collected from the following pharmacy-related websites: FiercePharma, Pharmacist and Pharmaceutical Journal. Next, we tokenize the text to extract the words, and then we try to annotate each word with respect to the set of entities of the required type. We utilize cosine similarity and Levenshtein distance in particular [39], to check whether the word is similar to some of the entities. The annotation process assigns start- and end-positions for each token in the text. Once we are done with this phase, we have initialized a labeled dataset, denoted as MD.
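The similarity check used in this annotation step can be sketched as follows (a minimal illustration; the function names and token-level matching are our simplification, and the actual pipeline additionally records character start- and end-positions):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[len(b)]

def similarity(a, b):
    """Normalized, case-insensitive similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a.lower(), b.lower()) / max(len(a), len(b))

def annotate(tokens, entities, threshold=0.9):
    """Return the indices of tokens similar enough to a known entity."""
    return [i for i, tok in enumerate(tokens)
            if any(similarity(tok, e) >= threshold for e in entities)]
```

With the 0.9 threshold used in our experiments, near-exact spellings match while unrelated words do not.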

Creating a Labeled Dataset
One of the main challenges is that the Pharmaceutical Organization entity type can be found in a given text as a multi-word phrase, such as Sanofi Pharmaceuticals Ltd. Spain, or as a single word: Sanofi. Additionally, the name of the Pharmaceutical Organization can contain pharmacy-related keywords, such as Pharmaceuticals, Pharma, Medical, Biotech, etc., which are not part of the core name of the organization, and can either be found along with it in the sentence, or not at all. This means that we should not classify the countries, legal entities, and pharmacy-related words as parts of the Pharmaceutical Organization type. Therefore, the annotation process sequentially performs use-case-specific token filtering during the creation of the MD dataset. This is done by using a non-entity list which contains all tokens that should be ignored. In our case, this list contains all countries in the world, together with the legal entity types for companies ("Ltd", "Inc", "GmbH", "Corp", etc.) and pharmacy-related words. After filtering out the tokens from the non-entity list, only Sanofi will remain in our example, and we can be certain that the core name is thoroughly extracted. After matching the core name in the text, we use the same lists to detect neighboring tokens of a multi-token name, if any, as part of the organization name, using text similarity metrics.
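The core-name filtering described above can be sketched as follows (the non-entity list here is an illustrative subset, not the full list of countries, legal entity types and pharmacy-related words used in the actual pipeline):

```python
# Illustrative subset of the non-entity list: countries, legal entity
# types and pharmacy-related keywords that must not count as part of
# the core organization name.
NON_ENTITY = {"ltd", "ltd.", "inc", "inc.", "gmbh", "corp", "corp.",
              "spain", "pharmaceuticals", "pharma", "medical", "biotech"}

def core_name(phrase):
    """Strip non-entity tokens from a candidate organization phrase,
    keeping only the core name."""
    kept = [t for t in phrase.split() if t.lower() not in NON_ENTITY]
    return " ".join(kept)
```

For the example in the text, "Sanofi Pharmaceuticals Ltd. Spain" reduces to "Sanofi".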
After the application of the custom, use-case-related filtering, the MD dataset consists of the core entities that have high text similarity. Only the entities which have similarity above the customized threshold are labeled as members of the target class. In our experiments, we use a similarity threshold of 0.9. Some Pharmaceutical Organization entities consist of multiple consecutive tokens, such as J & J. We solve this by concatenating consecutive relevant tokens, using a custom function applied to the MD dataset.
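The token-concatenation step can be sketched as follows (a hypothetical helper; `is_relevant` stands in for the text-similarity check against the entity set):

```python
def merge_consecutive(tokens, is_relevant):
    """Concatenate runs of consecutive relevant tokens into single
    multi-token entity candidates (e.g. ["J", "&", "J"] -> "J & J")."""
    merged, run = [], []
    for tok in tokens:
        if is_relevant(tok):
            run.append(tok)
        else:
            if run:
                merged.append(" ".join(run))
                run = []
            merged.append(tok)
    if run:
        merged.append(" ".join(run))
    return merged
```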
After applying all custom text processing functions, the state of the MD dataset is as shown in Table 2.

Model Fine-Tuning
The MD dataset is then used to train a model capable of extracting the named entities of the given class.
Since NER models take into consideration the context in which the entities appear in a sentence, the training dataset is not required to contain a huge number of diverse entities. Here we improve the general knowledge language model for the more specific task, using small or moderate amounts of labeled data.
In our case, we fine-tune spaCy, AllenNLP, BERT and BioBERT models. However, each of these models requires a different data format. SpaCy requires an array of sentences with the tagged entities for each sentence and their start- and end-positions. AllenNLP requires a dataset in BIOUL or BIO notation, which differentiates the following token annotations:
• multi-word entity beginning token (B),
• multi-word entity inside token (I),
• multi-word entity ending token (L),
• single-token entity (U),
• non-entity token (O).
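The BIOUL encoding, as well as the simpler labeling scheme used for BERT and BioBERT, can be sketched as follows (spans are token-index pairs with an exclusive end; the PH_ORG label follows the naming used in this paper):

```python
def to_bioul(tokens, spans, label="PH_ORG"):
    """Encode entity spans (start, end token indices, end exclusive)
    into BIOUL tags for AllenNLP-style training data."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"        # single-token entity
        else:
            tags[start] = f"B-{label}"        # beginning token
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"        # inside tokens
            tags[end - 1] = f"L-{label}"      # ending token
    return tags

def to_bert_labels(tokens, spans, label="PH_ORG"):
    """BERT/BioBERT variant: every entity token gets I-<label>,
    regardless of the number of tokens; all others get O."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        for i in range(start, end):
            tags[i] = f"I-{label}"
    return tags
```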
The dataset adapted for BERT and BioBERT labels the entities with I-PH_ORG, regardless of the number of tokens, while all other tokens are marked with O.
Therefore, we use different dataset serializers to output the training and test datasets for the fine-tuning process, in the required format.
The same methodology is used for creating labeled datasets for the Drug entity type. In this case we use the same text corpus, but this time annotated with a considerably larger set of Drug entities.
Once the fine-tuning process is done, we have named entity recognition models able to extract the entities of a given type.

Evaluation
The accuracy of our proposed approach is assessed using a pharmacy-related news dataset, which consists of 5,000 news articles. The Pharmaceutical Organization entity set consists of 3,633 unique values, while the Drug entity set consists of 20,266 unique drug brand names. These sets were extracted and published as part of our previous work [3, 4].
The evaluation is performed in two distinct scenarios for both entity classes. In the first evaluation, we split the news dataset into training and test portions, with sizes of 70% and 30% respectively, with no consideration of the distribution of the entities within them. This scenario aims to check the overall precision of the fine-tuned model. In the second evaluation scenario, we evaluate the generalization ability of our approach. Here, we split the training and test portions based on the entities they contain, such that there is no entity overlap between them. To do so, we extract the documents that contain 30% of the entities as the testing portion, and use the remaining news articles for training. However, the testing portion constructed this way contained more than 30% of the overall news articles. Therefore, in order to achieve a 70% - 30% ratio between the training and test portions, the test portion was reduced to contain exactly 30% of the news articles, while in the rest of the documents the entities were replaced with other entities which do not belong to the entity set used in the testing portion. The fine-tuned models for detecting Pharmaceutical Organization entities using spaCy, AllenNLP, BERT and BioBERT were tested accordingly, and the results were compared to the original models before fine-tuning, where the task was the extraction of Organization entities. The results are given in Table 3, indicating that the fine-tuned models achieve a significantly higher F1 score compared to the original models. We can also note that AllenNLP outperforms spaCy in this NER task, a result that can be attributed to the different neural architectures used by the two libraries, while the BERT model outperforms both. However, BioBERT, pre-trained on biomedical text, slightly outperforms BERT in every evaluation.
Even though the pre-trained models take into consideration the sentence context in which the entities appear, we can evaluate the generalization capability of the fine-tuned models by creating a test dataset that contains only entities that were not seen during training. To achieve this, we use the joint dataset of pharmacy-related news and randomly sample entities to achieve a 70% - 30% split ratio between training and test datasets, where the test dataset contains entities not encountered in the training dataset.
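The entity-disjoint splitting can be sketched as follows (a simplified version; the procedure described above additionally replaces held-out entities in overflow documents to keep the 70% - 30% ratio, which this sketch omits):

```python
import random

def entity_disjoint_split(docs, test_ratio=0.3, seed=42):
    """docs: list of (text, entity_set) pairs. Hold out a random sample
    of entities and route every document mentioning any of them to the
    test portion, so that train and test share no held-out entities."""
    rng = random.Random(seed)
    all_entities = sorted({e for _, ents in docs for e in ents})
    k = max(1, int(len(all_entities) * test_ratio))
    held_out = set(rng.sample(all_entities, k))
    train = [d for d in docs if not (d[1] & held_out)]
    test = [d for d in docs if d[1] & held_out]
    return train, test, held_out
```

By construction, no training document mentions a held-out entity, which is exactly the property needed for measuring generalization to unseen entities.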
SpaCy, AllenNLP, BERT and BioBERT models were also trained using these datasets, and the results are given in Table 4. To better visualize the accuracy, Fig. 3 shows a sentence extracted from pharmacy-related news where the Pharmaceutical Organization entities are recognized as expected. SpaCy, AllenNLP, BERT and BioBERT models were also created for recognizing Drug entities in texts. The evaluation results are given in Table 5 for the scenario where the same Drug entity can be present in both the training and the test dataset, while Table 6 shows the results when the test dataset does not contain any of the entities used in the training phase. Again, the train-test dataset ratio is 70% - 30%. To better visualize the accuracy, Fig. 4 shows a sentence extracted from pharmacy-related news, where the Drug entity is recognized as expected.

Knowledge Graph Generation and Enrichment
As a final step in the pipeline, we want to generate an RDF knowledge graph (KG) with the knowledge extracted in the previous steps. One way to create a general-purpose knowledge graph is to use a tool such as DBpedia Spotlight [38], which performs recognition of interlinked entities in the DBpedia knowledge graph. So, in theory, it can be used to recognize the drugs and pharmaceutical organizations in the texts of interest, and correctly annotate them with their semantic type. However, our experiments showed that the annotated entities are of more general types, such as schema:Organization or dbpedia:Company. In addition, most drug entities referenced by their brand names are not annotated at all. Therefore, we decided to use the results obtained so far by the pipeline described in the previous sections to expand the knowledge graph generated by DBpedia Spotlight with more specific types: schema:MedicalOrganization for the recognized pharmaceutical organizations, and schema:Drug and dbpedia:Drug for the recognized drugs.
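The enrichment step can be sketched as emitting additional Turtle triples that type the recognized entities (an illustrative sketch; the schema.org namespace is standard, while mapping dbpedia:Drug to the `dbo:` ontology prefix is our assumption):

```python
def enrich_turtle(entities):
    """Emit Turtle triples typing recognized entities, to be appended
    to the graph produced by DBpedia Spotlight.

    `entities` maps a DBpedia resource URI to "org" (pharmaceutical
    organization) or "drug".
    """
    lines = [
        "@prefix schema: <http://schema.org/> .",
        "@prefix dbo: <http://dbpedia.org/ontology/> .",
    ]
    for uri, kind in sorted(entities.items()):
        if kind == "org":
            lines.append(f"<{uri}> a schema:MedicalOrganization .")
        else:
            lines.append(f"<{uri}> a schema:Drug, dbo:Drug .")
    return "\n".join(lines)
```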
To properly test the benefits of this knowledge graph enrichment, we decided to apply the technique on the test set containing texts with entities that were not seen while training the named entity recognition models. The results show an average expansion of 47.69% over the knowledge graph originally generated by DBpedia Spotlight. Figure 5 shows an example knowledge graph for a given input text, extracted using the DBpedia Spotlight annotation tool (left), and the enriched knowledge graph with additional knowledge about MedicalOrganization and Drug entities (right). The additional RDF triples are highlighted.
Figure 6 shows the overall knowledge enrichment obtained by our system for the test dataset. It presents the ratio between the number of texts and the percentage of knowledge enrichment. This overview indicates a normal distribution of the enrichment over the test set. The knowledge graph generated and enriched as part of the pipeline can then be used for other purposes within or outside the platform. We are currently providing an RDF output in Turtle syntax.

Discussion
The platform presented in this paper emphasizes a methodology for combining the best-performing NLP models and adapting them for use in a new domain. We use a modular approach, where each model is a separate phase in the knowledge extraction pipeline, which allows for an easy upgrade with new and potentially superior models, thereby improving the performance of the entire platform.
In contrast to [11, 12, 32, 33], the goal of our platform is to provide a knowledge extraction solution for the pharmaceutical domain that brings the state-of-the-art NLP achievements closer to the people who analyze large amounts of texts. The PharmKE platform is human-centric, meaning that it is designed to be used primarily by people who need to extract the knowledge. The outcome of each phase is visualized, which enables the users to better understand the process of capturing and linking this knowledge. Since the web browser may not be the most convenient tool for domain experts to use in the process of knowledge extraction, especially when they analyze texts from various sources, we are also publishing an Application Programming Interface (API) that exposes the results from our platform to other applications. With this, we enable the development of editor plugins which will potentially extract and visualize the knowledge in the tools that experts already use on a daily basis.
In the current version of the PharmKE platform, we fine-tuned the Named Entity Recognition module to extract two additional entity types, namely Pharmaceutical Organization and Drug, on top of the entity types already recognized by the superior BioBERT model. For the fine-tuning phase, we show a method for automatically creating the training set for the recognition of Pharmaceutical Organizations and Drugs, using a text corpus from the pharmaceutical domain and a closed set of entity instances of the types of interest. The evaluation of the fine-tuned model showed that this methodology enables the recognition of entities not seen in the training set, which is a promising result.
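The core of the automatic training set creation can be illustrated as gazetteer matching: given a closed set of entity instances, their character spans are located in raw text and emitted as labeled examples in spaCy's training format. The organization and drug names below are hypothetical placeholders, and this is a simplified sketch of the idea rather than the platform's actual labeling code.

```python
# Sketch of gazetteer-based auto-labeling: find known entity instances in raw
# text and emit spaCy-style (text, {"entities": [...]}) training examples.
import re

# Closed set of entity instances of the types of interest (illustrative names).
GAZETTEER = {
    "Pfizer": "PHARM_ORG",
    "Novartis": "PHARM_ORG",
    "ibuprofen": "DRUG",
}

def label_text(text, gazetteer):
    """Return a spaCy-style training example with auto-detected entity spans."""
    entities = []
    for name, label in gazetteer.items():
        # Whole-word matches only, to avoid labeling substrings of other words.
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            entities.append((m.start(), m.end(), label))
    return (text, {"entities": sorted(entities)})

example = label_text("Pfizer markets ibuprofen in several regions.", GAZETTEER)
print(example)
```

Applying this over a domain corpus yields a labeled dataset without manual annotation; the fine-tuned model can then generalize beyond the closed set, as the evaluation on unseen entities suggests.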
The knowledge graph that we generate and enrich at the end of the pipeline aims to demonstrate the possibility of packaging and reusing the knowledge generated by the pipeline in other software solutions. Even though the platform is human-centric, generating an RDF knowledge graph as the final step in the process means that the results can be stored, shared, combined with other RDF knowledge graphs and (re)used programmatically, outside of the platform. The nature of RDF and knowledge graphs allows for an almost seamless combination of the results of the platform with other RDF data that exists publicly or internally in the user's environment.
The PharmKE platform is open to the continuous advancements in the NLP field. One crucial element of the knowledge extraction process that is not solved by the current models is the linking of the relations obtained by the SRL model with the corresponding properties in the knowledge graph. Our team will try to address this challenge in future research, as well as incorporate any model that achieves better results on some of the current tasks; all of this is possible thanks to the modular design of the platform. Another challenge will be cleaning the knowledge graph of erroneous conclusions made by the pipeline, a standard and expected problem in NLP.

Conclusion
In this paper, we present a modular platform [34,35] that incorporates state-of-the-art models for text categorization, pharmaceutical domain named entity recognition (NER), co-reference resolution (CRR), semantic role labeling (SRL) and knowledge extraction (KE). This platform is designed primarily for human users. PharmKE visualizes the results from each of the incorporated models, enabling pharmaceutical domain experts to better recognize the knowledge extracted from the input texts.
Our strategic goal is to keep the PharmKE platform up-to-date, and its modular design enables easy incorporation of new and potentially superior models. One step in this direction was our extension of the more recent BioBERT model for NER with the recognition of the Pharmaceutical Organization and Drug entity types.
The platform is also publicly available [34] and open-source [35], providing reproducibility of our results. This also means that other researchers can modify their own copy of the platform, run their own instances of it and even re-purpose it, thanks to its modular design.
A common issue when training custom models for language understanding tasks is the lack of labeled datasets for training and testing. To tackle this issue, we propose a methodology that automates the creation of labeled datasets for training custom entity tagging models. The methodology was assessed by training custom named entity recognition models using spaCy, AllenNLP, BERT and BioBERT, and the obtained results indicate that the newly trained models outperform the pre-trained models in detecting custom entities.

Future Work
Evaluating the performance of the proposed methodology on pharmaceutical texts gives satisfying results. However, better insight could be obtained by testing the methodology on texts from various contexts, which may or may not include entities from the pharmaceutical domain. This would let us evaluate the performance of the methodology in a generalized manner and compare the results to the current, task-specific evaluation, enabling its use in a variety of domains for training diverse models.
Shifting our focus towards the platform, the extracted semantic roles can be further parsed into RDF triples which comprise a knowledge graph. As part of future work, a platform optimization is planned that would enable maintenance of the knowledge graph in the background, continuously enriching it with every text analysis performed by the platform.
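One straightforward way to parse semantic roles into triples is to map a PropBank-style SRL frame onto a subject/predicate/object pattern. The sketch below assumes frames keyed by ARG0, V and ARG1, which is a common SRL output convention; the frame content is a made-up example, and a production mapping would need to handle many more frame shapes.

```python
# Sketch of mapping an SRL frame (ARG0 / verb / ARG1) to an RDF-style triple.
# Frame keys follow PropBank-style conventions; the example frame is invented.
def srl_frame_to_triple(frame):
    """Convert {'ARG0': ..., 'V': ..., 'ARG1': ...} into (s, p, o), or None."""
    if not all(k in frame for k in ("ARG0", "V", "ARG1")):
        return None  # frames without agent, verb and patient are skipped
    return (frame["ARG0"], frame["V"], frame["ARG1"])

triple = srl_frame_to_triple({"ARG0": "Pfizer", "V": "acquired", "ARG1": "Wyeth"})
print(triple)
```

The open problem noted in the Discussion, linking the verb to a proper ontology property rather than using its surface form as the predicate, starts exactly at the second element of this tuple.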
The presence of a knowledge graph in the system will enable easy access to and extraction of facts by performing simple queries over the graph; going further, it can be interconnected with other relevant knowledge graphs, either the user's own or public ones.
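Such fact extraction amounts to triple-pattern matching, the operation underlying SPARQL queries. The toy matcher below uses None as a wildcard, playing the role of a SPARQL variable; the graph content and prefixed names are illustrative, not actual platform output.

```python
# Illustrative sketch of "simple queries over the graph": a tiny triple-pattern
# matcher, with None acting as a wildcard (akin to a SPARQL variable).
TYPE = "rdf:type"

graph = {
    ("dbr:Aspirin", TYPE, "schema:Drug"),
    ("dbr:Bayer", TYPE, "schema:MedicalOrganization"),
    ("dbr:Bayer", "schema:manufactures", "dbr:Aspirin"),
}

def query(graph, s=None, p=None, o=None):
    """Return triples matching the pattern; None matches anything."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

# Which entities are typed as Drug?
# (SPARQL analogue: SELECT ?s WHERE { ?s rdf:type schema:Drug })
drugs = [s for s, _, _ in query(graph, p=TYPE, o="schema:Drug")]
print(drugs)
```

Once the graph is serialized as RDF, a standard triple store would answer the same patterns via SPARQL, including federated queries against public graphs such as DBpedia.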

Figure 3: Detecting Pharmaceutical Organization entities in text.

Figure 5: Original knowledge graph generated by DBpedia Spotlight (left) and the expanded knowledge graph (right). The additional RDF triples are highlighted.

Figure 6: Distribution of knowledge graph enrichment among the texts from the test set.

Table 2: State of MD after the application of the custom text processing functions.

Table 3: Evaluation of models trained on a dataset that contains known entities.

Table 4: Evaluation of the models on previously unseen entities.

Table 5: Evaluation of models trained on a dataset that contains known entities.

Table 6: Evaluation of the models on previously unseen entities.