1. Introduction and Motivation
Medical Information Retrieval (IR) is crucial for the development of Clinical Practice Guidelines (CPGs), which are essential tools used daily by medical professionals to support clinical decision-making [
1]. Developing these guidelines requires medical experts to process vast amounts of unstructured, domain-specific data [
2]. In this context, Named Entity Recognition (NER) has emerged as a vital tool for enhancing the accessibility and extraction of relevant information from such data [
3]. By identifying Named Entities (NEs) within unstructured, natural language texts, domain experts can more efficiently extract pertinent information for the development of CPGs [
2]. Modern NER techniques, tailored to specific knowledge domains like medicine, necessitate the training of Machine Learning (ML) models with extensive datasets to optimize performance [
4,
5].
Cloud technologies, such as
Microsoft Azure, offer dynamic and scalable infrastructures that facilitate the provisioning and processing of large datasets, essential for training NER models and enabling advanced information management and extraction [
6]. Building on the introduction of NER, ML, and Cloud as foundational techniques for effective IR in the medical field, this research is motivated by various projects.
RecomRatio [
7] is a European research and development project designed to assist medical professionals in decision making by presenting treatment options along with their respective advantages and disadvantages. It leverages the Content and Knowledge Management Ecosystem Portal (KM-EP) Knowledge Management System (KMS) to facilitate evidence-based patient decisions derived from the medical literature. KM-EP was developed in collaboration between the Fernuniversität Hagen’s Faculty of Mathematics and Computer Science, Chair of Multimedia and Internet Applications [
8], and the FTK e.V. Research Institute for Telecommunications and Cooperation [
9]. The system streamlines the acquisition of new knowledge within medical documents by identifying emerging NEs [
10]. The Stanford Named Entity Recognition and Classification (SNERC) [
11] project concentrates on developing an NER system within the realm of Applied Gaming. It democratizes access to ML, enabling non-experts to identify NEs and classify texts using a rule-based expert system. However, SNERC is limited to Stanford CoreNLP as its underlying NER engine. Within the scope of the H2020 project proposal Artificial Intelligence for Hospitals, Healthcare & Humanity (AI4H3), a hub-based architecture was proposed [
12]. This architecture is designed not only to process medical records and patient data but also to act as an assistant in explaining medical decisions. It includes a central Artificial Intelligence (AI) hub and a Semantic Fusion Layer to store data and AI models. AI4H3 employs KlinSH-EP, which is based on the KM-EP system [
13]. The Cloud-based Information Extraction (CIE) [
6] project was started in the context of AI4H3 to support a cloud-based NER pipeline, enabling the processing of large datasets. The CIE architecture aims to allow various users to customize cloud resources and train NER models within the cloud environment. Embedded within CIE, the Framework-Independent Toolkit for Named Entity Recognition (FIT4NER) [
14] project seeks to assist medical experts in medical text analysis using different NER frameworks. This approach enables experimentation with multiple frameworks, comparison of results, and selection of the best method for the specific task at hand. Within the context of FIT4NER and CIE, the Cloud Resource Management for Named Entity Recognition (CRM4NER) system [
15] was developed to enable NER model training using cloud resources, with optimization through automated parameter search and context-dependent fine-tuning using a Large Language Model (LLM). However, CRM4NER is currently limited to the AWS Cloud. Additionally, the need to annotate training data in various formats is crucial for enhancing the compatibility of models developed with tools like CRM4NER [
16]. Therefore, this study focuses on streamlining the NER model development process within another widely used cloud environment, Azure, while supporting diverse data formats for NER model training in the medical domain. This leads to the following Research Questions (RQs), explored in both a bachelor’s thesis [
17] and a master’s thesis inspired by CIE and FIT4NER [
18]:
RQ1: How can a system be developed to support medical experts in creating training data and in training NER models within the Azure Cloud environment?
RQ2: How can medical experts be empowered to manage Azure Cloud resources for NER model training?
This work applies the Nunamaker methodology [
19], a well-established systematic approach to developing information systems. This methodology defines specific Research Objectives (ROs) across the phases of
Observation,
Theory Building,
System Development, and
Experimentation. Building on and extending the FIT4NER framework, this study applies these phases to develop and evaluate a cloud-based NER system for medical experts using Microsoft Azure.
Section 2 addresses the observation objectives, offering a critical review of the current state of research and identifying any gaps relevant to this study.
Section 3 focuses on theory building, creating fundamental models that enable medical experts to develop NER models in the Azure Cloud.
Section 4 details the system development objectives, including the creation of a prototype to assist in NER model development within Azure.
Section 5 covers the experimentation phase, discussing the execution and analysis of both quantitative and qualitative evaluations of the system. The concluding
Section 6 summarizes the findings and insights from this research.
2. State of the Art in Science and Technology
This section concentrates on the observation phase and RO1, which involves a “critical review of the current state of research”. The objective is to identify and discuss Remaining Challenges (RCs) in the areas covered by this article.
As part of the FIT4NER project, a process model was developed to assist medical professionals in extracting knowledge for the creation of CPGs using AI [
2]. This extraction relies on NER, which today often employs ML methods [
20]. The phases described in the model, “Data Management & Curation” and “Analytics”, include the development of training data, the evaluation and selection of appropriate NER frameworks, and the training of ML-supported NER models [
2]. Given that such models require large amounts of data, FIT4NER also leverages the CIE project, which provides use cases for cloud-based resources to train the NER models [
6]. However, medical experts face certain challenges when utilizing ML-supported NER: Firstly, generating high-quality training data is a significant challenge [
20]. Robust NER models require high-quality annotated training data [
16,
20]. Medical texts can exist in various formats such as PDF, DOCX, HTML, or plain text [
21,
22]. In addition, there are specialized formats designed for the exchange of medical information, including Digital Imaging and Communications in Medicine (DICOM) [
23], which is used primarily for medical images but also contains text in its metadata. Other common formats include Fast Healthcare Interoperability Resources (FHIR) [
24] and Consolidated Clinical Document Architecture (CCDA) [
25], which both provide standardized structures for organizing and sharing clinical data. These formats support efficient processing and analysis of medical information. However, to be used effectively in NER systems, these data must first be converted, annotated, and adapted to match the format requirements of the selected NER framework. Two widely used formats for this purpose are CoNLL-2003 [
26] and BioC [
27]. The CoNLL-2003 format, developed for a language technology competition, is used for structured annotated linguistic data, while BioC is an XML-based format specifically designed for representing and exchanging biomedical texts along with their annotations; a brief example of the CoNLL-2003 format is given below. Secondly, configuring different NER frameworks can be a hurdle for medical professionals due to the frameworks’ varying functions and interfaces [
14]. Thirdly, efficiently managing cloud resources for training ML-supported NER models is another significant challenge [
6,
28]. Cloud configuration in various environments such as Microsoft Azure, Amazon Web Services (AWS), or Google Cloud Platform (GCP) is complex and requires specialized knowledge [
28,
29]. Additionally, cost-effective resource management in the cloud is essential to avoid unnecessary expenses [
30]. Fourthly, medical data often reside in knowledge management systems like Clinical Decision Support Systems (CDSSs) [
31]. Therefore, integrating ML training functions for NER models into these systems would be beneficial for efficient use by medical professionals. These challenges are discussed in detail in the following sections.
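To make the format differences concrete, the following minimal sketch shows one annotated sentence in the CoNLL-2003 column format, with one token per line and columns for the token, its part-of-speech tag, its syntactic chunk tag, and its NE tag in IOB notation. The chemical label B-CHEM is a hypothetical domain-specific label chosen for illustration; the original CoNLL-2003 task used labels such as PER, LOC, ORG, and MISC.
Aspirin NNP B-NP B-CHEM
inhibits VBZ B-VP O
cyclooxygenase NN B-NP B-CHEM
. . O O
In BioC, by contrast, the same sentence would be stored as XML, with the text in a passage element and each entity as an annotation element whose character offsets and entity type are given in its location and infon elements.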
To support medical experts in generating high-quality training data, the
Prepare4NER system was developed [
32]. Prepare4NER assists medical professionals in creating and maintaining extensive training and test datasets across various text formats and NER frameworks [
32] and allows for the easy integration of new NER frameworks through customizable converters. For these reasons, Prepare4NER was selected for creating training data in this work. The first remaining challenge (
RC1) is to integrate Prepare4NER within the Azure Cloud environment. To apply NER for the development of CPGs, various frameworks can be employed [
2]. Recent advancements in transformer models and LLMs like GPT-4 have significantly enhanced NER, document classification, and IR by leveraging large datasets and sophisticated neural architectures [
33]. Platforms like HuggingFace offer pre-trained language models, including those for NER, available for free download [
34]. spaCy is an NER framework that supports these advancements, delivering state-of-the-art performance while being open-source and freely available [
20,
35]. It also allows for the use and fine-tuning of pre-trained models from HuggingFace for NER. Therefore, this work utilizes spaCy and HuggingFace models. The second remaining challenge (
RC2) is to facilitate the use of spaCy with HuggingFace transformer models for medical experts. Several cloud providers offer services for training ML-based NER models [
6], with AWS, GCP, and Microsoft Azure being the most widely used [
29]. These platforms support the full spectrum of services needed for ML projects, including data storage, processing, and AI model development. Microsoft Azure [
36], in particular, is well-suited for integrating cloud computing into NER applications, as it supports the entire machine learning lifecycle. Since AWS was already used for ML-based NER training in the CIE project [
15], this study focuses on Azure to diversify platform usage and leverage its integrated tools and enterprise-level support. An additional emphasis is placed on the accessibility of Azure’s services for medical professionals, ensuring user-friendly tools and seamless access. While the platform offers cost transparency, cost optimization is not a primary concern in this study. The third remaining challenge (
RC3) is to enable medical experts to manage Azure cloud resources and their associated costs. As part of AI4H3 [
37], KlinSH-EP, based on KM-EP, was proposed as a knowledge management system for the medical field [
13]. KM-EP has been used in various research projects, making it well-suited for managing medical information [
7,
11,
13,
38]. It offers a unified user interface and supports essential functions like user and rights management. Integrating cloud-based training features for ML-based NER models into KM-EP would thus be beneficial for medical experts, representing the fourth remaining challenge (
RC4).
To evaluate the prototypes developed in this work, a suitable dataset is required. MIMIC-III [
39] is a comprehensive and freely available database containing anonymized health-related data from more than 40,000 patients. It includes various types of information, such as laboratory results, procedures, medications, and caregiver notes. However, since the corpus is not available in an annotated form for NER, it is not suitable for the experiments in this study. A suitable annotated dataset in the medical domain is the BioNLP Shared Task 2011 Supporting Task-Entity Relation (REL) dataset [
40], which contains annotated texts for various biomedical entities and relationships. However, the dataset is not provided in a standardized format, making its use for NER experiments more complicated compared to other datasets. The Colorado Richly Annotated Full Text (CRAFT) corpus [
41] is available online in standardized formats and comprises 97 medical articles from PubMed with over 760,000 tokens, each meticulously annotated to highlight entities such as genes, proteins, cells, cellular components, biological sequences, and organisms [
41]. Version 5 of the CRAFT corpus includes semantic annotations derived from Open Biomedical Ontologies (OBOs) and is divided into various modules, with this work focusing on the Chemical Entities of Biological Interest (CHEBI) module [
41]. Furrer et al. trained an NER model using the CRAFT-CHEBI corpus and applied various methods [
42], evaluating the models with the OGER system and conducting a comparative analysis of different BiLSTM and Transformer configurations, including BERT [
43] transformers. Their research provides a valuable benchmark for comparison in this work, with the results of NER training on the CRAFT-CHEBI corpus presented in
Table 1. Numerous models from HuggingFace Transformers excel in NER tasks. The first is BERT [
43], which generates deep bidirectional representations from unlabeled text by considering both left and right contexts. This pre-training enables straightforward fine-tuning with an additional output layer, leading to high performance on various tasks, including NER, question answering, and language inference [
43]. RoBERTa [
44] builds on BERT with key differences, such as longer training durations, larger batch sizes, and the removal of the next-sentence prediction objective, allowing for longer text sequences [
44]. GPT-2 [
45] is pre-trained on a large dataset of English text in a self-supervised manner, predicting the next word in a sentence based on continuous text sequences. Finally, ClinicalBERT [
46] is a variant of BERT tailored for clinical text, leveraging its bidirectional transformer architecture and trained on a large corpus of clinical notes [
46]. All these established Transformer models are easily accessible, well documented, and have been utilized in comparable studies, such as those by Furrer et al. [
42]. For these reasons, they were selected for the evaluation experiments in this work.
In this section, the current state of the art relevant to this work was critically reviewed. Four remaining challenges were identified. To assist medical experts in creating high-quality training data for ML-based NER models, the Prepare4NER system will be integrated (RC1). Subsequently, medical professionals will be enabled to configure spaCy and HuggingFace transformer models for training with the generated data (RC2). Then, Azure cloud resources will be configured and utilized for the training (RC3). The knowledge management system KM-EP provides a suitable basis for managing medical data. Therefore, functions for ML-based NER model training by medical experts should be integrated into the system (RC4). The following section describes in detail the creation of suitable models that address all identified remaining challenges (RC1–4).
3. Conceptual Modeling and Design
After conducting a thorough examination of the current state of the art in science and technology, this section focuses on the theory-building phase and aims to address RO2, “The creation of fundamental models for a system that allows medical experts to develop NER models in the Azure Cloud”. This is achieved by taking into account the RCs discussed in
Section 2, reflecting the latest advancements in science and technology. The User-Centered System Design (UCSD) method [
47] is used for design and conceptual modeling, while Unified Modeling Language (UML) [
48] is used as the specification language.
The application context of this work is embedded in the use cases of FIT4NER [
14] and CIE [
6], as illustrated in
Figure 1. FIT4NER use cases are highlighted in green, while CIE use cases are shown in blue. This work uses the
Model Definition User stereotype defined in FIT4NER, which denotes an expert in the relevant knowledge domain who lacks technical knowledge of NER model training [
14]. The initial focus is on the FIT4NER use case “UC1.1 Support Data Preparation”, which leads to the development of a system called “Prepare4NER” [
32].
Figure 1 illustrates the individual use cases of Prepare4NER in gray. This includes the use cases “UC1.1.1 Visualize Data”, “UC1.1.2 Clean Data”, “UC1.1.3 Annotate Data”, and “UC1.1.4 Split Data”. In the use case of “UC1.1 Support Data Preparation”, various document formats are converted to the BioC format [
27]. Data can be visualized using the use case “UC1.1.1 Visualize Data” and cleaned in the “UC1.1.2 Clean Data” use case. In the “UC1.1.3 Annotate Data” use case, domain experts can perform domain-specific annotations or correct existing annotations. The “UC1.1.4 Split Data” use case involves splitting the data into training and testing documents. Additionally, the use case “UC1.2.1.1 Store in Cloud” of CIE [
6] is extended to make the data available for model training in the cloud. Another system that focuses on model training in the Azure Cloud using the spaCy framework is called
“spaCy NER at Azure Cloud (spaCNAC)”. The main use cases in orange in
Figure 1 are “UC1.2.4 Select Pipeline Parameters” and “UC1.2 Support Model Training”, with the specialization “UC1.2.2 Support Transformer-based Model Training” to allow fine-tuning of transformer models. The use case “UC1.2.2.2.1 Select Pipeline Parameters” supports domain experts in configuring the ML training pipeline. The “UC1.2.3 Train in Cloud” use case is responsible for training the models and also includes support for domain experts in managing cloud resources in the “UC1.2.3.1 Manage Cloud Resources” use case. The trained models are ultimately stored in the cloud for further use in the respective applications, in the use cases “UC1.2.1 Store Model” and “UC1.2.1.1 Store in Cloud”.
Figure 2 illustrates the primary components of Prepare4NER, divided into a frontend and a backend. For an in-depth explanation of the Prepare4NER theory building, refer to [
32]. The
Data Preparation View (
Figure 2) is the user interface integrated into KM-EP for managing and annotating datasets. The main tasks of the
Data Preparation Subsystem include converting various document formats into the BioC format to establish a uniform data foundation and performing data cleaning to correct inconsistent or erroneous data. Additionally, it facilitates manual annotation by domain experts to label specific NEs for NER model training. To ensure good maintainability, all its components are designed uniformly. They implement an HTTP interface and adhere to a layered architecture. The Data Preparation Controller component delegates requests to two subsystems: the
Document-import Subsystem and the
Corpus-import Subsystem (
Figure 2). The Document-import Subsystem is responsible for parsing and converting documents from various formats into the BioC format, while the Corpus-import Subsystem handles similar tasks for existing corpora. The Controller components,
DocumentImporterController and
CorpusImporterController, employ the Strategy Pattern [
49] (p. 315) in the form of a REST interface to ensure the interchangeability of components as microservices. Thus, extending support for new data formats (such as spaCy JSON, BRAT, etc.) for NER frameworks is limited to the implementation and registration of the corresponding import methods as microservices. Furthermore, the Data Preparation Controller utilizes the Dataset Storage component to store prepared data for further processing. The
Data Provision Subsystem (
Figure 2) manages the conversion of annotated BioC documents into annotation formats compatible with various NER frameworks. It provides the
DataProvisionIF interface, enabling NER model training services such as spaCNAC to consume the annotated documents. As new frameworks emerge, the need to support additional formats may arise. This subsystem mirrors the structure of the Document-import and Corpus-import Subsystems, enabling seamless integration of new strategies through the implementation of converters for these formats. The
Dataset Storage Subsystem consists of the
StorageController, which is responsible for the management of datasets and documents, their BioC representation, as well as the storage and management of NE label collections and their labels (
Figure 2). The
DatasetSplitter Component is responsible for dataset splitting, and the
DataExportController provides the annotated data to the
DataProvision Controller.
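To illustrate how a new import strategy can be added as a microservice, the following minimal sketch shows a converter exposing the kind of REST interface described above, implemented in Python with FastAPI. The endpoint name and payload fields are illustrative assumptions, not the actual Prepare4NER interface.
# Minimal sketch of an import strategy as a FastAPI microservice.
# Endpoint and field names are illustrative, not the actual Prepare4NER API.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ImportRequest(BaseModel):
    document: str  # raw document content, e.g., HTML or plain text
    format: str    # source format identifier, e.g., "html"

class BioCResponse(BaseModel):
    bioc_xml: str  # document converted to the uniform BioC representation

@app.post("/convert", response_model=BioCResponse)
def convert(request: ImportRequest) -> BioCResponse:
    # A real strategy would parse request.document according to request.format
    # and emit valid BioC XML; this stub simply wraps the text.
    bioc = ("<collection><document><passage><text>"
            + request.document +
            "</text></passage></document></collection>")
    return BioCResponse(bioc_xml=bioc)
Registering the endpoint URL of such a service in the configuration of the respective controller would then suffice to support a new format without changing existing components.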
The spaCNAC components (
Figure 3) are divided into two microservices: the frontend and the backend. The
Model Definition Manager View (
Figure 3) serves as a central interface for managing model definitions, which set the configuration for NER model training. The view allows users to create new model definitions, modify existing ones, initiate training, or remove model definitions. When creating a model definition, the view guides the user through a step-by-step process. This process includes the selection of a prepared corpus with the
Corpus UI component, the setup of the model architecture and hyperparameters with the
spaCy Configuration UI component, and the configuration of Azure cloud resources with the
Azure Cloud Resources UI component. The
spaCy Azure NER Service Subsystem (
Figure 3) serves as the backend for the Model Definition Manager View and contains the business logic of the system. The service receives all user interactions through a REST API, processes them with the corresponding manager components (
Corpus Manager,
NER Model Definition Manager,
Cloud Training Config Manager), and returns the results to the view. After the completion of the step-by-step process, the model definition is sent to the
NER Model Deployment Service component, which validates the model definition and transmits the configuration to the
Azure ML Pipeline Subsystem (
Figure 3). There, the
Cloud Deployment Manager component generates the required cloud resources using the
Storage Service and
Compute Service components, as well as an Azure ML pipeline that contains the model definition and configuration for training an NER model. The
Model Manager Controller Subsystem (
Figure 3) uses the
NER Model Deployment Service to retrieve the trained NER model along with the metadata generated during training, such as resource utilization or metrics, and store them in the
Model Registry Subsystem (
Figure 3). Additionally, the Model Manager Controller Subsystem serves as the backend for the
Model Manager View (
Figure 3), which provides central management of all pipelines and NER models. The Model Manager View maintains a comprehensive inventory of all trained NER models and displays all training pipelines executed by the NER system. This includes information such as current status, initiation and completion times, total runtime in minutes, model description, and pipeline description.
Building upon the CIE [
6] and FIT4NER [
14] general architectures, an integrated architecture is developed that incorporates the Prepare4NER and spaCNAC components. The architecture is depicted in
Figure 4, with the Prepare4NER components shown in gray and the spaCNAC components in orange. The architecture follows a layered structure based on the Model-View-Controller (MVC) pattern [
50], where the user interfaces, business logic, and data storage are separated into distinct components. The
Data Preparation View,
Data Preparation Controller, and
Dataset Storage components implement the “UC1.1 Support Data Preparation” use case and its associated use cases. The
Data Provision Controller component facilitates data provisioning for training using spaCy in the Azure Cloud, which is performed by the
spaCy Azure NER Service component. This is supported by another component called the
Azure ML Pipeline, which runs in the Azure Cloud and performs model training. The component
Model Definition Manager View handles the “UC1.2.4 Select Pipeline Parameter” and “UC1.2.2.1 Select Hyperparameter” use cases. After training, the models are stored using the
Model Registry component and can be retrieved from there for further fine-tuning. The components
Model Manager Controller and
Model Manager View are responsible for managing and displaying the models.
With the description of the theoretical structure in this section, RO2 has been achieved. The following section discusses the prototypical implementation of the developed components and their integration.
4. Implementation
Building on the theoretical framework established in
Section 3, this section delves into the system development of the Prepare4NER and spaCNAC systems. It addresses RO3: the development of a prototype system to assist medical experts in creating NER models in Azure. The section begins by describing the implementation of the Prepare4NER components, covering their functionality as well as the technologies, frameworks, and programming languages employed in their development. It also addresses architectural design, dependencies, interfaces, and implementation details, including specific challenges encountered during development. Next, the implementation of the spaCNAC components is explored, following a similar structure. Subsequently, the integration of the Prepare4NER and spaCNAC systems is outlined. The section concludes with a summary of how effectively the implementation supports medical experts and achieves RO3.
Prepare4NER focuses on text annotation and supports different NER frameworks’ data formats. The Prepare4NER components are implemented as separate microservices using Java, Python, Spring Boot, and FastAPI. These microservices feature a uniform design, HTTP interface, and layered architecture to ensure maintainability. The
Data Preparation Controller converts documents to the BioC format, cleans data, and enables expert annotation for NER model training. It provides a
Spring Boot @RestController, which delegates requests to the
DocumentImporter Controller and
CorpusImporter Controller (
Figure 2). These controllers employ the Strategy Pattern [
49] (p. 315), enabling flexible and extensible data processing. For task delegation, the
Data Preparation Controller implements a client class for each addressed controller, using Spring RestTemplate to generate HTTP requests. The endpoints of the subsystems are managed within the configuration of the
Data Preparation Controller. The
StorageController component manages datasets, documents, their BioC representation, and the storage of NE label collections and their labels (
Figure 2). Its
BioCDataset Storage sub-component uses a
LocalStorageProvider class to store BioC collection documents in a configurable folder on a local drive. This architecture supports the implementation of different storage provider classes, enabling the use of alternative storage solutions such as cloud storage. Datasets and documents are managed using Dataset Descriptor and Document Descriptor objects, which are stored in a dedicated database. These descriptor objects contain only metadata about the datasets and documents. The
NELabelStorage sub-component manages the NE label collections used for data annotation. An NE label collection consists of metadata and a list of NE labels. The
DatasetSplitter component, implemented in Python and utilizing the scikit-learn library [
51], is also part of this subsystem. It supports various split strategies, such as Random Split, KFold, and StratifiedKFold [
51], by interacting directly with BioC documents. The
DataProvision Controller (
Figure 2) is responsible for converting annotated BioC documents into the formats required by different NER frameworks. This is completed using its
DataConversion Controller sub-component, which applies the Strategy Pattern [
49] (p. 315). As new NER frameworks become available, the system can easily support them by adding new conversion strategies. Currently, two converters are available: the
CoNLL2003Converter, which transforms data into the CoNLL-2003 format [
26] used by frameworks such as spaCy [
35] and HuggingFace Transformers [
34], and the
CoreNLPConverter, which converts data into the Stanford CoreNLP format [
52]. Once converted, the
Dataset Export Controller distributes the data to various storage options, such as local drives or cloud services. To handle large datasets and ensure that the system remains scalable, Prepare4NER should be deployed in the cloud. The Azure Kubernetes Service [
53] offers features such as autoscaling and load balancing, which allow the application to dynamically adjust resources according to demand. However, some challenges may arise when processing very large datasets in real time. To reduce delays, efficient database solutions and caching mechanisms should be used. Additionally, when many users access the system at once (high concurrency), server overloads may occur. In such cases, horizontal scaling and the Azure Load Balancer can help distribute traffic evenly. Future work includes experiments to test scalability under different conditions and to develop strategies to address these challenges. The
DataPreparation UI is integrated into KM-EP and implemented using PHP Symfony controllers and Twig templates. The additional
DocumentEditor UI component, developed independently from KM-EP using the Angular framework in TypeScript, enables domain experts to annotate texts with predefined NE labels. This component also supports nested and overlapping annotations. Originating from a previous project, it is loosely integrated into KM-EP through a custom pop-up window. For a detailed description of the Prepare4NER implementation, refer to [
32].
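To illustrate the split strategies offered by the DatasetSplitter, the following minimal sketch applies scikit-learn to a list of document identifiers. The identifiers and ratios are illustrative; the actual component operates directly on BioC documents.
# Minimal sketch of the DatasetSplitter's split strategies with scikit-learn.
from sklearn.model_selection import KFold, train_test_split

document_ids = [f"doc_{i}" for i in range(10)]  # stand-ins for BioC documents

# Random Split strategy, e.g., 80% training / 20% testing
train_ids, test_ids = train_test_split(document_ids, test_size=0.2, random_state=42)

# KFold strategy: five rotating train/test partitions
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kfold.split(document_ids)):
    print(f"Fold {fold}: {len(train_idx)} training, {len(test_idx)} test documents")

# StratifiedKFold works analogously but additionally requires per-document
# labels so that the label distribution is preserved in each fold.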
spaCNAC empowers domain experts to develop NER models using spaCy in the Azure Cloud, leveraging the data created with Prepare4NER. The frontend is developed using ReactJS, HTML, and TypeScript, while the backend is implemented in Python and FastAPI. The backend interacts with Azure Cloud Services. Both the frontend and the backend are packaged in Docker containers. The Model Definition Manager View (
Figure 3) is built on React, integrates data from the spaCy Azure NER Service, and enables navigation between different sub-views for creating or modifying model definitions. The view takes the user through a step-by-step process when creating a model definition. In the first step, the user selects a prepared corpus that has been created, converted into the spaCy format, and divided into training and testing data using Prepare4NER. These data are then uploaded to an Azure Blob container. In the next steps, users set up the spaCy pipeline, choosing from various model types (
Figure 5) such as “Text to Vector Transformer”, “Named Entity Recognizer”, and “HuggingFace Transformers”. Specific options for embedding and encoder models are available, such as “spacy.MultiHashEmbed.v2” or “spacy.MaxoutWindowEncoder.v2”. Notably, the system provides an overview of available HuggingFace Transformers models to simplify the selection of a pre-trained model for NER training. Users can adjust hyperparameters that significantly influence the model’s performance through a dedicated section. This allows for the modification of six key hyperparameters. For experienced users, there is also an option to make advanced adjustments directly in the spaCy configuration file. After configuration, users select the cloud resources for training, with only compute clusters currently available. A tabular overview facilitates this selection by listing specifications and costs per hour for different clusters, along with sorting and filtering options. The clusters are categorized into three color-coded groups to aid in budget- and requirement-based selection (
Figure 6):
Green: less than $3 per hour; an economical choice.
Yellow: between $3 and $10 per hour; a mid-range option.
Red: more than $10 per hour; a high-end selection.
Once a suitable compute cluster is selected, the user can initiate the training process. The parameters chosen by the user for the training process are saved by the
NER Model Definition Manager in a MySQL database and then passed to the
Cloud Deployment Manager via the
NER Model Deployment Service (
Figure 3). The
Cloud Deployment Manager provisions an Azure ML pipeline and the necessary cloud resources in an Azure workspace for the training process, such as Azure Blob Storage. This requires three key input components: a spaCy configuration file, the training corpus, and the test corpus. After successfully training the NER model, the spaCy models are stored in the
Storage Service (Azure Blob Storage), designated as the output destination within the pipeline. To monitor the model training process, bidirectional communication is established between the Model Definition Manager View and the Azure ML pipeline using WebSockets. This allows for tracking of key performance indicators such as CPU usage, read/write operations, and disk utilization. Upon completion of the training, the trained NER model is stored in the
NER Model Registry.
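To sketch how such a training run can be submitted programmatically, the following example uses the Azure ML Python SDK v2 to launch spaCy training as a command job. All names (subscription, workspace, environment, data assets, and compute cluster) are placeholders and do not reflect the actual spaCNAC configuration.
# Minimal sketch: submitting spaCy NER training as an Azure ML command job.
# All resource names below are placeholders, not the spaCNAC configuration.
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient, command, Input

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

job = command(
    code="./src",  # folder containing the spaCy configuration file
    command=(
        "python -m spacy train config.cfg --output ./outputs "
        "--paths.train ${{inputs.train}} --paths.dev ${{inputs.dev}}"
    ),
    inputs={
        "train": Input(type="uri_file", path="azureml:train_corpus:1"),
        "dev": Input(type="uri_file", path="azureml:test_corpus:1"),
    },
    environment="azureml:spacy-gpu-env:1",  # Docker environment with spaCy
    compute="nc4as-t4-v3-cluster",          # GPU compute cluster
)

ml_client.jobs.create_or_update(job)  # provisions resources and starts training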
The
NER Model UI (
Figure 7) maintains an inventory of all trained NER models in the
NER Model Registry and displays pipeline details, including status, initiation and completion times, runtime, and descriptions. Specifically, the model description contains the metrics of the trained NER models, including Precision (P), Recall (R), and F-Score (F1) [
54]. The system allows users to download training results and trained models. Furthermore, it provides detailed descriptions of individual pipelines along with Key Performance Indicators (KPIs) for monitoring resource usage, identifying potential bottlenecks, and optimizing overall system performance. These KPIs include:
CPU utilization: The proportion of total CPU resources consumed by a specific process or collection of processes. Regular monitoring helps identify bottlenecks and optimize resource allocation.
CPU memory utilization: Percentage of total memory resources consumed by one or more processes. High utilization may indicate optimization needs.
Disk read/write operations: Quantifies the speed at which data is read from or written to the disk, which is crucial for identifying potential bottlenecks.
Disk usage: Amount of disk space consumed by a process or group of processes. Regular monitoring is important to ensure adequate availability and to avoid issues related to disk space exhaustion.
By utilizing these performance metrics, users can gain a comprehensive understanding of the cloud resources consumed during the NER model training process. These insights aid in future decision-making regarding model training exercises by improving pipeline efficiency and optimizing resource allocation, thus enabling more effective and efficient resource utilization.
Next, the implementation of the integrated overall system, combining the KM-EP [
13], Prepare4NER [
32], and spaCNAC systems, is summarized. KM-EP was developed using Symfony and PHP and provides basic functions such as user management and document management. It allows the integration of additional systems through HTTP and REST interfaces. The microservice architectures of Prepare4NER and spaCNAC allow seamless integration into KM-EP. The components were brought together into a coherent system using Docker Compose, which simplifies configuration and deployment in containers, increasing scalability and portability. The integration of Prepare4NER into KM-EP was carried out using Symfony controllers and Twig templates. The controllers process user actions and manipulate the data model. KM-EP forwards requests to the corresponding microservices. The designed data models are implemented as PHP classes and are passed to templates after subsystem calls, which visualize the data. In KM-EP, Prepare4NER is accessible through the navigation bar, while the DocumentEditorUI is opened in a new browser tab. spaCNAC uses Python and the FastAPI framework for backend operations, and TypeScript with React for a reactive frontend. Since different technologies are used, spaCNAC is seamlessly embedded into KM-EP using an iframe. spaCNAC is accessible through the KM-EP navigation menu via the entry “Named Entity Recognition spaCy-Azure”, which provides access to the main functions of the system. A distinction is made between model definition and model manager. The “spaCy NER Model Definition Manager” is integrated into the model definition page, and the “spaCy NER Model Manager” is integrated into the model manager page of KM-EP. To transfer data from Prepare4NER to spaCNAC, it was necessary to develop a new converter that transforms data from the BioC format to the spaCy JSON format. This shows that extending the system to accommodate new NER frameworks is straightforward using the Strategy Pattern [
49] (p. 315). The source code for the converter is shown in Listing 1. First, it is checked whether an empty document is present. The BioC document is read, and a spaCy-compatible object is created. The existing annotations are stored in a list of entities within the object after a validity check. The document is then serialized into JSON and returned. The prototype captures, logs, and reports errors that occur, such as those due to invalid or incomplete data, as a
RuntimeException. For production use, additional edge cases should be covered, and the behavior should be verified through unit tests; very large documents may also cause problems due to high memory usage. This converter enables the export of Prepare4NER datasets to spaCNAC, as illustrated in
Figure 8.
Listing 1. Java Implementation of the SpacyJsonConverter Class
@Service
public class SpacyJsonConverter {
private static final Logger logger = LoggerFactory.getLogger(SpacyJsonConverter.class);
public byte[] convert(String documentXml) {
if (documentXml == null || documentXml.trim().isEmpty()) {
throw new IllegalArgumentException("Input XML must not be null or empty.");
}
try {
// Parse the input BioC XML into a BioCDocument
BioCDocumentReader bioCDocumentReader = new BioCDocumentReader(new StringReader(documentXml));
BioCDocument bioCDocument = bioCDocumentReader.readDocument();
// Extract plain text and annotation entities from the BioCDocument
String text = extractText(bioCDocument);
List<Annotation> annotations = extractAnnotations(bioCDocument);
// Initialize spaCy-compatible document
SpacyDocument spacyDocument = new SpacyDocument(text);
for (Annotation annotation : annotations) {
int begin = annotation.getBeginIndex();
int end = annotation.getEndIndex();
// Validate annotation offsets
if (begin >= 0 && end <= text.length() && begin < end) {
spacyDocument.entities.add(new Object[]{begin, end, annotation.getLabel()});
} else {
logger.warn("Skipping invalid annotation: {}", annotation);
}
}
// Wrap in a list as spaCy expects a list of documents
List<SpacyDocument> spacyDocuments = new ArrayList<>();
spacyDocuments.add(spacyDocument);
// Convert Java object to JSON string with pretty printing
ObjectMapper mapper = new ObjectMapper();
String spacyJsonString = mapper.writerWithDefaultPrettyPrinter().writeValueAsString(spacyDocuments);
// Return UTF-8 encoded byte array
return spacyJsonString.getBytes(StandardCharsets.UTF_8);
} catch (Exception e) {
logger.error("Failed to convert BioC document", e);
throw new RuntimeException("Error converting BioC XML to spaCy JSON", e);
}
}
}
In this section, the implementation of the Prepare4NER and spaCNAC systems, as well as their integration into KM-EP, was detailed, completing RO3. The next section presents experiments that verify whether the created systems meet the set requirements.
5. Evaluation
The previous section described the implementation of the two prototypes, Prepare4NER and spaCNAC, as well as their integration. This section focuses on the experimental goal of RO4: the execution and analysis of quantitative and qualitative evaluations of the developed system. Qualitative and quantitative evaluation methods were employed. For qualitative experiments, two different approaches were used: the Cognitive Walkthrough (CW) [
55], a method for evaluating user interfaces, and the Software Technical Review (STR) process, an approach to improving software quality that is also established as an IEEE standard (IEEE Std 1028-2008) [
56]. A CW involves simulating and analyzing the cognitive processes of a user interacting with the user interface. Initially, tasks with specific goals and required action sequences are defined. Subsequently, an expert panel evaluates the expected user interaction, considering criteria such as action availability and labeling, the likelihood of correct action selection, and action complexity. In contrast, the STR process is a technique to analyze the technical aspects of a software product. A lead engineer organizes a review meeting and provides structured inputs, including information such as the software product, review objectives, planned functionality, and a description of the review process. During the review meeting, a group of experts examines the software product and documents all findings and anomalies. The quantitative experiments were evaluated using the metrics P, R, and F1 [
57]. Precision measures the proportion of predicted entities that are correct, Recall measures the proportion of actual entities that are found, and the F1-Score, the harmonic mean of Precision and Recall, balances the two. The section begins by detailing the experiments conducted with Prepare4NER. Then, it outlines the experiments performed with the spaCNAC prototype. This is followed by an overview of the experiments that involve the integration of the two systems. The section concludes with a summary of the results in relation to achieving RO4.
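For reference, with TP, FP, and FN denoting true positives, false positives, and false negatives, these standard metrics are defined as P = TP/(TP + FP), R = TP/(TP + FN), and F1 = 2 · P · R/(P + R).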
For the initial CW evaluation of Prepare4NER [
32], several meetings were held, involving a doctor, two Ph.D. students, and a bachelor’s student, all conducting research in the fields of Natural Language Processing (NLP) and NER. Tasks were prepared for these meetings, covering the entire workflow from importing, annotating, and splitting documents in the data preparation phase to exporting datasets for a specific NER framework and storage providers in the data provisioning phase. During the evaluation, usability insights were discussed and documented, resulting in a total of 12 findings. A key insight was that additional instructions for users should be introduced. The participants of the CW sessions recognized the need for instructions or help texts for several tasks to assist users in performing them more effectively. Similarly, the creation of label collections and labels should be simplified. Furthermore, the support for additional annotation features was emphasized, as the current state of the annotation tool does not support discontinuous, nested, or overlapping annotations. Despite these insights, which are intended for future improvements, the system was deemed suitable for the intended tasks. In addition to the initial CW, Prepare4NER was also assessed using an STR in a single review meeting. This evaluation involved five sets of inputs and included a Ph.D. student with expertise in modeling microservice-based architectures and a bachelor’s student in computer science. The outcome of the STR confirmed that Prepare4NER and its architecture are suitable for the intended purpose. In particular, the strategy pattern facilitates the intended extension of the system for future document formats, NER frameworks, and storage solutions. The insights from the STR session mainly suggest improvements such as better documentation of the HTTP interface of the strategy components to make it easier for system developers to implement new components. In addition, there is a proposal for a self-registering approach for new components to avoid the need to change the system configuration when new strategies are added. Another system improvement would be to allow system developers to test new components before integrating them into the system.
spaCNAC was evaluated both quantitatively and qualitatively. First, a quantitative analysis was performed, followed by a qualitative one. Each evaluation description begins with an explanation of the evaluation process and concludes with an in-depth analysis of the results. The CRAFT-CHEBI corpus was selected, as described in
Section 2, because it is already annotated in standardized data formats for NER. Future work can extend quantitative experiments to additional corpora, such as the BioNLP Shared Task 2011 Supporting Task-Entity Relation (REL) dataset [
40]. The quantitative evaluation of spaCNAC was conducted using the CHEBI annotations of the CRAFT corpus and the metrics P, R, and F1. The CRAFT corpus with CHEBI annotations was converted to the BRAT format [
58] using Clojure Boot according to the official documentation [
59] and then imported via the Prepare4NER import function. The split ratio for training and testing data was set in Prepare4NER so that 100% of the data was used for training, as spaCNAC later splits the dataset into training and testing data itself. The dataset was then exported for spaCNAC using the SpacyJsonConverter (see Listing 1,
Figure 8). Since Prepare4NER exports the dataset as individual documents, they had to be consolidated. The JSON documents were combined into a list using
jq and saved in a single file (see Listing 2). Then, this file was imported into spaCNAC. Within spaCNAC, the Azure compute instances shown in
Table 2 were used for training, some of which offer GPU support. Three quantitative experiments were conducted using four different Transformer models from HuggingFace, as selected in
Section 2: BERT [
43], RoBERTa [
44], GPT-2 [
45], and ClinicalBERT [
46]. The models were trained on the CRAFT-CHEBI corpus and partially compared with the work of Furrer et al. [
42], who also used BERT Transformer models for the NER task.
Listing 2. Consolidation of CRAFT-CHEBI Documents for spaCNAC using jq |
# The command uses jq to consolidate multiple JSON files into a single array.
# The '-s' option reads all input files into an array.
# '[.[][]]' accesses the contents of each JSON file and combines them into a single output.
jq -s '[.[][]]' prepare4ner-localstorage/undefined/* \
> craft-chebi5.0.2-spacy.json
# The resulting consolidated JSON array is saved to 'craft-chebi5.0.2-spacy.json'.
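For environments where jq is not available, the same consolidation can be performed with a short Python script. The sketch below assumes, as the jq command does, that each exported file contains a JSON list of spaCy documents.
# Python equivalent of the jq consolidation above: merge the JSON lists from
# all exported files into a single list and write one output file.
import glob
import json

documents = []
for path in sorted(glob.glob("prepare4ner-localstorage/undefined/*")):
    with open(path, encoding="utf-8") as f:
        documents.extend(json.load(f))  # each file holds a list of documents

with open("craft-chebi5.0.2-spacy.json", "w", encoding="utf-8") as f:
    json.dump(documents, f, ensure_ascii=False)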
In the first experiment, the impact of computational resources on the performance of spaCy NER models was investigated. The models were trained using the CRAFT corpus on three Azure CPU machines with different specifications: Standard_A2_v2, Standard_D11_v2, and Standard_A2m_v2. The results showed that training on the Standard_A2_v2 machine was aborted due to insufficient memory, while the other two machines successfully completed the training. The models trained on the Standard_D11_v2 and Standard_A2m_v2 machines exhibited significant performance differences. The model trained on the Standard_A2m_v2 machine achieved higher Precision, Recall, and F-Score values than the model trained on the Standard_D11_v2 machine, although it took twice as long to train. This suggests that more powerful computational resources can lead to better NER models, albeit requiring more time. The results are shown in
Table 3.
In the second experiment, the HuggingFace Transformers models BERT, GPT-2, ClinicalBERT, and RoBERTa were used, along with the Standard_NC4as_T4_v3 Azure VM for training. Models 1 and 2 (BERT and GPT-2) achieved a Precision of 0.84, with GPT-2 showing a higher Recall and F-Score. Despite the longest training duration (129 min), ClinicalBERT achieved the highest Precision (0.86), although its Recall was only slightly better than BERT's and lower than GPT-2's. RoBERTa stood out: despite the shortest training duration (93 min), it achieved a high Precision of 0.85, the highest Recall (0.82), and the highest F-Score (0.83). This suggests that RoBERTa offers a more balanced performance and efficient training time. Comparing the models used in this study (BERT, GPT-2, ClinicalBERT, and RoBERTa) with those from the referenced research [
42] (OGER, BiLSTM, BERT-IDs, BERT-Spans+OGER, and BERT-IDs+Spans+OGER), it is clear that the models in this work are superior. Every model evaluated in this experiment exceeds the performance of the best model from the research study, BERT-Spans+OGER, which integrates IOBES predictions with OGER annotations in a pipeline fashion. In summary, RoBERTa appears to be the most efficient and balanced model based on these metrics, although all models performed relatively well. However, the choice of model still depends on the specific requirements of a given task.
Table 4 presents the results of the model training. After extensive research, no comparable studies were found that address the training of the CRAFT-CHEBI corpus in the cloud. However, the overarching CIE approach has been utilized to train the GENERMED dataset in the AWS and Azure clouds [
6]. Although the results are not directly comparable, it is evident that training times are at a similar level. Since spaCNAC and Prepare4NER are embedded in the FIT4NER [
14] and CIE [
6] projects, the underlying architectures also facilitate transfer to other domains of knowledge, such as history or applied games [
11], as well as the use of other cloud providers such as GCP or AWS. This flexibility fundamentally allows for the avoidance of vendor lock-in.
In the third and final experiment, the learning rate hyperparameter used for training the RoBERTa model from the HuggingFace library was refined. Three learning rates were tested; the specific values and the corresponding results are shown in
Table 5. The results indicate that the learning rate significantly impacts the Precision, Recall, and F-Score of the NER models with RoBERTa. One of the tested learning rates achieved the highest Precision, while another achieved the best balance between Precision and Recall, as indicated by the F-Score, and was also the most efficient, measured by task duration. Comparing the results of this experiment with those from the research [
42], the best RoBERTa configuration outperforms all other models. The highest F-Score achieved in this experiment, 0.8337, exceeds the best F-Score of 0.7700 from other studies, which was achieved by the BERT IDs+spans+OGER model. Even the RoBERTa configuration with the lowest F-Score in this experiment surpasses all the other models in the research. Thus, the RoBERTa model, with a suitably tuned learning rate, is superior.
Following the three quantitative experiments, the results of the qualitative evaluation are presented next. spaCNAC aims to facilitate the development of NER models by leveraging spaCy, HuggingFace Transformers, and Azure cloud services within KM-EP. The primary objective of this qualitative evaluation is to confirm user-friendliness and performance. This is achieved through detailed user feedback on the prototype’s features and functionality via a CW [
55]. It is essential to capture users’ perceptions of whether they find it straightforward to train new NER models and apply Azure cloud services, and to evaluate the effectiveness of the prototype’s KM-EP integration. It is important to highlight that the system is designed to support diverse user groups, including medical experts and NER specialists.
The CW session was meticulously prepared. First, tasks to be performed during the CW session were defined. These tasks, listed in
Table 6, cover areas such as NER model definition, training, and management. Participants in the walkthrough session represented various user groups, including a Senior Data Scientist and PostDoc, a Data Scientist who is also pursuing a Ph.D., and two Senior Software Engineers. Despite their backgrounds as data scientists and NLP experts, participants were encouraged to adopt the perspective of naïve or domain users. This approach was chosen to ensure that the feedback received was as objective as possible, providing an overview of the prototype’s usability, feasibility, effectiveness, and efficiency. To ensure that all participants had a clear understanding of the project’s context, the central ideas and concepts were presented at the beginning. Participants were then guided through the application, gaining insights into the system’s functionality and design considerations.
During the evaluation, usability findings were discussed and documented, resulting in a total of seven findings (
Table 7). Initially, participants did not understand the purpose and function of the “Edit NER Model Definitions” page (CWF1). A potential solution would be to integrate a detailed description of model definitions into the user interface. The participants were also uncertain about the structure and use of the datasets provided by the system (CWF2). A detailed description of these datasets, including their format, would help users utilize them more effectively. Regarding the language selection for NER model training, it was suggested to optimize the process by automatically filling the language field in Step 2 based on the corpus selected in Step 1 (CWF3). Additionally, participants found the navigation bar confusing due to overloaded step descriptions (CWF4). Shorter and more precise step descriptions, along with a visual representation of the steps, would improve clarity. The layout of the NER Model Definitions page was considered too cluttered by users (CWF5). They had difficulties with the functionality of the “Save” and “Cancel” options during the editing process and found navigation between steps unclear. Suggested improvements include introducing “Next” and “Back” buttons for seamless navigation, as well as a clear display of the overall saving status during the editing process. Overall, users felt that the amount of information presented was sometimes overwhelming (CWF6). A concise and clear presentation of the necessary information could help balance providing sufficient data and avoiding information overload. Lastly, users desired a notification popup to confirm when the data was successfully saved (CWF7). Based on the feedback received, several adjustments were made to the system. For CWF1, clear descriptions were added for both the Model Manager and Model Definitions in their respective UI areas. The information on the provided datasets was enhanced by specifying the data type and language of each dataset (CWF2). This update also enables automatic filling of the language field based on the selected dataset, as noted in CWF3. Navigation was improved by relocating the step descriptions to their respective pages, as suggested in CWF4. Finally, in response to CWF7, a notification feature was added that activates each time the “Save” button is clicked. The issues identified in CWF5 and CWF6 remain open and are planned for future improvements. Overall, users provided positive feedback on the tasks performed.
Three quantitative experiments and three qualitative evaluations were conducted for Prepare4NER and spaCNAC. The detailed evaluation of Prepare4NER is described in [
32]. The qualitative experiments identified several usability improvement opportunities for the systems, some of which have already been implemented. Overall, the systems were deemed highly effective for assisting medical experts in creating ML-based NER models in the Azure Cloud. The quantitative experiments also demonstrated that cloud-based training produces NER models with performance at a state-of-the-art level. In summary, across all experiments, RO4 was achieved.