Article

A FIT4NER Generic Approach for Framework-Independent Medical Named Entity Recognition †

Faculty of Mathematics and Computer Science, University of Hagen, 58097 Hagen, Germany
*
Authors to whom correspondence should be addressed.
This article is a revised and expanded version of a paper entitled “Survey: Understand the Challenges of Machine Learning Experts Using Named Entity Recognition Tools”, which was presented at the 6th International Conference on Natural Language Processing, Information Retrieval and AI (NIAI 2025), 25–26 January 2025, Copenhagen, Denmark.
These authors contributed equally to this work.
Information 2025, 16(7), 554; https://doi.org/10.3390/info16070554
Submission received: 23 April 2025 / Revised: 13 June 2025 / Accepted: 24 June 2025 / Published: 29 June 2025

Abstract

This article focuses on assisting medical professionals in analyzing domain-specific texts and selecting and comparing Named Entity Recognition (NER) frameworks. It details the development and evaluation of a system that utilizes a generic approach alongside the structured Nunamaker methodology. This system empowers medical professionals to train, evaluate, and compare NER models across diverse frameworks, such as Stanford CoreNLP, spaCy, and Hugging Face Transformers, independent of their specific implementations. Additionally, it introduces a concept for modeling a general training and evaluation process. Finally, experiments using various ontologies from the CRAFT corpus are conducted to assess the effectiveness of the current prototype.

1. Introduction and Motivation

Medical Named Entity Recognition (NER) is a foundational Natural Language Processing (NLP) technology designed to alleviate Information Overload (IO) in the development of Clinical Practice Guidelines (CPGs) [1]. The creation of such guidelines typically relies on Evidence-Based Medicine (EBM), which involves comprehensive research and meticulous documentation of evidence [2]. As a result, medical professionals must analyze vast amounts of unstructured text, including scientific literature, research findings, and clinical studies. NER facilitates the extraction of knowledge from this unstructured data, transforming it into structured information in accordance with Kuhlen’s information model [1,3]. In medicine, specialized vocabularies are used to define diagnoses, findings, and treatments [4]; thus, it is essential for medical NER to accurately identify such domain-specific Named Entities (NEs). Recent advancements in Artificial Intelligence (AI), particularly in Machine Learning (ML) and Deep Learning, have significantly improved the performance of state-of-the-art NER systems [5]. However, these developments have also led to a proliferation of diverse NER frameworks and models, with new systems emerging continuously [6]. Therefore, systematically comparing these systems is crucial to determine the most suitable one for specific use cases. A generic approach that enables the integration, comparison, and selection of various NER frameworks—regardless of their implementation specifics—would offer substantial benefits. Because medical NER often requires the recognition of highly specific NEs, it is usually necessary to train or adapt ML-based NER models [4]. However, these processes are often complex and inaccessible to medical experts who may lack a technical background.
Following this contextual and terminological overview, the motivation for this study is presented through relevant research initiatives. RecomRatio [7], a European initiative, supports healthcare professionals in making informed decisions by presenting treatment options along with their respective advantages and disadvantages. It employs the Content and Knowledge Management Ecosystem Portal (KM-EP) Knowledge Management System (KMS) to enable evidence-based decisions drawn from the medical literature. Developed collaboratively by the Chair of Multimedia and Internet Applications at the Faculty of Mathematics and Computer Science, FernUniversität, Hagen [8], and the FTK e.V. Research Institute for Telecommunications and Cooperation [9], this system incorporates emerging NEs into medical documentation [10]. In the H2020 project proposal Artificial Intelligence for Hospitals, Healthcare, and Humanity (AI4H3), a hub-centric architecture was proposed [11], designed not only for managing medical and patient records but also for supporting clinical decision-making. This architecture includes a central AI hub with a semantic fusion layer for data and model storage. AI4H3 integrates KlinSH-EP, an evolution of the KM-EP system [12]. Within the AI4H3 framework, the Cloud-based Information Extraction (CIE) project [13] was initiated to develop a cloud-based NER pipeline capable of handling large-scale datasets. The CIE architecture allows multiple users to configure cloud resources and train NER models in a scalable environment. Building on this, the Framework-Independent Toolkit for Named Entity Recognition (FIT4NER) project [14], part of CIE, supports medical professionals in text analysis through a generic approach utilizing a variety of NER frameworks. It enables experimentation with different frameworks and models within KM-EP, allowing users to evaluate and choose the most suitable solution for their specific tasks. KM-EP manages documents such as clinical and research findings, serving as an evidence base for CPGs. To enhance access to these documents, it provides classification via NER and a faceted search engine based on the extracted entities [15]. The integrated ML-based NER models in KM-EP should ideally be trained by medical experts to account for domain-specific terminology, abbreviations, and potential spelling variations [4].
Nonetheless, the dynamic nature of NER research presents several challenges. First, medical experts must compare a range of tools to identify the most appropriate ones [6]. This comparison is often difficult due to numerous configurable parameters and differing performance metrics, complicating the identification of tools that effectively extract, analyze, and visualize NLP features. Second, the performance of selected tools is a critical success factor in NER projects [6]. Third, users often lack the computational and storage capabilities necessary to train domain-specific NER models [13]. While cloud computing offers a potential solution, many experts lack the technical expertise to utilize it effectively [13].
Given these challenges, this study aims to develop and evaluate a flexible NER system that enables medical professionals to efficiently analyze domain-specific texts and compare NER frameworks [16]. The primary Research Objective (RO) is to design a generic approach that supports the integration of various NER frameworks. This leads to the following Research Questions (RQs), addressed in a master’s thesis inspired by the CIE and FIT4NER projects [17]:
  • RQ1: How can generic approaches standardize the training and evaluation process of ML-based NER models across different frameworks?
  • RQ2: How can an abstraction layer for the training process of ML-based NER models be implemented and evaluated in a prototype system to support the selection and comparison of various NER frameworks?
To comprehensively address these RQs, this study follows the Nunamaker methodology [18], a structured and widely recognized approach. It outlines specific ROs across four phases: Observation, Theory Building, System Development, and Experimentation. Section 2 covers the observational objectives through a detailed review of the state of the art. Section 3 focuses on theory building by developing models to help medical professionals train ML-based NER systems using a generic approach. Section 4 addresses system development by presenting the prototype for model training. Section 5 discusses experimentation, including quantitative evaluation of the prototype. Finally, Section 6 summarizes the study’s main findings.

2. State of the Art in Science and Technology

This section covers the observation phase, providing a foundation for this work and for pertinent research activities. It reviews the current state of the art and identifies potential Remaining Challenges (RCs) within the domains addressed. Initially, NER is introduced, and a literature review is conducted to identify existing abstraction projects with similar objectives. Subsequently, the key elements of a general training and evaluation process are discussed. Finally, the suitability of various design patterns for developing an abstraction layer for NER frameworks across different programming languages is examined.
NER is a crucial NLP technique designed to extract named entities from unstructured text documents [19]. In the medical domain, NER plays a pivotal role in Clinical Decision Support Systems (CDSSs) and facilitates clinical information mining from Electronic Health Records (EHRs) [20]. Recent years have witnessed significant advances in NER, primarily driven by the development of novel techniques and models, including deep learning [21]. These advances have substantially improved the performance of NER systems, establishing NER as one of the most extensively researched NLP tasks [21]. Various approaches to NER exist, including traditional, ML-based, and hybrid methods [22]. Traditional approaches rely on fixed rules or dictionaries, whereas ML-based approaches use statistical models that learn from annotated data [22]. ML-based approaches have grown in popularity due to their ability to efficiently process large unstructured datasets [21]. Large Language Models (LLMs) such as BERT [23], GPT-2 [24], and RoBERTa [25] have further improved NER performance and can be fine-tuned on smaller domain-specific datasets. The rapid evolution of NER has led to new tools and frameworks that address various challenges and use cases. Given the wide range of options available, it is essential to compare and select the most suitable tools for specific tasks. To integrate, compare, or interchange various NER frameworks regardless of their specific implementations, it is advisable to adopt a “generic approach”. In the engineering context, Murray-Smith defines a structure as “generic” if it “allows reuse of simulation software for a wide range of different projects with relatively minor reorganisation” [26] (p. 304, Chapter 9.4). Applied to NER, this leads to the development of a universal, reusable, and standardized interface or methodology that facilitates these processes independently of specific implementations.

In the literature, several initiatives aim to abstract and simplify the training of NLP models. A well-known example is Keras [27], a high-level library based on TensorFlow [28], an open-source framework from Google for creating and executing ML models. Keras is used primarily for deep neural networks and allows researchers to create deep learning models with minimal coding effort [27]. Although Keras provides useful abstractions for general ML, it is often insufficient for NLP due to the complexity and variability of the data types involved [29]. Another example is the OpenNMT neural machine translation toolkit, which was developed specifically for machine translation [30]. OpenNMT offers a solid foundation for training models that translate text between languages, but it is limited to this task and provides little support for other NLP tasks such as text classification or NER [30]. In the field of dialogue research, ParlAI [31] is a research platform that enables the training and comparison of various dialogue models. ParlAI is designed for the simulation and analysis of conversations and offers specialized tools for these tasks, but like OpenNMT, it is restricted to a specific application and offers limited flexibility for other NLP tasks [31]. Unlike these specialized toolkits, AllenNLP [29] provides a general platform for a wide range of NLP tasks. It abstracts not only model creation but also data processing and experimentation with different model architectures [29]. These abstractions allow researchers to focus on the primary goals of their research without dealing with implementation details. A unique feature of AllenNLP is the ability to define models using declarative configuration files [29]. These configuration files enable researchers to make key decisions about model architecture and training parameters without changing the code, promote the reproducibility of experiments, and facilitate the sharing of research results [29]. Another approach in the medical field is MedNER [32], a service-oriented framework for NER. MedNER allows medical professionals to work with various NER models but supports neither the use of different NER frameworks nor the training of custom ML-based NER models.

The NLP abstraction projects presented here offer some helpful approaches but are predominantly designed for other NLP areas and are not directly tailored to the requirements of NER training. Keras provides a simplified abstraction for model training but is not natively optimized for NER. AllenNLP offers a structured approach to model configuration through declarative configuration files, which can support the reproducibility and standardization of NER experiments. OpenNMT and ParlAI focus on machine translation and dialogue-oriented applications, making their applicability to generic NER limited. Overall, Keras and AllenNLP are valuable inspirations due to their abstraction and configuration approaches, but their focus on Python and on other NLP areas makes them unsuitable for cross-platform, NER-focused training. Therefore, RC1 remains: to develop an abstraction layer for ML-based NER that supports the comparison and selection of NER frameworks.
Before developing an abstraction layer for ML-based NER, it is crucial to understand the phases involved in the development of an NER model. The process model for AI-based knowledge extraction support for CPG development (Figure 1) divides this process into five phases [1]: “Data Management and Curation”, “Analytics”, “Interaction and Perception”, “Model Deployment”, and “Insight and Effectuation”. This work focuses on the “Analytics” phase. In this phase, the primary actor is the defined stereotype “Model Definition User”, who selects an NER framework, defines the Hyperparameters (HPs), and trains the model. Subsequently, the model is evaluated and compared with other NER frameworks. These steps are performed iteratively until the Model Definition User is satisfied with the results and the model can be deployed in the subsequent phases for productive use.

HPs are crucial factors in model training and significantly impact model performance. These parameters are set before the training process begins and dictate the behavior of the model during training [33]. The optimal selection of HPs can greatly improve model performance. An example of an HP is the learning rate, which indicates how quickly the model learns during training. A learning rate that is too high can lead to instability, whereas a rate that is too low can slow down the convergence of the model [34]. Therefore, choosing the optimal learning rate is vital for model performance. Another important HP is the number of epochs, which refers to the number of passes through the training dataset. Too few epochs can result in insufficient model adaptation, while too many can lead to overfitting. Other HPs that influence model behavior include the batch size, regularization parameters, and the number of hidden layers in a neural network [34]. Selecting and optimizing these HPs is a key task in model development and often requires experimental approaches and experience.

After training, the model undergoes evaluation and validation. To assess the performance of an NER model or implementation, the results must be quantified [35]. Standard metrics such as Precision (P), Recall (R), and F-Score (F1) are used to measure the efficiency of an NER model [35]. P indicates the percentage of entities identified by the system that are correct [36], while R describes the percentage of entities present in the corpus that are detected by the system [36]. The F1, calculated as the harmonic mean of P and R, provides a comprehensive performance metric [37]. Another important metric is Accuracy, which measures the proportion of correct predictions (both correct identifications and correct non-identifications) out of the total number of predictions, offering a general overview of the model’s performance [38]. Thus, RC2 remains: to develop a standardized training process for medical experts that incorporates essential HPs such as the learning rate and the number of epochs, as well as evaluation metrics such as P, R, and F1.
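For reference, with TP, FP, FN, and TN denoting true positives, false positives, false negatives, and true negatives, these standard metrics are defined as follows:

$$
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot P \cdot R}{P + R}, \qquad
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$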
As part of the FIT4NER project, a system has been designed to assist medical experts in creating domain-specific NER models using various frameworks [16]. To achieve this, three NER frameworks were selected as examples: Stanford CoreNLP [39], a proven Java-based NER framework that has been utilized in numerous projects, together with spaCy [40] and Hugging Face Transformers [41], which were chosen due to their widespread use among experts, as indicated by the previous survey [6]. All of these frameworks can be controlled through configuration files, thus supporting the concept of Config-Driven Development (CDD) [42]. CDD is an approach in which the configuration of an application or system is centralized and separated from the code logic, allowing adjustments and changes to be made through configuration files without altering the underlying code. Many other NER frameworks are expected to support CDD as well.

An abstraction layer is essential to ensure that the code is modular, extensible, and maintainable, with design patterns playing a crucial role [43]. Particularly relevant are the Strategy Pattern, the Abstract Factory Pattern, and the Bridge Pattern [43]. The Strategy Pattern facilitates the definition of a family of algorithms that can be interchanged independently of their context and allows for the dynamic adaptation of algorithm implementations. However, it is less suitable for multilingual abstractions, as its focus is on algorithm interchangeability within a single programming language. The Abstract Factory Pattern provides an interface for creating related objects without specifying concrete classes. It can generate platform-independent instances of NER components in various languages; however, complexity increases with the number of components to manage, necessitating clearly defined interfaces. The Bridge Pattern decouples abstractions from their implementations, allowing both to evolve independently. It conceals implementation specifics behind a unified abstraction layer, enabling a clear separation between generic and specific levels. This pattern offers the flexibility needed to design abstractions that are easily extendable and adaptable. In summary, RC3 remains: to leverage the Bridge Pattern to develop a cross-platform abstraction layer that effectively supports the training of NER models across multiple programming languages.
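To make the role of the Bridge Pattern concrete, the following minimal Python sketch illustrates how a framework-independent abstraction could delegate training to interchangeable framework backends. The class and method names are illustrative assumptions, not the actual FIT4NER code.

```python
from abc import ABC, abstractmethod


class NERFrameworkImplementation(ABC):
    """Implementation side of the Bridge: one subclass per NER framework."""

    @abstractmethod
    def train(self, config: dict, train_data_path: str) -> str:
        """Train a model and return the path of the stored model artifact."""


class SpacyImplementation(NERFrameworkImplementation):
    def train(self, config: dict, train_data_path: str) -> str:
        # Framework-specific calls (e.g., via spacy's training CLI) would go here.
        return "models/spacy-model"


class CoreNLPImplementation(NERFrameworkImplementation):
    def train(self, config: dict, train_data_path: str) -> str:
        # A Java subprocess invoking the CRFClassifier would go here.
        return "models/corenlp-model.ser.gz"


class NERTrainer:
    """Abstraction side of the Bridge: generic process, pluggable backend."""

    def __init__(self, implementation: NERFrameworkImplementation):
        self.implementation = implementation

    def train_model(self, config: dict, train_data_path: str) -> str:
        # Generic steps (validation, logging, ...) stay framework-independent.
        return self.implementation.train(config, train_data_path)


# The same abstraction works with any registered framework backend:
trainer = NERTrainer(SpacyImplementation())
model_path = trainer.train_model({"epochs": 3, "learning_rate": 0.001}, "train.json")
```

Because the abstraction and the implementations evolve independently, a new framework backend can be added without touching the generic training logic.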
In this section, a critical analysis of the current state of the art relevant to this research has been carried out, leading to the identification of three RCs. The first challenge (RC1) involves the development of an abstraction layer for ML-based NER that facilitates the comparison and selection of NER frameworks. The second challenge (RC2) focuses on establishing a standardized training process for medical professionals, incorporating fundamental HPs such as the learning rate and number of epochs, in conjunction with evaluation metrics such as P, R, and F1. The third challenge (RC3) concerns the implementation of the Bridge Pattern to construct a cross-platform abstraction layer that efficiently supports the training of NER models in various programming languages. The subsequent section provides a comprehensive account of the development of appropriate models designed to tackle all identified challenges (RCs 1–3).

3. Modeling

Subsequent to a comprehensive examination of the prevailing developments in science and technology, this section advances to the stage of theory building, with the formulation of fundamental models for an abstraction layer that aids medical experts in constructing NER models using various NER frameworks and evaluating their outputs. This endeavor is achieved by taking into account the RCs delineated in Section 2, which mirror the most recent advancements in the field. For the purposes of design and conceptual modeling, the User-Centered System Design (UCSD) method [44] is adopted, with Unified Modeling Language (UML) [45] acting as the specification language. As part of the FIT4NER project, the Framework-Independent Layer for Training and Applying Named Entity Recognition (FILTANER) subsystem was developed based on a master’s thesis [17]. The Use Cases (UCs) created in this work are shown in Figure 2 and form the foundation of the study [16]. The overarching FIT4NER UCs are depicted in green, while the detailed FILTANER UCs are shown in magenta. Based on these UCs, a generic training process for ML-based NER models is developed, which can be trained with various NER frameworks. This generic training process is then represented in an abstraction layer and integrated into the KM-EP system.
Initially, a concept for modeling a generic training and evaluation process for NER models is introduced, applicable to frameworks such as Hugging Face Transformers, Stanford CoreNLP, and spaCy. Despite differing implementations, these frameworks share similar resources and functionalities that enable a comparable training process, particularly in data processing, model training, and model evaluation. The proposed method serves as a foundation for abstracting the training process by decoupling training-specific parameters from the actual implementation of the framework. The studies mentioned in Section 2 indicate that CDD is a promising approach for this purpose. This allows key settings, such as the learning rate and number of epochs, to be configured independently of the framework used. This separation facilitates designing and testing the training process for various models and frameworks within a unified framework, even when implementation details vary. The goal is to develop a generic approach based on the highest abstraction levels of the frameworks, controlled via a configuration file. The model shown in Figure 3 illustrates the resources required for NER training and demonstrates how interaction with these resources can be decoupled from implementation details through the use of CDD.
The generic process begins with the optional loading of a base model, especially when fine-tuning a pre-trained transformer model. This step can be skipped if the training dataset is sufficiently large to support training from scratch or if transformer-based models are not supported by the selected framework. Once any base model is initialized, data pre-processing follows—an essential step for transforming raw input into a standardized format compatible with the target framework. Typical pre-processing tasks include tokenization, normalization, and data formatting. Without thorough and consistent data preparation, irregularities or incompatible formats can significantly degrade model performance. The next key step involves initializing training parameters using a configuration file, which defines cross-framework HPs such as the learning rate, batch size, and number of training epochs. As introduced in Section 2, careful selection of these HPs is critical and aligns with the system’s design objective: FIT4NER aims to empower medical domain experts to experiment with diverse NER frameworks without requiring in-depth ML expertise. To lower the entry barrier and reduce cognitive complexity, the user interface does not expose all framework-specific architectural options. Instead, the system provides predefined training profiles (e.g., Fast, Medium, and Accurate), which may internally configure framework-specific parameters [16]. These profiles are designed to balance ease of use with performance, enabling domain-specific experimentation with minimal effort. Importantly, the system remains fully open to advanced users. All training configurations are documented, versioned, and reusable. Through advanced interfaces in CDD, users can directly modify framework-specific parameters via configuration files, enabling both low-threshold access for non-experts and high flexibility for experienced users. This decoupled approach enhances reusability and maintainability, as configuration files can be adapted independently of the core codebase. After training is complete, the resulting model is saved, allowing it to be reused for further evaluations or applications. The process concludes with model evaluation using validation data, which measures how effectively the model performs the intended task, assessing its robustness and generalization to unseen data.
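To make the profile mechanism tangible, the following minimal sketch shows how named training profiles could map to cross-framework HPs under the CDD approach. The profile keys and values are hypothetical and do not reproduce FIT4NER’s actual configuration files.

```python
# Hypothetical CDD-style training profiles; actual FIT4NER keys/values may differ.
TRAINING_PROFILES = {
    "fast":     {"epochs": 1,  "learning_rate": 0.01,   "batch_size": 64},
    "medium":   {"epochs": 3,  "learning_rate": 0.001,  "batch_size": 32},
    "accurate": {"epochs": 10, "learning_rate": 0.0005, "batch_size": 16},
}


def load_training_config(profile: str, overrides: dict | None = None) -> dict:
    """Resolve a profile to concrete hyperparameters, allowing expert overrides."""
    config = dict(TRAINING_PROFILES[profile])
    if overrides:  # advanced users may adjust framework-specific parameters
        config.update(overrides)
    return config
```

Non-experts select only a profile name, while experienced users pass overrides that reach the framework-specific configuration, mirroring the two levels of access described above.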
The model shown in Figure 4 illustrates the generic training process as a UML activity diagram, designed to be implemented independently of the NER framework. The process begins with loading a configuration file that specifies all training parameters. Next, a base model is loaded, which can be either a pre-trained model or a new, empty model. After optionally loading a base model, the training data is read. If the data is not already in a format supported by the framework, it undergoes pre-processing, including tokenization, normalization, and annotation, with the appropriate entity tags. These prepared data serve as the basis for actual training. The training parameters are extracted from the configuration file and initialized. The model training proceeds iteratively, allowing the model to learn the relationships between words and their associated NEs. Upon completion of the training, the successful execution is verified, and the model is saved. Subsequently, the trained model is evaluated using a separate validation set to assess its generalization capability. The results of this evaluation are compiled into a detailed report. This process provides a structured framework for the development of NER models, offering the flexibility to test and compare various models and configurations.
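A compact sketch of this activity flow is given below, assuming a generic framework adapter interface in the spirit of the Bridge Pattern; all method names are placeholders rather than the actual FILTANER API.

```python
import json


def run_generic_training(config_path: str, framework) -> dict:
    """Generic NER training flow mirroring Figure 4 (illustrative sketch only)."""
    # 1. Load the configuration file with all training parameters.
    with open(config_path) as f:
        config = json.load(f)

    # 2. Optionally load a base model (pre-trained or new and empty).
    model = framework.load_base_model(config.get("base_model"))

    # 3. Read the training data and pre-process it if necessary
    #    (tokenization, normalization, entity-tag annotation).
    data = framework.read_training_data(config["train_data"])
    if not framework.is_supported_format(data):
        data = framework.preprocess(data)

    # 4. Initialize the training parameters and train iteratively.
    model = framework.train(model, data, epochs=config["epochs"],
                            learning_rate=config["learning_rate"])

    # 5. Verify successful execution and save the trained model.
    framework.save_model(model, config["output_path"])

    # 6. Evaluate on a separate validation set and compile a report.
    metrics = framework.evaluate(model, config["validation_data"])
    return {"model_path": config["output_path"], "metrics": metrics}
```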
To integrate this generic training process into KM-EP, multiple components must work together seamlessly. Figure 5 provides a detailed description of the components and interfaces involved, which have already been thoroughly described in [16]. The architecture introduces a modular and scalable structure for managing and deploying trained models and configurations, enabling the dynamic use of various NER frameworks. Key components include the Model Registry, the Model Definition Registry, and the NER Framework Service, which facilitate model application, evaluation, and training. The NER Framework Independent Service acts as middleware, providing centralized control and access via REST interfaces. The Service Registry manages the registration and discoverability of NER services, while entities such as TrainJob, EvaluationJob, and ApplyJob represent training, evaluation, and application processes.

The Model-View-Controller (MVC) model shown in Figure 6 illustrates the technical architecture, depicting the integration and interaction of individual elements within the overall system. It outlines the system’s modular architecture and its implementation of the Bridge Pattern for the registration, configuration, training, evaluation, and application of NER models. The architecture is based on a clear separation of responsibilities across various components that communicate via REST interfaces; each component and its interactions are explained below. Central to the architecture are the registry components, which facilitate the storage and management of models and configurations. The Object Storage, based on MinIO, is used to store trained models in the Model Registry and configuration data in the Config Registry. The Model Registry Controller and the Model Definition Registry Controller, both implemented as microservices using Python FastAPI, enable access to these data. The Model Registry Controller manages the model metadata and provides APIs for registering, retrieving, or updating models. The Model Definition Registry Controller offers similar functionalities for configurations, including versioning and profile management.

The NER Framework Service components form the functional core of the architecture. The Training Service enables the training of NER models by processing training data, configurations, and optional base models. Training jobs implemented in Java or Python are initiated here. The Evaluation Service, accessed via the same controller in the NER Framework Service, evaluates trained models using validation data, executing evaluation jobs that generate metrics such as P, R, and F1. The application of models is facilitated by the Apply Service, which allows trained models to be applied directly to input texts, with the results forwarded to other components.

The NER Framework Independent Service implements the Bridge Pattern and acts as the interface between the NER Framework Service and the KM-EP system. Also implemented with Python FastAPI, it includes a Service Registry using Python TinyDB, which stores information about available frameworks. The Usage Controller translates user requests from the KM-EP system into calls to the respective microservices, retrieving data such as configuration profiles or models from the registry components and processing microservice results before returning them to the KM-EP system.

The KM-EP system is the entry point for users. Through a user interface implemented in PHP with Symfony, tasks such as training or evaluating NER models can be initiated. The controller processes user requests and communicates with the microservices through the NER Framework Independent Service. The results are then asynchronously retrieved by the FILTANER Controller from the Usage Controller and presented through the FILTANER View. The FILTANER Model within the KM-EP system manages the application logic and data models necessary for interaction with the NER Framework Independent Service. The interaction between components is conducted consistently via REST protocols, ensuring clear and structured communication. This modular architecture achieves a clear separation of responsibilities, allowing the various microservices to be independently scalable and extendable. At the same time, the NER Framework Independent Service enables centralized coordination and unified management of models and configurations, providing a flexible and efficient solution for working with NER models. Thus, the developed models effectively address all outstanding challenges (RCs 1–3).

4. Implementation

This section covers the system development phase, focusing on the RO to develop a middleware that serves as an NER Framework Independent Service, providing a generic approach for the training, evaluation, and application of NER models while enabling the integration of frameworks such as Stanford CoreNLP, spaCy, and Hugging Face Transformers. Initially, the Model Definition Registry and Model Registry are described, which handle central management tasks for model and configuration data. Following this, the implementation of the NER Framework Service is explained, which operates as a standalone microservice responsible for the specific training, evaluation, and application processes of each framework. Based on this, the implementation of the NER Framework Independent Service is detailed, which acts as an abstraction layer and offers a consistent interface for overarching workflows.
The Model Registry and Model Definition Registry play a central role in managing resources related to NER models. Both registries have similar architectures and offer comparable functionalities, yet they differ in the type of objects they manage and their specific roles within the overall process. They are based on object-oriented data storage, housed in an S3-compatible object storage solution (MinIO). Models and configurations are identifiable by unique attributes such as the key, the name, and the registry_url. The registries provide interfaces developed with FastAPI for storing, retrieving, deleting, and tagging stored files, ensuring persistent storage and retrieval of collected models and the necessary configuration files. The main differences lie in the type of stored data and the specific purpose of each registry. The Model Registry manages trained NER models, making them available to NER services for application, evaluation, or training. In contrast, the Model Definition Registry is responsible for storing and managing the configuration files required for creating and reproducing these models. The endpoints of these registries follow a similar schema, differing only in the specific designations for the stored artifacts, which simplifies integration with other system components. The deployment consists of three containers: MinIO for object storage of models and configuration files, and two FastAPI microservices—filtaner-model-registry for model versioning and filtaner-model-definition-registry for model definitions. MinIO provides the Model Bucket and the Config Bucket, where the registries can store and retrieve their data. The registries interact with these buckets to manage and provide models and definitions.
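For illustration, a minimal sketch of a registry endpoint persisting an artifact in MinIO is given below; the endpoint path, bucket name, and connection credentials are assumptions and not the actual registry implementation.

```python
from io import BytesIO

from fastapi import FastAPI, UploadFile
from minio import Minio

app = FastAPI()
# Assumed MinIO connection parameters, for illustration only.
client = Minio("minio:9000", access_key="minio", secret_key="minio123", secure=False)
MODEL_BUCKET = "models"  # assumed name for the Model Bucket described in the text


@app.post("/models/{key}")
async def register_model(key: str, file: UploadFile):
    """Persist an uploaded model artifact in the object storage under a unique key."""
    data = await file.read()
    client.put_object(MODEL_BUCKET, key, BytesIO(data), length=len(data))
    return {"key": key, "name": file.filename, "registry_url": f"/models/{key}"}
```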
As part of this work, prototypes of the NER framework services for CoreNLP, spaCy, and Hugging Face Transformers were implemented. These prototypes utilize the FastAPI framework and employ individual scripts as well as specific implementations of the respective features. The basic structure of the implementation is similar across all services: The RegistrationController.py registers the service and loads configurations into the Model Definition Registry for training profiles. The Service Controller provides all FastAPI resources that can be used by individual services. All NER framework services use Pydantic models to ensure consistent, type-safe, and easily maintainable API interfaces. Each NER Framework Service must offer the following four endpoints: train_model, evaluate_model, apply_model, and get_heartbeat. Additionally, each service includes a dedicated folder where framework-specific implementations are performed. All microservices contain at least three files in these folders that fulfill the respective tasks (Train, Evaluate, and Use). In Stanford CoreNLP, the Java classes are additionally accessed via a Python subprocess call. In Listing 1, an example is provided of how the Python-based microservice interacts with the Java-based Stanford CoreNLP component. Initially, the training and evaluation data, originally in JSON format, are converted to the TSV format required by CoreNLP by changing the file extension from .json to .tsv and invoking the json_to_tsv function. Next, the create_temp_properties function generates a temporary properties file that contains all the configuration parameters and the paths to the TSV files. This properties file serves as the foundation for the subsequent training process. Using the Python subprocess module, a Java process is initiated, which allocates 8 GB of memory with the “-mx8g” argument and loads the necessary JAR files via the specified classpath. Through this process, the Java class edu.stanford.nlp.ie.crf.CRFClassifier is called, which initiates the training of the CRF-based model using the prepared properties file. In addition to using the official CoreNLP classes, two supplementary Java classes were developed for application and evaluation within the CoreNLP-NER service. The Java implementation in the ApplyCoreNLP class reads an input file and loads a trained NER model based on the model path provided to process the text using a Stanford CoreNLP pipeline. The detected NEs are then output in a structured JSON object, which includes the original text and details for each identified token (text, label, start, and end position). In the corresponding Python implementation, the Java process is initiated asynchronously via a Java jar call, and the JSON result is returned.
Listing 1. Interaction between the Python-based microservice and the Java-based Stanford CoreNLP component.
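A minimal Python sketch of the interaction shown in Listing 1, reconstructed from the description above, is given below; the helper bodies are stubs, and the exact classpath and property keys are assumptions.

```python
import subprocess
import tempfile


def json_to_tsv(json_path: str, tsv_path: str) -> None:
    """Stub for the service's JSON-to-TSV converter described in the text."""
    ...


def create_temp_properties(config: dict, train_tsv: str, eval_tsv: str) -> str:
    """Stub for the helper writing a temporary CoreNLP properties file."""
    with tempfile.NamedTemporaryFile("w", suffix=".properties", delete=False) as f:
        f.write(f"trainFile = {train_tsv}\ntestFile = {eval_tsv}\n")
        for key, value in config.items():
            f.write(f"{key} = {value}\n")
        return f.name


def train_corenlp_model(train_json: str, eval_json: str, config: dict) -> None:
    """Train a CRF-based CoreNLP model via a Java subprocess (illustrative)."""
    # Convert the JSON training/evaluation data to CoreNLP's TSV format.
    train_tsv = train_json.replace(".json", ".tsv")
    eval_tsv = eval_json.replace(".json", ".tsv")
    json_to_tsv(train_json, train_tsv)
    json_to_tsv(eval_json, eval_tsv)

    # Generate a temporary properties file with all configuration parameters.
    props_path = create_temp_properties(config, train_tsv, eval_tsv)

    # Launch the Java process: 8 GB heap ("-mx8g"), JARs on the classpath
    # (path assumed), and the CRFClassifier driven by the properties file.
    subprocess.run(
        ["java", "-mx8g", "-cp", "stanford-corenlp/*",
         "edu.stanford.nlp.ie.crf.CRFClassifier", "-prop", props_path],
        check=True,
    )
```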
In the communication between the NER framework services and the registry components, the Model Definition Registry is used only for writing in the Registration Controller and reading during the training process. In contrast, the Model Registry is utilized across all resources. During registration, the respective NER Framework Service can specify whether a base model from the Model Registry can be used during training. If this is the case, the model is loaded from the registry as needed, and training continues. If such a feature is absent, as is the case with Stanford CoreNLP, the same model installed as a dependency by the service is used for each training process.
The prototypical implementation of the NER Framework Independent Service is based on a FastAPI application written in Python and deployed within a Docker container. This component has two primary functions: it acts as a Service Registry that manages the various NER framework services and stores their metadata, and it serves as a Usage Controller, overseeing the management and execution of training, evaluation, and application jobs that are asynchronously forwarded to the registered NER framework services. Additionally, a heartbeat mechanism periodically checks each stored service and removes it from the database if no response is received within a specified timeframe, ensuring that the registry remains consistent and references only operational services.

The Service Registry exposes three key endpoints for registering (POST /register-ner-service/), querying (GET /ner-services/), and deleting (DELETE /delete-ner-service/framework_name) NER services. Each request searches the TinyDB instance for an existing entry, which is either updated or newly created.

The Usage Controller supports requests such as training new models, evaluating existing models, and applying models to text data, with dedicated endpoints for each: POST /framework_name/train_model/, POST /framework_name/evaluate_model/, and POST /framework_name/apply-model/. Each call creates a job in the TinyDB instance with a “pending” status, followed by an asynchronous task that handles the actual processing. The request is acknowledged immediately, while the data is uploaded in the background and forwarded to the registered endpoint. Upon completion, the status in jobs_db is updated to “completed” or “failed”, allowing client applications to query the current status of a job at any time via GET /job-status/job_id. When the train_model task is executed in the background, it calls the train-model endpoint of the respective NER Framework Service, passing the necessary and optional parameters, and the result of the request is stored in the job database. The endpoints for evaluating and applying models follow the same principle: they create a job upon invocation, execute evaluate_model or apply_model in the background, and set the final status upon completion. This status can be retrieved via GET /job-status/job_id to obtain results upon successful completion or detailed information in the case of failure. Figure 7 summarizes the OpenAPI specification of the NER Framework Independent Service.
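The following condensed FastAPI sketch illustrates the described job lifecycle (pending, background forwarding, completed/failed). It is a simplified approximation of the Usage Controller; details such as the service URL resolution are assumptions.

```python
import uuid

import httpx
from fastapi import BackgroundTasks, FastAPI
from tinydb import Query, TinyDB

app = FastAPI()
jobs_db = TinyDB("jobs.json")


async def run_train_job(job_id: str, endpoint: str, payload: dict) -> None:
    """Forward the training request to the registered NER Framework Service."""
    try:
        async with httpx.AsyncClient(timeout=None) as client:
            response = await client.post(endpoint, json=payload)
            response.raise_for_status()
        status = "completed"
    except httpx.HTTPError:
        status = "failed"
    jobs_db.update({"status": status}, Query().job_id == job_id)


@app.post("/{framework_name}/train_model/")
async def train_model(framework_name: str, payload: dict, tasks: BackgroundTasks):
    # Create the job with a "pending" status and acknowledge immediately.
    job_id = str(uuid.uuid4())
    jobs_db.insert({"job_id": job_id, "status": "pending"})
    endpoint = f"http://{framework_name}:8000/train_model"  # assumed service URL
    tasks.add_task(run_train_job, job_id, endpoint, payload)
    return {"job_id": job_id, "status": "pending"}


@app.get("/job-status/{job_id}")
async def job_status(job_id: str):
    """Allow clients to poll the current status of a job at any time."""
    return jobs_db.get(Query().job_id == job_id)
```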
The KM-EP FILTANER Controller serves as the central management unit for the FILTANER components within KM-EP. It oversees the creation of model definitions, as well as the training, evaluation, and application of models. Essentially, it orchestrates the data flow between various services: it communicates with the NER Framework Independent Service for job creation and monitoring, with the Model Registry for model management, and with the Model Definition Registry for loading and creating configuration files. A key feature is asynchronous job management, which allows long-running training processes to be executed in the background and their status to be monitored. Following the MVC pattern, the controller also manages the PHP model based on Doctrine entities, where the FILTANER model objects represent the status and metadata of the models in the database. For user interaction, four different GUIs are provided via Twig templates: the FILTANER homepage (Figure 8), the interface to define and train NER models, the GUI to evaluate trained NER models, and the interactive GUI to apply trained NER models to text documents. A detailed description of these GUIs has been published in [16].

5. Evaluation

The previous section detailed the implementation of a prototype designed to assist medical experts in training ML-based NER models. This section focuses on the experimentation phase, addressing the RO of conducting and describing quantitative experiments on the developed prototype. The aim of the quantitative evaluation is to assess the performance of the NER framework services spaCy, Stanford CoreNLP, and Hugging Face Transformers, using metrics such as P, R, and F1, and to compare these results with those of other research initiatives. Although this study empirically evaluated only three widely used NER frameworks, FIT4NER is conceptually designed to be independent of specific NER frameworks. This independence is supported by both technical and methodological principles. As detailed in Section 3, the architecture of FIT4NER abstracts individual NER frameworks through clearly defined interfaces, while the principle of CDD enables the decoupled integration of new frameworks with minimal adaptation effort. Throughout the development process and internal qualitative evaluations, various model architectures were intentionally tested to identify potential limitations to framework independence. No fundamental constraints were found that would hinder the integration of additional NER systems. Furthermore, a cognitive walkthrough evaluation conducted with domain experts revealed no indications of limitations to framework independence [16]. The selection of evaluated frameworks encompasses diverse paradigms (both statistical and transformer-based approaches) as well as different programming languages (Python and Java), ensuring broad functional coverage. Future work will aim to integrate and quantitatively evaluate additional NER frameworks (e.g., NLTK, Flair, and OpenNLP) to further validate and demonstrate the generalizability of the architecture.
To evaluate the prototype developed in this study, a suitable dataset is essential. The Colorado Richly Annotated Full Text (CRAFT) corpus [46] comprises 97 medical articles from PubMed, containing over 760,000 tokens, each meticulously annotated to identify entities such as genes, proteins, cells, cellular components, biological sequences, and organisms. Version 5 of the CRAFT corpus includes semantic annotations mapped to a variety of biomedical ontologies. Several studies have employed this corpus for quantitative evaluation. Basaldella et al. [47], Hailu et al. [48], and Langer et al. [49] conducted experiments using ontologies such as Chemical Entities of Biological Interest (CHEBI) [50], Cell Ontology (CL) [51], Protein Ontology (PR) [52], and Uber-anatomy Ontology (UBERON) [53], all of which are incorporated within the CRAFT corpus.
However, a direct performance comparison with previous studies [47,48,49]—which evaluated NER models trained on the CRAFT corpus using similar medical ontologies—is limited. These studies often lack transparency regarding the selection of ontology nodes (e.g., top-level vs. low-level hierarchies) and provide insufficient documentation of evaluation strategies and configuration details, making reproduction difficult. To assess the performance of the current prototype, the results from these prior works were used as a reference framework. The objective was not to achieve a direct numerical comparison, but rather to demonstrate that FIT4NER is capable of training and evaluating medical NER models effectively. Future work may pursue greater comparability through targeted reimplementations, leveraging available resources such as published models, source code, and configuration files. Nonetheless, the primary focus of this study is to validate the framework’s ability to support domain-specific evaluation processes under realistic conditions.
Before training experiments were conducted, the XML-based CRAFT annotations first needed to be transformed into the JSON format used internally by FILTANER. A conversion script was implemented for this purpose, following a four-step process: First, the ontology hierarchy was analyzed; then, the raw annotations were loaded from the XML files and linked to the corresponding text documents. In the third step, the annotations were filtered according to specific biological target categories; the target categories (entity types) are listed by ontology in Table 1. Finally, the prepared data was split into training and evaluation sets with an 80:20 train–test split (train_ratio = 0.8), with the data randomly shuffled before splitting to ensure a representative distribution in both subsets. For training with spaCy and Hugging Face Transformers, three epochs with an initial learning rate of 0.001 were chosen, matching spaCy’s initial learning rate. Stanford CoreNLP used its standard configuration. For Hugging Face Transformers, the base model prajjwal1/bert-tiny [54,55] was used. Evaluation metrics were calculated on the basis of entity spans and may therefore differ from token-based experiments. This configuration was chosen to establish a fair and consistent basis for comparison between the frameworks. The focus of the evaluation was not on maximizing model performance through targeted hyperparameter tuning, but rather on demonstrating that FIT4NER can train effective NER models under realistic conditions without extensive optimization. The aim was to show that the results achieved with FIT4NER are in a performance range similar to that of previous studies and can thus be considered comparable to the state of the art. Although systematic hyperparameter optimization could be pursued in future work, it would primarily enhance model performance without significantly affecting the fundamental insights regarding the functionality and applicability of the developed system.
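As an illustration of the final step, a minimal sketch of the shuffle-and-split logic is shown below; the document representation and the fixed seed are assumptions, not the actual conversion script.

```python
import random


def split_dataset(documents: list, train_ratio: float = 0.8, seed: int = 42):
    """Randomly shuffle annotated documents and split them 80:20."""
    shuffled = documents[:]  # copy so the original order stays intact
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```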
Table 2 and Table 3 summarize the results of the NER model training and compare them against the findings of Basaldella et al. [47], Hailu et al. [48], and Langer et al. [49].
Table 1 provides a comprehensive overview of the data used. In the performance metrics shown in Table 2, the spaCy framework achieves the best overall performance for the CHEBI ontology with an F1-score of 90.27% (highlighted in bold), followed by Hugging Face Transformers at 90.03%. SpaCy demonstrates the highest Precision (93.57%), while Hugging Face Transformers achieves the highest Recall (93.41%). Stanford CoreNLP lags significantly behind with 73.74%. These results surpass the models of Basaldella et al. [47] and Hailu et al. [48] and slightly exceed the model of Langer et al. [49]. For the CL ontology, the model of Basaldella et al. [47] achieves the best F1-score (91%), closely followed by spaCy (90.22%). Notably, Hugging Face Transformers (F1: 78.28%) is weaker, while CoreNLP (F1: 85.66%) has high Precision (94.23%) but lower Recall. The results for the PR ontology in Table 3 show that spaCy leads in all three metrics (P: 96.43%; R: 84.14%; and F1: 89.87%), significantly surpassing the models of Basaldella et al. [47] and Hailu et al. [48]. Hugging Face Transformers achieves an F1-score of 81.57%, while CoreNLP reaches 72.97%. For the UBERON ontology, spaCy leads with an F1-score of 90.11%, while CoreNLP records the highest Recall (91.85%). Hugging Face Transformers achieves an F1-score of 84.23%, comparable to that of CoreNLP and better than that of the model of Hailu et al. [48]. Overall, spaCy demonstrates the most consistent performance across all ontologies, achieving the highest F1-score in three out of four cases. CoreNLP and Hugging Face Transformers vary according to the ontology, with CoreNLP performing well in structural annotations (UBERON) and Hugging Face Transformers excelling in CHEBI. The results suggest that the developed information system can effectively train and evaluate NER models for the medical domain. The performance of the models appears comparable to that of the current state of research in several cases and even shows better values in certain areas. However, it is important to note that due to different evaluation strategies, ontology nodes used, and potentially varying data pre-processing methods, direct comparison is subject to significant methodological limitations.

6. Conclusions

This article designed, implemented, and evaluated a system using a generic approach alongside the structured Nunamaker methodology to develop information systems [18]. The system enables medical experts to train, evaluate, and compare ML-based NER models across various NER frameworks.
Section 1 provides an overview of the topic and situates the investigation within its relevant context, laying the foundation for the subsequent analysis. Section 2 offers a comprehensive review of the current state of the art, leading to the identification of three RCs: developing an abstraction layer for ML-based NER, creating a standardized training process for medical professionals, and implementing the Bridge Pattern to construct a cross-platform abstraction layer that efficiently supports the training of NER models in various programming languages. Section 3 addresses these RCs, laying the groundwork for designing an abstraction layer aimed at helping medical experts compare and select ML-based NER frameworks, as well as train NER models using various NER frameworks. Section 4 provides a detailed description of the development of all components of the abstraction layer and illustrates their integration into KM-EP. Section 5 discusses the experimental goals and evaluates the system through quantitative experiments. Various ontologies of the CRAFT corpus were used to train NER models with different NER frameworks and to compare the results with those from the literature. Although full comparability with the experiments in the literature could not be ensured, the system achieved good results relative to them, suggesting that it is suitable for its intended purpose. Thus, it was shown that a generic approach can standardize the training and evaluation of ML-based NER models across various frameworks (RQ1). Furthermore, an abstraction layer for framework-independent training was modeled, developed, and evaluated to facilitate the comparison and selection of NER frameworks (RQ2).

In future work, comprehensive qualitative experiments will be conducted with medical experts to evaluate the usability of FIT4NER for domain specialists. These will include detailed user studies and feedback sessions to ensure that the specific requirements and expectations of medical professionals are met. In addition, more NER frameworks such as NLTK, Flair, and OpenNLP, along with cloud-based NER APIs such as Microsoft Azure Cognitive Services and OpenAI GPT-4, will be integrated. This will expand the range of available NER services and further demonstrate the generalizability of the architecture. FIT4NER’s applicability to other NLP tasks, such as relation extraction and classification, will also be explored to showcase its flexibility and adaptability beyond NER. The evaluation process will likewise be expanded through experiments with various parameters to achieve more detailed comparability with other research efforts.

In summary, this study successfully addressed the defined RQs and resolved the challenges identified in Section 2. The results highlight the potential of the prototype to support NER model training for medical professionals.

Author Contributions

Conceptualization, F.F., P.T., F.W. and M.H.; methodology, F.F., P.T., F.W. and M.H.; software, F.F., P.T. and F.W.; validation, F.F., P.T. and M.H.; formal analysis, F.F., P.T. and M.H.; investigation, M.H.; resources, F.F., P.T. and M.H.; data curation, F.F., P.T. and M.H.; writing—original draft preparation, F.F. and F.W.; writing—review and editing, P.T.; visualization, F.F., P.T. and M.H.; supervision, M.H.; project administration, F.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were waived for this study because it did not directly involve humans or animals; the system supports medical experts in obtaining information.

Informed Consent Statement

Patient consent was waived because the study did not directly involve humans or animals.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Freund, F.; Tamla, P.; Hemmje, M. Towards Improving Clinical Practice Guidelines through Named Entity Recognition: Model Development and Evaluation. In Proceedings of the 2023 31st Irish Conference on Artificial Intelligence and Cognitive Science (AICS), Letterkenny, Ireland, 7–8 December 2023; pp. 1–8. [Google Scholar] [CrossRef]
  2. Steinberg, E.; Greenfield, S.; Wolman, D.M.; Mancher, M.; Graham, R. Clinical Practice Guidelines We Can Trust; National Academies Press: Washington, DC, USA, 2011. [Google Scholar]
  3. Kuhlen, R. Informationsethik: Umgang mit Wissen und Information in Elektronischen Räumen; UVK Verlag-Gesellschaft: Munich, Germany, 2004. [Google Scholar]
  4. Wen, C.; Chen, T.; Jia, X.; Zhu, J. Medical Named Entity Recognition from Un-labelled Medical Records Based on Pre-trained Language Models and Domain Dictionary. Data Intell. 2021, 3, 402–417. [Google Scholar] [CrossRef]
  5. Pakhale, K. Comprehensive Overview of Named Entity Recognition: Models, Domain-Specific Applications and Challenges. arXiv 2023, arXiv:2309.14084. [Google Scholar]
  6. Freund, F.; Tamla, P.; Hemmje, M. Survey: Understand the Challenges of Machine Learning Experts Using Named Entity Recognition Tools. In Proceedings of the Computer Science, Copenhagen, Denmark, 25–26 January 2025; Volume 15, pp. 115–134. [Google Scholar] [CrossRef]
  7. Bielefeld University. RATIO: Rationalizing Recommendations (RecomRatio). 2017. Available online: https://spp-ratio.de/projects/recomratio/ (accessed on 22 April 2025).
  8. Hemmje, M. Chair of Multimedia and Internet Applications. 2023. Available online: https://www.fernuni-hagen.de/multimedia-internetanwendungen/en/ (accessed on 22 April 2025).
  9. FTK. FTK e.V. Research Institute for Telecommunications and Cooperation. 2023. Available online: https://www.ftk.de/en (accessed on 22 April 2025).
  10. Lamberth-Cocca, S.; Dimanova, V.; Bruchhaus, S.; Nawroth, C.; Mc Kevitt, P.; Hemmje, M. Towards Robust Named Entity Recognition to Support the Extraction of Emerging Technological Knowledge from Biomedical Literature. In Proceedings of the CERC 2023, Barcelona, Spain, 9–10 June 2023. [Google Scholar] [CrossRef]
  11. FTK. Artificial Intelligence for Hospitals, Healthcare & Humanity (AI4H3); R&D White Paper; FTK e.V. Research Institute for Telecommunications and Cooperation: Dortmund, Germany, 2020. [Google Scholar]
  12. Vu, B.; Wu, Y.; Afli, H.; Mc Kevitt, P.; Walsh, P.; Engel, F.; Fuchs, M.; Hemmje, M. A Metagenomic Content and Knowledge Management Ecosystem Platform. In Proceedings of the 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), San Diego, CA, USA, 18–21 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–8. [Google Scholar]
  13. Tamla, P.; Hartmann, B.; Nguyen, N.; Kramer, C.; Freund, F.; Hemmje, M. CIE: A Cloud-Based Information Extraction System for Named Entity Recognition in AWS, Azure, and Medical Domain. In Knowledge Discovery, Knowledge Engineering and Knowledge Management; Coenen, F., Fred, A., Aveiro, D., Dietz, J., Bernardino, J., Masciari, E., Filipe, J., Eds.; Springer Nature: Cham, Switzerland, 2023; pp. 127–148. [Google Scholar]
  14. Freund, F.; Tamla, P.; Reis, T.; Hemmje, M.; Kevitt, P.M. FIT4NER—Towards a Framework-Independent Toolkit for Named Entity Recognition. In Proceedings of the CERC 2023, Barcelona, Spain, 9–10 June 2023. [Google Scholar] [CrossRef]
  15. Tamla, P.; Freund, F.; Hemmje, M. Supporting Named Entity Recognition and Document Classification for Effective Text Retrieval. In The Role of Gamification in Software Development Lifecycle; IntechOpen: London, UK, 2021. [Google Scholar] [CrossRef]
  16. Freund, F.; Tamla, P.; Wilde, F.; Hemmje, M. Making Medical Experts Fit4ner: Transforming Domain Knowledge through Machine Learning-Based Named Entity Recognition. Int. J. Nat. Lang. Comput. 2025, 14. [Google Scholar] [CrossRef]
  17. Wilde, F. Entwicklung einer Microservice-basierten Abstraktionsebene für das Framework-unabhängige Training von Named Entity Recognition in einem Wissensmanagement-System für den medizinischen Bereich [Development of a Microservice-Based Abstraction Layer for Framework-Independent Training of Named Entity Recognition in a Knowledge Management System for the Medical Domain]. Master’s Thesis, FernUniversität in Hagen, Hagen, Germany, 2025. [Google Scholar]
  18. Nunamaker, J.F., Jr.; Chen, M.; Purdin, T.D.M. Systems Development in Information Systems Research. J. Manag. Inf. Syst. 1990, 7, 89–106. [Google Scholar] [CrossRef]
  19. Jehangir, B.; Radhakrishnan, S.; Agarwal, R. A Survey on Named Entity Recognition — Datasets, Tools, and Methodologies. Nat. Lang. Process. J. 2023, 3, 100017. [Google Scholar] [CrossRef]
  20. Pagad, N.S.; Pradeep, N. Clinical Named Entity Recognition Methods: An Overview. In Proceedings of the International Conference on Innovative Computing and Communications, Delhi, India, 19–20 February 2022; pp. 151–165. [Google Scholar]
  21. Li, J.; Sun, A.; Han, J.; Li, C. A Survey on Deep Learning for Named Entity Recognition. IEEE Trans. Knowl. Data Eng. 2022, 34, 50–70. [Google Scholar] [CrossRef]
  22. Konkol, I.M. Named Entity Recognition. Ph.D. Thesis, University of West Bohemia, Plzeň, Czech Republic, 2015. [Google Scholar]
  23. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar] [CrossRef]
  24. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models Are Unsupervised Multitask Learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  25. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar] [CrossRef]
  26. Murray-Smith, D.J. Modelling and Simulation of Integrated Systems in Engineering; Woodhead Publishing Limited: Sawston, UK, 2012. [Google Scholar] [CrossRef]
  27. Raaijmakers, S. Deep Learning for Natural Language Processing; Simon and Schuster: New York, NY, USA, 2022. [Google Scholar]
  28. TensorFlow Developers. TensorFlow. Zenodo 2021. [Google Scholar] [CrossRef]
  29. Gardner, M.; Grus, J.; Neumann, M.; Tafjord, O.; Dasigi, P.; Liu, N.; Peters, M.; Schmitz, M.; Zettlemoyer, L. AllenNLP: A Deep Semantic Natural Language Processing Platform. arXiv 2018, arXiv:1803.07640. [Google Scholar] [CrossRef]
  30. Klein, G.; Kim, Y.; Deng, Y.; Senellart, J.; Rush, A.M. OpenNMT: Open-Source Toolkit for Neural Machine Translation. arXiv 2017, arXiv:1701.02810. [Google Scholar] [CrossRef]
  31. Miller, A.H.; Feng, W.; Fisch, A.; Lu, J.; Batra, D.; Bordes, A.; Parikh, D.; Weston, J. ParlAI: A Dialog Research Software Platform. arXiv 2018, arXiv:1705.06476. [Google Scholar] [CrossRef]
  32. Chen, W.; Qiu, P.; Cauteruccio, F. MedNER: A Service-Oriented Framework for Chinese Medical Named-Entity Recognition with Real-World Application. Big Data Cogn. Comput. 2024, 8, 86. [Google Scholar] [CrossRef]
  33. van Rijn, J.N.; Hutter, F. Hyperparameter Importance Across Datasets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; KDD ’18. pp. 2367–2376. [Google Scholar] [CrossRef]
  34. Ruder, S. An Overview of Gradient Descent Optimization Algorithms. arXiv 2017, arXiv:1609.04747. [Google Scholar] [CrossRef]
  35. Jiang, R.; Banchs, R.E.; Li, H. Evaluating and Combining Name Entity Recognition Systems. In Proceedings of the Sixth Named Entity Workshop, Berlin, Germany, 12 August 2016; Duan, X., Banchs, R.E., Zhang, M., Li, H., Kumaran, A., Eds.; pp. 21–27. [Google Scholar] [CrossRef]
  36. Sang, E.F.T.K.; De Meulder, F. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. arXiv 2003, arXiv:cs/0306050. [Google Scholar] [CrossRef]
  37. Mansouri, A.; Affendey, L.S.; Mamat, A. Named Entity Recognition Approaches. Int. J. Comput. Sci. Netw. Secur. 2008, 8, 339–344. [Google Scholar]
  38. Naseer, S.; Ghafoor, M.M.; Alvi, S.b.K.; Kiran, A.; Shafique Ur Rahmand, G.M.; Murtaza, G. Named Entity Recognition (NER) in NLP Techniques, Tools Accuracy and Performance. Pak. J. Multidiscip. Res. 2021, 2, 293–308. [Google Scholar]
  39. Manning, C.; Surdeanu, M.; Bauer, J.; Finkel, J.; Bethard, S.; McClosky, D. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, MD, USA, 23–24 June 2014; pp. 55–60. [Google Scholar] [CrossRef]
  40. Honnibal, M.; Montani, I.; Van Landeghem, S.; Boyd, A. spaCy: Industrial-strength Natural Language Processing in Python. Zenodo 2020. [Google Scholar]
  41. Jain, S.M. Introduction to Transformers for NLP: With the Hugging Face Library and Models to Solve Problems; Apress: Berkeley, CA, USA, 2022. [Google Scholar] [CrossRef]
  42. ElBatanony, A.; Succi, G. Towards the No-Code Era: A Vision and Plan for the Future of Software Development. In Proceedings of the 1st ACM SIGPLAN International Workshop on Beyond Code: No Code, Chicago, IL, USA, 17 October 2021; BCNC 2021. pp. 29–35. [Google Scholar] [CrossRef]
  43. Hu, C. Software Design Patterns. In An Introduction to Software Design: Concepts, Principles, Methodologies, and Techniques; Springer International Publishing: Cham, Switzerland, 2023; pp. 231–275. [Google Scholar] [CrossRef]
  44. Norman, D.A.; Draper, S.W. User Centered System Design; New Perspectives on Human-Computer Interaction; L. Erlbaum Associates Inc.: Mahwah, NJ, USA, 1986. [Google Scholar]
  45. Rumbaugh, J.; Jacobson, I.; Booch, G. The Unified Modeling Language Reference Manual; Addison-Wesley Longman Ltd.: London, UK, 2021. [Google Scholar]
  46. Cohen, K.B.; Verspoor, K.; Fort, K.; Funk, C.; Bada, M.; Palmer, M.; Hunter, L.E. The Colorado Richly Annotated Full Text (CRAFT) Corpus: Multi-Model Annotation in the Biomedical Domain. In Handbook of Linguistic Annotation; Ide, N., Pustejovsky, J., Eds.; Springer: Dordrecht, The Netherlands, 2017; pp. 1379–1394. [Google Scholar] [CrossRef]
  47. Basaldella, M.; Furrer, L.; Tasso, C.; Rinaldi, F. Entity Recognition in the Biomedical Domain Using a Hybrid Approach. J. Biomed. Semant. 2017, 8, 51. [Google Scholar] [CrossRef]
  48. Hailu, N.D.; Bada, M.; Hadgu, A.T.; Hunter, L.E. Biomedical Concept Recognition Using Deep Neural Sequence Models. bioRxiv 2019, 530337. [Google Scholar] [CrossRef]
  49. Langer, S.; Neuhaus, F.; Nürnberger, A. CEAR: Creating a Knowledge Graph of Chemical Entities and Roles in Scientific Literature. In Proceedings of the Joint Ontology Workshops (JOWO)—Episode X, Enschede, The Netherlands, 15–19 July 2024. [Google Scholar]
  50. Degtyarenko, K.; de Matos, P.; Ennis, M.; Hastings, J.; Zbinden, M.; McNaught, A.; Alcántara, R.; Darsow, M.; Guedj, M.; Ashburner, M. ChEBI: A Database and Ontology for Chemical Entities of Biological Interest. Nucleic Acids Res. 2008, 36, D344–D350. [Google Scholar] [CrossRef]
  51. Diehl, A.D.; Meehan, T.F.; Bradford, Y.M.; Brush, M.H.; Dahdul, W.M.; Dougall, D.S.; He, Y.; Osumi-Sutherland, D.; Ruttenberg, A.; Sarntivijai, S.; et al. The Cell Ontology 2016: Enhanced Content, Modularization, and Ontology Interoperability. J. Biomed. Semant. 2016, 7, 44. [Google Scholar] [CrossRef]
  52. Natale, D.A.; Arighi, C.N.; Blake, J.A.; Bult, C.J.; Christie, K.R.; Cowart, J.; D’Eustachio, P.; Diehl, A.D.; Drabkin, H.J.; Helfer, O.; et al. Protein Ontology: A Controlled Structured Network of Protein Entities. Nucleic Acids Res. 2014, 42, D415–D421. [Google Scholar] [CrossRef] [PubMed]
  53. Mungall, C.J.; Torniai, C.; Gkoutos, G.V.; Lewis, S.E.; Haendel, M.A. Uberon, an Integrative Multi-Species Anatomy Ontology. Genome Biol. 2012, 13, R5. [Google Scholar] [CrossRef] [PubMed]
  54. Bhargava, P.; Drozd, A.; Rogers, A. Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics. arXiv 2021, arXiv:2110.01518. [Google Scholar] [CrossRef]
  55. Turc, I.; Chang, M.; Lee, K.; Toutanova, K. Well-Read Students Learn Better: The Impact of Student Initialization on Knowledge Distillation. arXiv 2019, arXiv:1908.08962. [Google Scholar] [CrossRef]
Figure 1. Process model for AI-based knowledge extraction support for CPG development [1].
Figure 2. Use cases for Model Definition User and Model End User [16].
Figure 3. CDD-based abstraction of NER model training.
Figure 4. Activity diagram of the generic training process.
Figure 5. FIT4NER component diagram [16].
Figure 6. FIT4NER distribution diagram and integration with KM-EP.
Figure 7. OpenAPI specification of the NER Framework Independent Service.
Figure 8. KM-EP GUI for selecting NER frameworks and models [16].
Table 1. Number of NE annotations in CRAFT.

ChEBI
NE                       Training   Evaluation   Total
chemical entity          4611       1108         5719
role                     828        229          1056
subatomic particle       79         18           98
Total                    5518       1355         6873

CL
NE                         Training   Evaluation   Total
electrically active cell   506        117          623
eukaryotic cell            2968       748          3716
hematopoietic cell         350        61           411
motile cell                269        64           333
nucleate cell              27         6            33
secretory cell             122        35           157
Total                      4242       1031         5273

PR
NE        Training   Evaluation   Total
protein   15,668     3,919        19,587

UBERON
NE                                Training   Evaluation   Total
anatomical collection             165        35           200
anatomical structure              13,155     3,208        16,363
developing anatomical structure   231        66           297
organism substance                562        160          722
Total                             14,113     3,469        17,582
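For orientation, the counts above reflect an approximately 80/20 split between training and evaluation annotations in every ontology. A minimal illustrative Python sketch (not part of the article’s tooling) verifies this from the column totals:

```python
# Illustrative only: train/evaluation proportions from the Table 1 totals.
totals = {
    "ChEBI":  (5518, 1355),
    "CL":     (4242, 1031),
    "PR":     (15668, 3919),
    "UBERON": (14113, 3469),
}
for ontology, (train, evaluation) in totals.items():
    share = train / (train + evaluation)
    print(f"{ontology}: {share:.1%} training")  # each close to 80%
```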
Table 2. Quantitative evaluation of NER framework services with CRAFT (ChEBI/CL). P = precision, R = recall (values in %).

                            ChEBI                     CL
NER Framework               P      R      F1          P      R      F1
Basaldella et al. [47]      89     73     77          95     88     91
Langer et al. [49]          93.4   85.1   89.0        -      -      -
Hailu et al. [48]           88     61     72          86     77     81
spaCy                       93.57  87.19  90.27       92.48  88.07  90.22
CoreNLP                     87.06  63.96  73.74       94.23  78.53  85.66
Hugging Face Transformers   86.89  93.41  90.03       77.61  78.96  78.28
Table 3. Quantitative evaluation of NER framework services with CRAFT (PR/UBERON). P = precision, R = recall (values in %).

                            PR                        UBERON
NER Framework               P      R      F1          P      R      F1
Basaldella et al. [47]      86     84     80          -      -      -
Hailu et al. [48]           66     38     48          81     76     79
spaCy                       96.43  84.14  89.87       92.29  83.08  90.11
CoreNLP                     76.65  70.48  72.97       77.46  91.85  84.04
Hugging Face Transformers   82.11  81.08  81.57       86.80  81.81  84.23
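The F1 values reported in Tables 2 and 3 are the harmonic mean of precision and recall. As a quick plausibility check (a minimal sketch, not code from the article), the spaCy/ChEBI entry can be reproduced as follows:

```python
# Illustrative sketch: F1 as the harmonic mean of precision and recall
# (values in percent, as in Tables 2 and 3).
def f1_score(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# spaCy on CRAFT/ChEBI (Table 2): P = 93.57, R = 87.19
print(round(f1_score(93.57, 87.19), 2))  # -> 90.27
```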
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

