This section describes the approach, which integrates classical techniques for NER and deep learning methods for automatic relation extraction. The primary objective of this approach is to construct and improve a medical knowledge graph. This knowledge representation integrates relevant information extracted from the abstracts of scientific articles.
3.1. Medical Scientific Texts
The initial knowledge is obtained from a set of scientific articles belonging to a specialized collection [20]. This dataset is part of the MESINESP (Medical Semantic Indexing in Spanish) [21] task, whose objective is to promote the development of semantic indexing tools for biomedical content in languages other than English. All documents were annotated with DeCS (Descriptors in Health Sciences) [22] descriptors, a structured and controlled vocabulary developed by BIREME (Latin American and Caribbean Center on Health Sciences Information) [22] to index scientific publications in BVSalud (Virtual Health Library) [23].
Plain text documents undergo a preprocessing procedure informed by exploratory analysis, whose objective is to identify and preserve information that may be of medical significance. Cleaning removes punctuation marks, including periods, commas, semicolons, underscores, slashes, brackets, parentheses, colons, and both Spanish and English quotation marks. Regular expressions are also used to suppress numbers and digits combined with special characters, such as dots, slashes, hyphens, plus signs, and hashtags. Symbols that can carry meaning, such as plus/minus, equals, greater than, less than, percent, and the ampersand, are analyzed as well, since they may be needed to interpret the text correctly. This analysis enables the flexible management of terms, allowing for their complete preservation or omission, while ensuring the accurate identification of named medical entities. For instance, the term “COVID-19” is left unaltered thanks to the regular expressions used and the analysis of which signs can be removed.
Figure 2 shows an example of a text in its original form and the result of applying the preprocessing techniques described.
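The cleaning step can be illustrated with a minimal Python sketch. The exact regular expressions are not given in the text, so the two patterns below (`PUNCTUATION` and `NUMERIC_TOKEN`) and the function name `clean_text` are assumptions chosen to reproduce the described behavior: punctuation is removed, purely numeric tokens are suppressed, and terms that mix letters and digits, such as “COVID-19”, are preserved.

```python
import re

# Punctuation listed in the text: periods, commas, semicolons, underscores,
# slashes, brackets, parentheses, colons, Spanish and English quotation marks.
# (Hypothetical pattern; the original expressions are not published here.)
PUNCTUATION = re.compile(r'[.,;_/\[\]():"«»“”]')

# A whitespace-delimited token made only of digits, hyphens, plus signs, and
# hashtags, containing at least one digit (e.g. "120", "10-20", "#4").
# Tokens that also contain letters, such as "COVID-19", never match.
NUMERIC_TOKEN = re.compile(r"(?<!\S)(?=\S*\d)[\d#+\-]+(?!\S)")

def clean_text(text: str) -> str:
    """Remove punctuation and numeric tokens, keeping mixed terms intact."""
    text = PUNCTUATION.sub(" ", text)      # "3.5" becomes "3 5", "12/05" becomes "12 05"
    text = NUMERIC_TOKEN.sub(" ", text)    # drop the remaining numeric tokens
    return re.sub(r"\s+", " ", text).strip()

if __name__ == "__main__":
    print(clean_text("Pacientes con COVID-19 (n = 120), tratados el 12/05/2020; dosis: 3.5 mg."))
```

In this sketch, symbols such as `=`, `<`, and `%` are deliberately left in place, which is one possible reading of the symbol analysis described above.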
3.3. Medical Relation Extraction
In the medical relation extraction phase, the previously processed texts are segmented. Segmentation identifies and extracts every sentence that contains at least two recognized medical entities, with the restriction that the distance between them must not exceed 20 words.
The identified entities are masked within each sentence using specific tags according to their type. The tags @ENFE@, @ANAT@, and @MEDI@ are used for diseases, anatomy, and drugs, respectively. Masking standardizes the representation of entities to facilitate the detection of medical relationships. Each extracted sentence contains all the words between the two entities, along with a context window of five words on both sides.
All combinations of entities are evaluated to derive several candidate sentences from the same text fragment. The entities involved in a relationship are hidden in each sentence, preserving the rest of the content.
Consecutive tokens labeled B (beginning) and I (inside) for the same entity are grouped into a single masked term. For example, for a disease, the whole labeled sequence (B-ENFE followed by I-ENFE) is replaced by a single @ENFE@ tag. The original term is retained for further analysis and representation.
Figure 4 shows this process.
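The segmentation and masking steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the input is taken to be a list of tokens with B-/I-/O labels, the distance between two entities is measured as the number of words separating them, and the function names `group_entities` and `candidate_sentences` are hypothetical.

```python
from itertools import combinations

MAX_DISTANCE = 20   # maximum number of words between the two entities
CONTEXT = 5         # context words kept on each side of the pair

def group_entities(tokens, labels):
    """Merge B-/I- label runs into (start, end, type, text) spans; end is exclusive."""
    spans, start, etype = [], None, None
    for i, label in enumerate(labels + ["O"]):      # the sentinel closes a trailing span
        if label.startswith("I-") and start is not None and label[2:] == etype:
            continue                                 # still inside the current entity
        if start is not None:
            spans.append((start, i, etype, " ".join(tokens[start:i])))
            start, etype = None, None
        if label.startswith("B-"):
            start, etype = i, label[2:]
    return spans

def candidate_sentences(tokens, labels):
    """Build one masked candidate for every entity pair that is close enough."""
    candidates = []
    for a, b in combinations(group_entities(tokens, labels), 2):  # a precedes b
        if b[0] - a[1] > MAX_DISTANCE:
            continue
        left = max(0, a[0] - CONTEXT)
        right = min(len(tokens), b[1] + CONTEXT)
        # Only the two entities of the pair are masked; other words stay unchanged.
        words = (tokens[left:a[0]] + [f"@{a[2]}@"]
                 + tokens[a[1]:b[0]] + [f"@{b[2]}@"]
                 + tokens[b[1]:right])
        candidates.append({"text": " ".join(words),
                           "entity_a": a[3], "entity_b": b[3]})
    return candidates

if __name__ == "__main__":
    tokens = "el paracetamol alivia el dolor de cabeza intenso".split()
    labels = ["O", "B-MEDI", "O", "O", "B-ENFE", "I-ENFE", "I-ENFE", "O"]
    for candidate in candidate_sentences(tokens, labels):
        print(candidate)
```

The original entity text is kept alongside each masked sentence so that it can later be used when building the knowledge graph.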
Figure 5 presents the sentence segmentation process applied to the original text, together with entity masking.
The purpose of entity masking is to enable the model to identify specific medical relationships between two entities. It also prevents the introduction of other medical terms that may interfere with the context. This highlights the importance of local context and facilitates the accurate management of compound medical terms. In the example in Figure 5, only the entities involved in each sentence are masked, and the rest of the text remains unchanged.
Masking is based on Harris’s distributional hypothesis [26], which posits that words that appear in similar contexts tend to have similar meanings. Therefore, the co-occurrence of two nearby medical entities suggests the likelihood of a shared semantic context, indicating a possible relationship between them. This principle is reinforced by LLMs that utilize attention mechanisms to assign greater weight to nearby tokens, thereby facilitating the identification of meaningful relationships.
Segmentation was performed on 990 articles from the BioASQ corpus, extracting 1912 sentences with a minimum of two medical entities (ANAT, ENFE, or MEDI). Three medical experts manually annotated the semantic relationships between entities. The annotation system is binary, assigning a value of 1 to indicate a valid relationship and 0 in the case of an invalid relationship.
The resulting dataset, accessible in [27], is used to train an LLM for the binary classification of medical relations. In total, 80% of the sentences are allocated for training and the remaining 20% for evaluation. A maximum precision of 90.6% is obtained using MedicoBERT [3] as the base model.
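As a rough illustration of the evaluation protocol, the following Python sketch shows an 80/20 split and a precision computation. It assumes a simple shuffled split with a fixed seed and takes "precision" in its standard sense, TP / (TP + FP); neither detail is specified in the text, and the function names are hypothetical.

```python
import random

def train_eval_split(examples, train_fraction=0.8, seed=13):
    """Shuffle a copy of the examples and split it into training and evaluation parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)      # fixed seed for reproducibility (assumption)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def precision(y_true, y_pred):
    """Share of predicted relations (label 1) that are truly valid relations."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0

if __name__ == "__main__":
    train, held_out = train_eval_split(range(1912))
    print(len(train), len(held_out))
```

With 1912 annotated sentences, this particular rounding yields 1529 training and 383 evaluation sentences; the exact counts in the original experiments may differ.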
According to the authors of MedicoBERT, the model was pre-trained on two different tasks: masked language modeling (MLM) and next sentence prediction (NSP). These tasks were performed using a dataset consisting of more than three million medical texts in Spanish. The corpus is made up of three different biomedical datasets recognized by the research community. The BioASQ, CoWeSe, and CORD-19 datasets comprise approximately 1.1 billion words in total.
The MedicoBERT tokenizer, developed to process medical texts with a vocabulary of over 50,000 specialized tokens, is also used. This tokenizer enables accurate and contextual representation of medical terms to improve performance on specific tasks.
Initially, a preliminary adjustment of the hyperparameters is performed using an exploratory approach to identify a candidate range for each parameter. This phase involved a literature-guided search adapted to the context of medical relationship identification.
The subsequent adjustment stage employed Bayesian optimization, which constructs a probabilistic surrogate model (typically a Gaussian process, GP) to estimate the model’s performance across a range of hyperparameter configurations. The search is guided by an acquisition function designed to balance exploration and exploitation while seeking a global optimum. The model’s performance is treated as a random function, as shown in Equation (1):

$$f(\mathbf{x}) \sim \mathcal{GP}\left(m(\mathbf{x}),\, k(\mathbf{x}, \mathbf{x}')\right) \quad (1)$$

where $m(\mathbf{x})$ is the expected model performance and $k(\mathbf{x}, \mathbf{x}')$ is a kernel representing similarity between configurations.
The expected improvement (EI) acquisition function is used to select the next set of hyperparameters to be evaluated. Equation (2) represents the expected value of the improvement in the objective function $f(\mathbf{x})$ with respect to the best known value $f(\mathbf{x}^{+})$:

$$\mathrm{EI}(\mathbf{x}) = \mathbb{E}\left[\max\left(f(\mathbf{x}) - f(\mathbf{x}^{+}),\, 0\right)\right] \quad (2)$$

This improvement is only considered when it is positive; otherwise, a value of 0 is assigned. Thus, points that do not exceed the current best value do not penalize the result, but they also do not contribute to the expected value. This behavior balances exploitation and exploration during the optimization process: points with a high expected mean are favored, which corresponds to exploiting acquired knowledge, while points with high predictive uncertainty are also valued, which encourages the exploration of potentially promising regions of the search space.
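When the surrogate's prediction at a configuration is Gaussian with mean $\mu$ and standard deviation $\sigma$, the expected improvement over the best known value $f^{+}$ has the well-known closed form $(\mu - f^{+})\,\Phi(z) + \sigma\,\phi(z)$ with $z = (\mu - f^{+})/\sigma$, where $\Phi$ and $\phi$ are the standard normal CDF and PDF. The Python sketch below evaluates this expression for a maximization problem. The candidate names and their predicted values are invented purely for illustration and are not taken from the experiments.

```python
import math

def expected_improvement(mu: float, sigma: float, best: float) -> float:
    """Closed-form EI for maximization under a Gaussian prediction N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(mu - best, 0.0)             # no uncertainty: plain improvement
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best) * cdf + sigma * pdf

if __name__ == "__main__":
    best_so_far = 0.90
    # Hypothetical surrogate predictions (mean, standard deviation) per candidate.
    candidates = {
        "high_mean_low_uncertainty": (0.91, 0.01),
        "lower_mean_high_uncertainty": (0.88, 0.05),
        "low_mean_low_uncertainty": (0.85, 0.01),
    }
    scores = {name: expected_improvement(mu, sd, best_so_far)
              for name, (mu, sd) in candidates.items()}
    print(max(scores, key=scores.get), scores)
```

In this toy setting, the uncertain candidate obtains the highest score even though its predicted mean is below the best known value, which illustrates how EI rewards exploration as well as exploitation.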
The trained model is designed to identify the presence of medical relationships in sentences with two masked entities. The entities involved can be diseases, anatomy, or medications. The model output has a binary structure, where 1 indicates the presence of a relationship and 0 means its absence.
Each relationship annotated as valid is assigned a rule that is semantically consistent with the types of the two entities involved.
Table 2 presents the six types of rules that demonstrate the semantic validity of the relationships transferred to the knowledge graph.
In addition, a hierarchical relationship is defined for cases in which both entities are of the same type, using the generic relationship “es_un” (is_a), as shown in Table 3.
These relationships enable the construction of informative triplets of the form relationship (entity A, entity B). When new texts are processed, a verification mechanism prevents the duplication of generated triplets. The specific relationships identified are checked and the hierarchical relationship is updated to ensure semantic enrichment.
For example, if a relationship is found between dolor de cabeza (headache) and paracetamol (paracetamol) (ENFE and MEDI), the triplets generated are as follows:
es_tratamiento (paracetamol, dolor de cabeza)
is_treatment_for (paracetamol, headache);
es_un (paracetamol, medicamento)
is_a (paracetamol, drug);
es_un (dolor de cabeza, enfermedad)
is_a (headache, disease).
The triplets reflect the specific semantic relationships es_tratamiento (is_treatment_for) and es_un (is_a). This integration of contextual links with structured representations in the knowledge graph facilitates the interpretation of medical information.
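The triplet generation and duplicate check can be sketched in Python as follows. Only the rule confirmed by the example above (es_tratamiento between a drug and a disease) and the two type names that appear in it are included; the remaining rules and type names would be added in the same form. The names `SPECIFIC_RULES`, `TYPE_NAMES`, and `build_triplets` are hypothetical, and the graph is represented as a plain set of triplets for simplicity.

```python
# Type names used in the es_un triplets of the example (other types omitted here).
TYPE_NAMES = {"MEDI": "medicamento", "ENFE": "enfermedad"}

# Unordered type pair -> (relation, type of the first argument).
# Only the rule shown in the example is listed; this is not the full rule table.
SPECIFIC_RULES = {frozenset(("MEDI", "ENFE")): ("es_tratamiento", "MEDI")}

def build_triplets(entity_a, entity_b, graph):
    """Each entity is (text, type). Adds unseen triplets to `graph` and returns them."""
    (text_a, type_a), (text_b, type_b) = entity_a, entity_b
    triplets = []
    rule = SPECIFIC_RULES.get(frozenset((type_a, type_b)))
    if rule:
        relation, head_type = rule
        head, tail = (text_a, text_b) if type_a == head_type else (text_b, text_a)
        triplets.append((relation, head, tail))
    for text, etype in (entity_a, entity_b):
        if etype in TYPE_NAMES:
            triplets.append(("es_un", text, TYPE_NAMES[etype]))
    new = []
    for triplet in triplets:
        if triplet not in graph:               # verification step: skip duplicates
            graph.add(triplet)
            new.append(triplet)
    return new

if __name__ == "__main__":
    graph = set()
    pair = (("dolor de cabeza", "ENFE"), ("paracetamol", "MEDI"))
    print(build_triplets(*pair, graph))   # three new triplets
    print(build_triplets(*pair, graph))   # nothing new the second time
```

Storing triplets in a set makes the duplicate check a constant-time membership test; a graph database or an indexed table would serve the same purpose at larger scale.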
Figure 6 illustrates a section of the knowledge graph, highlighting the relationships from the example mentioned above.
In a knowledge graph, triplets are represented by a network of nodes and edges. In this context, the nodes correspond to medical entities and the edges represent relationships. The final knowledge graph contains 4355 nodes (2217 diseases, 969 drugs, and 1169 anatomical entities), as well as 12,294 extracted relationships.
Figure 7 shows a knowledge subgraph with nodes of each entity type (disease, anatomy, and medicine) to facilitate visual analysis and interpretation.