Information Extraction from Multi-Domain Scientific Documents: Methods and Insights

Batura, Tatiana; Yerimbetova, Aigerim; Mukazhanov, Nurzhan; Shvarts, Nikita; Sakenov, Bakzhan; Turdalyuly, Mussa

doi:10.3390/app15169086

by Tatiana Batura ^1,2, Aigerim Yerimbetova ^1,3,*, Nurzhan Mukazhanov ^1,4, Nikita Shvarts ^1,2, Bakzhan Sakenov ^1 and Mussa Turdalyuly ^1,4,*

^1 Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan
^2 A.P. Ershov Institute of Informatics Systems, Novosibirsk 630090, Russia
^3 Global Education and Training, University of Illinois Urbana-Champaign, Champaign, IL 61801, USA
^4 Department of Software Engineering, Satbayev University, Almaty 050013, Kazakhstan
^* Authors to whom correspondence should be addressed.

Appl. Sci. 2025, 15(16), 9086; https://doi.org/10.3390/app15169086

Submission received: 20 July 2025 / Revised: 3 August 2025 / Accepted: 7 August 2025 / Published: 18 August 2025

Abstract

The rapid growth of scientific literature necessitates effective information extraction. However, existing methods face significant challenges, particularly when applied to multi-domain documents and low-resource languages. For Kazakh and Russian, there is a notable lack of annotated corpora and dedicated tools for scientific information extraction. To address this gap, we introduce SciMDIX (Scientific Multi-Domain Information Extraction), a novel multi-domain dataset of scientific documents in Russian and Kazakh, annotated with entities and relations. Our study includes a comprehensive evaluation of entity recognition performance, comparing state-of-the-art models such as BERT, LLaMA, GLiNER, and spaCy across four diverse domains (IT, Linguistics, Medicine, and Psychology) in both languages. The findings highlight the promise of spaCy and GLiNER for practical deployment in under-resourced language settings. Furthermore, we propose a new zero-shot relation extraction model that leverages a multimodal representation by integrating sentence context, entity mentions, and textual definitions of relation classes. Our model can predict semantic relations between entities in new documents, even for a language not encountered during training. This capability is especially valuable for low-resource language scenarios.

Keywords: term extraction; named entity recognition; relation extraction; natural language processing; language models; information extraction

1. Introduction

The rapid growth of scientific literature across various domains has created an overwhelming amount of information, making it increasingly challenging for researchers to extract relevant knowledge and insights from the vast amounts of available data. The ability to effectively extract information from multi-domain scientific documents is crucial for advancing research and facilitating knowledge discovery. Recent advances in natural language processing (NLP) and machine learning have paved the way for the development of automated information extraction methods. These methods can efficiently process large volumes of scientific texts and identify relevant information [1].

Numerous studies have demonstrated the effectiveness of these methods in extracting information from scientific documents, including named entity recognition (NER) and relation extraction (RE) [2]. However, the complexity and diversity of multi-domain scientific documents present considerable challenges to existing information extraction methods. This requires the development of more sophisticated and adaptable approaches [3]. In contrast to the abundant availability of annotated corpora and text processing tools for English and Chinese, the public domain currently offers a very limited number of such resources for Kazakh. Moreover, the subject of information extraction from scientific texts has received scant attention from researchers, both in the Kazakh context and, to a lesser extent, in the Russian context. This underscores the scientific and practical significance of the present research.

While over 100 publicly available datasets exist for English in the field of NLP and Information Extraction (IE), including well-known benchmarks such as ACE2005 [4], TACRED [5], and SciERC [6], the situation is markedly different for other languages. For Russian, fewer than 10 scientific IE datasets are publicly available. Most of them are limited in scale or domain coverage, with resources like RuSERRC [7] and NEREL [8] representing some of the only large-scale efforts in this area.

In contrast, for Kazakh, only 2 to 3 general-purpose NER datasets are publicly available, most notably KazNERD [9], which contains 112,702 sentences and 136,333 entity annotations. However, no open-access scientific corpora annotated for RE or domain-specific NER tasks currently exist for Kazakh. This lack of annotated scientific texts and tools makes research in Kazakh-language IE especially challenging and highlights the importance and novelty of developing resources and models for under-resourced languages.

The primary contributions of this study are as follows:

A new dataset of multi-domain scientific documents in Russian and Kazakh has been created, annotated with entities and relations.
A novel method for building a language model for relation extraction has been developed utilizing zero-shot learning techniques. This method performs effectively in the absence of substantial training data, making it particularly valuable for low-resource languages, such as Kazakh.

The remainder of this paper is organized as follows. Section 2 reviews related work on scientific information extraction. Section 3 describes data collection and the annotation process. Section 4 and Section 5 present the methodology, experimental setup, and results for entity recognition. Section 6 details the experimental setup and the results obtained with RE methods. Section 7 discusses the insights and limitations. Finally, Section 8 concludes the paper and outlines future work.

2. Related Work

IE from multi-domain scientific documents is a crucial area of research, enabling automated knowledge discovery, entity linking, and relation extraction across various disciplines. Numerous datasets have been developed to support these tasks, covering domains such as biomedical literature, artificial intelligence (AI), and machine learning (ML). These resources typically include annotations for NER, RE, coreference resolution, and event extraction.

One of the most prominent datasets in scientific text processing is SciERC, which focuses on identifying scientific entities, extracting inter-entity relations, and resolving coreference chains. The dataset consists of annotated scientific abstracts from AI, NLP, and ML research, and has become a benchmark for evaluating scientific IE models. Complementing this, SciER [2] provides full-text publications in the ML and AI domains with manual annotations for entities and relations, supporting more granular and document-level information extraction.

While the development of English-language scientific IE datasets has been extensive, the situation for other languages is notably different. For Russian, the NEREL-BIO [10] dataset provides annotations across general and biomedical domains. Baseline NER models evaluated on this dataset have demonstrated strong results, with F1-scores ranging from 70% to 83%, depending on entity type and model architecture.

The challenges are even more pronounced for Kazakh, where the scarcity of domain-specific datasets and pre-trained language models significantly hinders progress. KazNERD [9] remains the most prominent resource, though no corpora for relation extraction or domain-specific NER tasks are currently available.

To address these gaps, recent research has explored deep learning and morphology-aware approaches tailored to the agglutinative nature of Kazakh. A hybrid neural model integrating word semantics, morphological embeddings, and graph attention networks has shown promising results in general-domain NER, achieving an F1-score of 88.04% [11]. Similarly, the inclusion of root/entity embeddings and tensor layers in neural architectures has helped mitigate data sparsity and improve performance for morphologically rich languages like Kazakh [12], with reported F1-scores exceeding 85% on KazNERD subsets.

A recent review of Kazakh morphological analysis methods highlights the differences between traditional machine learning models—such as Conditional Random Fields (CRFs) [13]—and modern deep learning approaches, including RNNs, BERT [14], and Transformer-based models. It emphasizes the need for Kazakh-specific adaptations in model architecture and training data [15]. Studies comparing morphological parsers indicate that RNNs often outperform Transformers, achieving F1-scores between 85% and 89%, likely due to their compatibility with the language’s morphological complexity [16].

Multi-task learning approaches have also been applied successfully in low-resource settings. For instance, integrating auxiliary tasks into NER training has yielded improved performance for Kazakh with limited annotated data, achieving F1-scores in the range of 84–87% [17].

Despite these advances, scientific IE for both Russian and Kazakh remains fragmented and limited in scope. While Russian has begun to build domain-specific resources, Kazakh research remains focused largely on general-domain NER. The lack of annotated corpora, domain-specific datasets, and robust pre-trained models continues to be a major barrier, underscoring the urgent need to develop new resources and tools for these under-resourced languages.

3. Data Preparation

The methodology developed for annotating data for entity and relation extraction consists of the following steps. Firstly, abstracts of scientific papers published in open-access journals from 2018 to 2024 were collected. These journals were recommended by the National Committee of the Republic of Kazakhstan for publishing primary scientific research results. As the abstracts in such journals are written in Kazakh and Russian, texts in both languages were collected.

The texts encompass four knowledge domains: Information Technology (Al-Farabi Kazakh National University. Journal of Mathematics, Mechanics and Computer Science. CC BY-NC 4.0 license. https://bm.kaznu.kz/index.php/kaznu/issue/archive (accessed on 17 July 2025)), Linguistics (Al-Farabi Kazakh National University. KazNU Bulletin. Philology Series. CC BY-NC 4.0 license. https://philart.kaznu.kz/index.php/1-FIL/issue/archive (accessed on 17 July 2025)), Medicine (Semey Medical University. Science and Healthcare. CC-BY 4.0 license. https://newjournal.ssmu.kz/en/publication/releases/ (accessed on 17 July 2025)) and Psychology (Al-Farabi Kazakh National University. Journal of Psychology and Sociology. CC BY-NC 4.0 license. https://bulletin-psysoc.kaznu.kz/index.php/1-psy/issue/archive (accessed on 17 July 2025)). The selection of these domains was guided by the goal of creating a diverse and representative dataset for evaluating performance in multi-domain settings. These domains were chosen to capture a broad spectrum of language use, conceptual structures, and knowledge representation styles, ensuring that the evaluation covers a wide range of challenges typically encountered in real-world applications of NLP and knowledge-based systems.

Information Technology reflects highly technical and structured content, characterized by frequent use of abbreviations and standardized terminologies. It represents domains with rapidly evolving vocabularies. Linguistics is a meta-domain concerned with the structure and function of language itself. It introduces specialized terminology that is conceptually abstract and often overlaps with other domains (e.g., cognitive science). The Medicine domain is representative of high-stakes, evidence-based fields where precision and disambiguation are critical. Medical texts are rich in domain-specific entities and hierarchical taxonomies, making them an essential test case for semantic understanding. The Psychology domain bridges the sciences and humanities. It includes both clinical terminology and theoretical constructs. It represents fields with complex conceptual relations and less standardized terminologies.

Secondly, our objective was to propose a relatively universal annotation scheme suitable for any domain of knowledge. The selection of entity and relation types was guided by principles of maximum expressiveness and non-redundancy. The annotation process was conducted in two stages. Initially, entities were annotated separately from relations.

3.1. Entity Annotation

The collected corpus was annotated for two categories of entities: scientific terms (TERM) and numerical values (VALUE). Terms are defined as words or phrases used in a specific domain to precisely denote particular concepts, phenomena, or objects. To illustrate this point, consider the field of Information Technology, where terms encompass the nomenclature of methods, architectures, models, programming languages, and related concepts. Similarly, in the domain of Medicine, terminology includes the nomenclature of diseases, symptoms, medications, diagnostic procedures, and other relevant concepts. It is important to note that abbreviations are also considered terms. Entities of the VALUE type are defined as numerical values accompanied by supplementary information, such as context or units of measurement. These quantitative or qualitative indicators are employed to describe specific data that can be measured or evaluated.

In the initial phase of the project, entities in the source texts were annotated using the “gpt-4o-mini” model [18]. The approach relied on cross-lingual transfer with one-shot prompting: the LLM prompt comprised an English-language instruction and a document with entity annotations in a specified format (see Appendix A). Each entity is assigned a unique identifier, and entities are annotated inline in the format [Entity|Identifier|Label]. The label TERM indicates that the highlighted word sequence is a term, while the label VALUE is used for numerical values. This compact markup format helps reduce the cost of API calls to LLMs.
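The inline markup described above can be turned back into plain text plus character offsets with a few lines of code. The following is a minimal sketch, not the authors' tooling: the function name `parse_annotations`, the identifier style (`T1`, `V1`), and the English example sentence are assumptions made for illustration.

```python
import re

# Matches the inline markup [Entity|Identifier|Label] with the two labels
# used in the annotation scheme (TERM and VALUE).
ENTITY_PATTERN = re.compile(r"\[([^\[\]|]+)\|([^\[\]|]+)\|(TERM|VALUE)\]")

def parse_annotations(marked_text):
    """Return (plain_text, entities); offsets refer to plain_text."""
    entities = []
    plain_parts = []
    cursor = 0   # read position in marked_text
    offset = 0   # write position in the plain text being rebuilt
    for match in ENTITY_PATTERN.finditer(marked_text):
        before = marked_text[cursor:match.start()]
        plain_parts.append(before)
        offset += len(before)
        surface, identifier, label = match.groups()
        entities.append({
            "id": identifier,
            "label": label,
            "text": surface,
            "start": offset,
            "end": offset + len(surface),
        })
        plain_parts.append(surface)
        offset += len(surface)
        cursor = match.end()
    plain_parts.append(marked_text[cursor:])
    return "".join(plain_parts), entities

# Hypothetical example; identifiers T1 and V1 are illustrative only.
example = "The [BERT|T1|TERM] model reached [92.5%|V1|VALUE] accuracy."
plain, found = parse_annotations(example)
```

Because the markup carries the entity text, identifier, and label in a single bracketed token, the model output stays short, which is consistent with the cost argument made above.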

The annotations were then manually corrected, and each text was independently reviewed by two annotators using Label Studio. A moderator resolved any remaining ambiguities by referring to the annotation guidelines (https://github.com/tvbat/sci-text-miner-scimdix (accessed on 17 July 2025)). The guidelines contain a comprehensive description of various cases and numerous illustrative examples with explanatory notes for the knowledge domains and languages under consideration. Annotation consistency was measured with a standard statistical measure, Cohen’s kappa coefficient [19]. The mean value obtained was 0.73, which indicates substantial inter-annotator agreement. The statistics for our dataset are presented in Table 1 and Table 2.
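For reference, Cohen’s kappa compares the observed agreement between two annotators with the agreement expected by chance. The sketch below assumes agreement is computed over aligned per-token labels; the paper does not state the unit of comparison, so this choice and the toy label sequences are assumptions.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Toy per-token labels ("O" marks tokens outside any entity).
annotator_1 = ["TERM", "TERM", "O", "O", "VALUE", "O", "TERM", "O"]
annotator_2 = ["TERM", "O", "O", "O", "VALUE", "O", "TERM", "TERM"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 6 of 8 tokens (0.75 observed agreement), but kappa is lower because part of that agreement is expected by chance.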

Specialized frequency dictionaries of terms have been compiled for each domain. The total number of unique terms is 3,949 for Russian and 4,594 for Kazakh. Figure 1 presents detailed information organized by domain. As can be observed, the Medical domain shows a greater discrepancy in the number of unique terms between Kazakh and Russian. This may be attributable to the linguistic features of the Kazakh language. For example, due to differences in word formation, a single medical term in Russian might correspond to multiple equivalents in Kazakh. Alternatively, because of semantic granularity, one term in Russian may cover a broad concept, whereas in Kazakh, distinct terms might be used for different aspects of that concept.

3.2. Relation Annotation

The first step in annotating relations between two entities within a text (document-level relations) is the identification of so-called morphological clusters. If the same term appears in different word forms (for example, in singular and plural forms or in different grammatical cases, which is relevant for Kazakh and Russian) or multiple times in the text, the identifiers of that term and its various word forms are merged into a single cluster. This stage of data preparation is referred to as deduplication. It was performed automatically using the “gemini-2.0-flash” model. In the next phase, relations were annotated automatically, again using the “gemini-2.0-flash” model. Texts in both languages were processed, with the prompts for the LLM written in English. In addition to the instructions, the prompt comprised a list of relation types with their descriptions, as well as examples of the required input and output formats. The prompt used for relation annotation is provided in Appendix B. The final stage of data preparation for relation extraction was the manual correction of the automatic annotations.
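The merging of identifiers into morphological clusters can be illustrated with a small union-find routine. This is only a sketch of the bookkeeping: in the study, the decision that two mentions are forms of the same term was made by the LLM, so the `same_term_pairs` input, the function name, and the identifiers below are assumptions.

```python
def merge_clusters(entity_ids, same_term_pairs):
    """Group entity identifiers into clusters given pairwise 'same term' links."""
    parent = {entity_id: entity_id for entity_id in entity_ids}

    def find(item):
        # Walk up to the cluster representative, shortening the path on the way.
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for first, second in same_term_pairs:
        root_first, root_second = find(first), find(second)
        if root_first != root_second:
            parent[root_second] = root_first

    clusters = {}
    for entity_id in entity_ids:
        clusters.setdefault(find(entity_id), []).append(entity_id)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical identifiers: T1, T3 and T5 are word forms of the same term.
ids = ["T1", "T2", "T3", "T4", "T5"]
links = [("T1", "T3"), ("T3", "T5")]
clusters = merge_clusters(ids, links)
```

Links are transitive here: if T1 matches T3 and T3 matches T5, all three end up in one cluster, so a relation annotated for any of them applies to the term as a whole.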

The annotation scheme of semantic relations includes six types: HAS_CHARACTERISTIC, HAS_PART, HAS_USE, HAS_VALUE, SUBCLASS_OF, and SYNONYM. A single entity can be involved in multiple relations concurrently, and multiple relations may appear in a single sentence. If a sentence contains several homogeneous entities that are semantically linked to another entity, the relation must be specified for each such pair. The relation classes were selected in accordance with the following principles:

The relation should link terms within scientific texts.
The relation should have an unambiguous interpretation.
The relation types should be broadly universal to cover any knowledge domain.
The relation names are taken from Wikidata (https://www.wikidata.org/wiki/Wikidata:Database_reports/List_of_properties/all (accessed on 17 July 2025)), since we plan to link terms with Wikidata entities in the future.

Table 3 presents the relation types with descriptions and Wikidata properties. Note that the HAS_CHARACTERISTIC relation differs from the HAS_PART relation in that HAS_PART (a meronymy relation) refers to components, not attributes. HAS_CHARACTERISTIC also differs from SUBCLASS_OF, which is a hyponymy relation that signifies class membership rather than properties. Table 4 and Table 5 provide the distribution of relation types across knowledge domains for both languages.

4. Entity Recognition Methods

In this study, we use the following methods as baselines for entity recognition: BERT, LLaMA, spaCy, and GLiNER. To evaluate the entity recognition models, the SciMDIX dataset was randomly split into 80% for training and 20% for testing.
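A random 80/20 split of this kind can be reproduced with a fixed seed. The sketch below is illustrative only: the seed value, the function name, and the use of integer document indices are assumptions, and the paper does not say whether the split was stratified by domain or language.

```python
import random

def train_test_split(documents, test_ratio=0.2, seed=42):
    """Shuffle a copy of the documents and cut off the test share."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split reproducible
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in for a list of annotated abstracts.
documents = list(range(100))
train_docs, test_docs = train_test_split(documents)
```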

4.1. BERT-Based Model

We adopted the method based on ruBERT [20] that obtained the best results for the extraction of scientific terms from texts in Russian [7]. The method combines a BERT model pre-trained on Russian texts with scientific term dictionaries collected in a semi-automatic manner, along with some heuristics. These heuristics include the following: terms contain no verbs or adverbs, and a preposition appearing as the first token of a term is removed, among several others.
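The two heuristics named above could be applied as a post-processing filter over candidate terms. The sketch below assumes each candidate arrives as a list of (token, part-of-speech) pairs with Universal Dependencies style tags (`ADP`, `VERB`, `ADV`), and that candidates containing verbs or adverbs are rejected outright; both are assumptions, since the source does not describe how the rules are enforced.

```python
def apply_term_heuristics(candidate):
    """Clean one candidate term given as (token, pos) pairs.

    Returns the cleaned pairs, or None if the candidate is rejected.
    """
    tokens = list(candidate)
    # Heuristic: a preposition (ADP) is removed when it is the first token.
    while tokens and tokens[0][1] == "ADP":
        tokens.pop(0)
    if not tokens:
        return None
    # Heuristic: terms contain no verbs or adverbs.
    if any(pos in {"VERB", "ADV"} for _, pos in tokens):
        return None
    return tokens

# English toy candidates used only for readability.
kept = apply_term_heuristics([("in", "ADP"), ("neural", "ADJ"), ("network", "NOUN")])
rejected = apply_term_heuristics([("quickly", "ADV"), ("trained", "VERB")])
```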

4.2. LLaMA-Based Model

We fine-tuned the model from [21] with 3B parameters, one of the most recent publicly accessible large language models (LLMs). During training, the AdamW optimizer with 8-bit precision was employed to accelerate computations. Furthermore, the Low-Rank Adaptation (LoRA) technique [22] was applied. This method makes fine-tuning LLMs more efficient by reducing the number of trainable parameters required for model adaptation, which in turn lowers the demands on memory and computational resources.
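The parameter saving behind LoRA comes from freezing a weight matrix W and training only two small factors, so that the adapted weight is W' = W + (alpha / r) · B A. The sketch below shows this arithmetic in plain Python; the layer size (3072 × 3072), the rank (16), and the scaling factor are hypothetical values chosen for illustration, not the settings used in the experiments.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

def matmul(left, right):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(l * r for l, r in zip(row, col)) for col in zip(*right)]
            for row in left]

def lora_effective_weight(w, b, a, alpha, rank):
    """W' = W + (alpha / rank) * B A, with W frozen and only B and A trained."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Hypothetical layer size and rank, chosen only to illustrate the saving.
full_params = 3072 * 3072
low_rank_params = lora_trainable_params(3072, 3072, rank=16)
ratio = low_rank_params / full_params  # share of the layer that is trained

# Tiny numeric check of the update rule.
w = [[1, 0], [0, 1]]
b = [[1], [2]]
a = [[3, 4]]
adapted = lora_effective_weight(w, b, a, alpha=2, rank=1)
```

For the hypothetical layer above, the trainable factors hold roughly 1% of the parameters of the full matrix, which is the source of the memory savings mentioned in the text.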

The model underwent training in four distinct domains: Information Technology, Linguistics, Medicine, and Psychology. Initially, each domain was trained separately, followed by joint training across both domains. The learning rate was set to $2 \times 10^{-4}$, with a batch size of 2. The training was conducted for 2 epochs, with a maximum of 180 steps per epoch. The experiments were conducted using a single NVIDIA Tesla A100 GPU with 80 GB of memory.

4.3. GLiNER

Furthermore, experiments were conducted using GLiNER (Generalist Model for Named Entity Recognition using Bidirectional Transformer) [23]. This model employs a smaller-scale bidirectional language encoder rather than a large autoregressive model. It treats NER as a matching task between entity type embeddings and textual span representations in latent space, rather than a generation task. This approach naturally resolves scalability issues and enables bidirectional context processing, leading to richer and more contextualized representations. The model displays remarkable cross-lingual robustness, which points to strong generalization capabilities.
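The matching formulation can be illustrated with toy vectors: each candidate span and each entity type is represented in the same latent space, and a span is assigned a type when the sigmoid of their dot product exceeds a threshold. The sketch below is a simplified illustration, not GLiNER's implementation; the two-dimensional vectors, the threshold of 0.5, and the omission of overlap resolution between spans are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def match_spans(span_vectors, type_vectors, threshold=0.5):
    """Score every (span, entity type) pair and keep those above the threshold."""
    matches = []
    for span, span_vec in span_vectors.items():
        for label, type_vec in type_vectors.items():
            score = sigmoid(dot(span_vec, type_vec))
            if score > threshold:
                matches.append((span, label, score))
    return matches

# Hand-made toy embeddings; a real model would produce these with its encoder.
type_vectors = {"TERM": [1.0, 0.0], "VALUE": [0.0, 1.0]}
span_vectors = {
    "neural network": [2.0, -2.0],
    "95%": [-2.0, 2.0],
    "the": [-2.0, -2.0],
}
matches = match_spans(span_vectors, type_vectors)
```

Because entity types enter only as vectors, adding a new type means adding one more embedding rather than retraining a classification head, which is what makes this formulation attractive for open label sets.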

4.4. SpaCy Model

As one of the NER baselines, we considered a CNN-based model with the spaCy tok2vec embedding layer [24]. This model produces token-level vector representations that capture both meaning and context. Furthermore, it adapts easily to diverse languages and knowledge domains.

The spaCy tok2vec model was evaluated on the prepared dataset in two experiments. In the first, training and testing were carried out separately for each domain-specific dataset. In the second, all domain datasets were combined for joint training and testing. The training parameters found to be optimal for this model on this dataset were: learning rate = 0.001, batch size = 1000, number of epochs = 100, eval_frequency (steps) = 200, dropout = 0.1. During training, overfitting was observed at epoch 59 (corresponding to 11,800 iteration steps) on the Russian data and at epoch 43 (corresponding to 8,600 steps) on the Kazakh data. The Adam optimizer was used to optimize the parameters. Among the models, spaCy tok2vec demonstrated the best overall performance.

5. Results of Entity Recognition

A comprehensive analysis of model effectiveness, sensitivity to language, and the impact of domain characteristics was conducted by analyzing the results obtained during experiments on entity recognition across four models (BERT, LLaMA, GLiNER, and spaCy). These models were evaluated on two languages (Kazakh and Russian) and four domains (IT, Linguistics, Medicine, and Psychology).
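The precision, recall, and F1 figures discussed in this section can be computed from sets of gold and predicted entities. The sketch below assumes exact-match scoring over (start, end, label) triples, which is a common convention; the paper does not specify its matching criterion, so this choice and the toy spans are assumptions.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match scores over collections of (start, end, label) triples."""
    gold_set, predicted_set = set(gold), set(predicted)
    true_positives = len(gold_set & predicted_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: 4 gold entities, 3 predictions, 2 of them exactly right.
gold_entities = [(0, 4, "TERM"), (10, 18, "TERM"), (25, 30, "VALUE"), (40, 52, "TERM")]
predicted_entities = [(0, 4, "TERM"), (25, 30, "VALUE"), (40, 50, "TERM")]
precision, recall, f1 = precision_recall_f1(gold_entities, predicted_entities)
```

Under exact matching, the third prediction counts as an error even though it overlaps a gold span, which is one reason boundary-sensitive scores can differ noticeably between models.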

The BERT-based model (see Table 6) demonstrates a strong dependency on language resources. The model performs significantly better on Russian, with an average F1 score of 68.88% across all domains, compared to only 34.50% for Kazakh. This discrepancy likely reflects the availability of richer pretrained embeddings and annotated data for Russian. Among the individual domains, BERT achieves its highest performance in the Linguistics domain for Kazakh (F1 = 41.37%) and in the IT domain for Russian (F1 = 74.71%). However, the overall results for Kazakh remain modest, highlighting the limitations of BERT in low-resource language settings.

The LLaMA model (see Table 7) provides a noticeable improvement over BERT, particularly for Kazakh. The average F1 score for Kazakh increases from 34.50% (BERT) to 53.36% (LLaMA), while Russian maintains a slightly higher average of 54.44%. Although the gap between languages persists, it is narrower than with BERT, suggesting that LLaMA is more adaptable to under-resourced contexts. These findings indicate that LLaMA can generalize better across domains and languages compared to BERT.

GLiNER results (see Table 8) demonstrate a significant enhancement in performance, particularly with regard to the Kazakh language. The mean F1 score for Kazakh is 75.37%, while Russian achieves a mean score of 79.65%. GLiNER demonstrates particular efficacy in the domains of Linguistics and Medicine for the Russian language and in the domain of Linguistics for Kazakh, indicating its proficiency in semantically rich and structurally regular texts. Furthermore, the model maintains a satisfactory balance between precision and recall, indicating consistent entity recognition across domain-specific variations.

The spaCy model (see Table 9) outperforms all other models by a large margin, achieving state-of-the-art results across both languages and all domains. For Kazakh, spaCy attains an average F1 score of 96.84%, while for Russian it reaches 97.36%. The model maintains high scores in both precision and recall, reflecting its robustness and reliability. Particularly noteworthy is its performance in the Linguistics+IT domain, where it achieves an F1 score of 97.94% for Kazakh and 98.30% for Russian. These results demonstrate spaCy’s strong ability to generalize across different languages and specialized domains, making it the most effective model in this study.

To sum up the findings, while BERT encounters challenges with low-resource languages such as Kazakh, LLaMA demonstrates moderate enhancements, and GLiNER exhibits substantial gains, particularly in domains characterized by linguistic richness. Notably, spaCy achieves near-perfect scores across various evaluation metrics, indicating minimal performance disparity between Kazakh and Russian. Furthermore, the study emphasizes the viability of spaCy and GLiNER for implementation in practical applications involving under-resourced languages.

6. Relation Extraction Methods

The extraction of relations is a key component in the process of structuring knowledge from scientific texts. In academic writing, semantic connections between entities such as concepts, terms, authors, methods, and findings are often implicitly expressed within varying contexts and across different levels of text granularity, ranging from individual sentences to whole paragraphs. This inherent complexity makes relation extraction particularly difficult. The issue is compounded in low-resource language settings, such as Kazakh, where few annotated corpora are available. To address these challenges, we employ a relation classification model based on zero-shot learning, which allows meaningful relations to be extracted even in the absence of training examples for the target relation types, and even in another language.

There are a number of approaches to the classification of relations between entities. The prevailing methodologies in this field depend on the utilization of supervised learning over fully annotated corpora, frequently employing transformer-based architectures. In contrast, alternative approaches involve the exploration of generative models that generate textual descriptions of relations given a contextual input. However, such techniques generally require substantial amounts of annotated data, which restricts their applicability in low-resource languages or niche scientific domains. Zero-shot learning constitutes an alternative strategy, whereby auxiliary semantic information is employed to facilitate the generalization of models to unseen classes or languages. In this particular instance, textual descriptions of relation types are utilized as the auxiliary semantic information. The latter approach is implemented in the model that was used in our study to transfer knowledge from Russian to Kazakh while working with scientific texts.

In order to evaluate the applicability of the proposed model for the classification of scientific relations, a series of experiments was conducted using SciMDIX, a newly introduced dataset in this paper containing Russian and Kazakh scientific documents annotated with entities and semantic relations.

6.1. Data Preprocessing

Transformer-based language models serve as the underlying encoder. Models such as BERT [14] and E5 [25] are widely recognized for their capacity to generate context-sensitive representations that account for the syntactic and semantic relations between words in a sentence. Vector representations are computed for each token in the sequence, from which three key components are extracted: firstly, the [CLS] token, which captures the global sentence context; secondly, the averaged embedding of all tokens belonging to the first entity; and thirdly, the averaged embedding of all tokens belonging to the second entity. The explicit bracketing of entity mentions in the input serves to guide the attention mechanism of the model, focusing it on entity-relevant parts of the input. These three vectors are then concatenated and processed through fully connected neural layers with non-linear activations, resulting in a unified feature vector that encodes both global and entity-specific information.
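The construction of the unified feature vector can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' implementation: token embeddings are given as nested lists with the [CLS] vector at index 0, entity spans are half-open token index ranges, and a single tanh layer with hand-picked weights stands in for the fully connected layers.

```python
import math

def mean_pool(vectors):
    """Average a list of equally sized vectors component by component."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def concat_features(token_embeddings, entity1_span, entity2_span):
    """Concatenate the [CLS] vector with the averaged vectors of both entities."""
    cls_vector = token_embeddings[0]
    entity1 = mean_pool(token_embeddings[entity1_span[0]:entity1_span[1]])
    entity2 = mean_pool(token_embeddings[entity2_span[0]:entity2_span[1]])
    return cls_vector + entity1 + entity2

def dense_tanh(inputs, weights, bias):
    """One fully connected layer with a tanh non-linearity."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, bias)]

# Toy 2-dimensional embeddings for 5 tokens; index 0 plays the role of [CLS].
token_embeddings = [[1.0, 0.0], [0.0, 2.0], [2.0, 2.0], [4.0, 0.0], [0.0, 0.0]]
combined = concat_features(token_embeddings, entity1_span=(1, 3), entity2_span=(3, 4))

# Hypothetical projection weights (2 outputs x 6 inputs).
weights = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]]
bias = [0.0, 0.0]
features = dense_tanh(combined, weights, bias)
```

With hidden size d, the concatenated vector has 3d components, so the first dense layer is what brings the representation back to the dimensionality used for comparison with relation descriptions.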

6.2. Model Architecture

This section presents a novel relation extraction method distinguished by its integration of transformer-based encoding, explicit entity marking, and multimodal learning. This integration yields a model capable of effectively generalizing to unseen relation types, adapting to new domains, and facilitating cross-lingual transfer.

Unlike the study [26], whose authors propose a Relation Contrastive Learning (RCL) method for zero-shot relation extraction, the method proposed in this study, in addition to RCL, uses multimodal feature spaces, including textual descriptions of relation classes. Both methods enhance the zero-shot capability of models but use different strategies: RCL focuses on separating classes in a learned high-dimensional space, while our method uses integrated semantic representations for improved relation extraction. The proposed method for building the RE model consists of the following main steps:

Constructing a unified feature vector representation leveraging entity embeddings and contextual information (see Section 6.1).
Generating contextualized vector representations of textual descriptions of relations using a pre-trained Transformer model (such as BERT or E5).
Building a shared feature space for two modalities using cross-entropy minimization or triplet loss techniques. The selected strategy is based on a multimodal representation that integrates both textual context and class-level semantics to construct a unified feature space for relation classification. Specifically, it combines the distributed representations of the sentence context, the two entity mentions, and the textual definitions of possible relation classes. These elements are encoded in unison to form rich, transferable semantics that support zero-shot inference by projecting both seen and unseen relation types into a shared representation space.
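The steps above can be illustrated with a toy version of the shared feature space: a sentence-level feature vector is compared with the embeddings of the relation descriptions, and the closest description gives the predicted class. The sketch below uses cosine similarity and a cosine-distance triplet loss with a margin of 0.5; the similarity measure, the margin, and the three-dimensional vectors are assumptions, since the source only names cross-entropy minimization and triplet loss as options.

```python
import math

def cosine(u, v):
    dot_product = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot_product / (norm_u * norm_v)

def predict_relation(feature_vector, description_vectors):
    """Pick the relation whose description embedding is most similar."""
    return max(description_vectors,
               key=lambda name: cosine(feature_vector, description_vectors[name]))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Pull the anchor toward the correct description and away from a wrong one."""
    distance_positive = 1.0 - cosine(anchor, positive)
    distance_negative = 1.0 - cosine(anchor, negative)
    return max(0.0, distance_positive - distance_negative + margin)

# Toy description embeddings; a real system would encode the textual definitions.
description_vectors = {
    "HAS_PART": [1.0, 0.0, 0.0],
    "SYNONYM": [0.0, 1.0, 0.0],
    "HAS_USE": [0.0, 0.0, 1.0],
}
feature_vector = [0.1, 0.9, 0.2]
predicted = predict_relation(feature_vector, description_vectors)
loss_correct = triplet_loss(feature_vector, description_vectors["SYNONYM"],
                            description_vectors["HAS_PART"])
loss_wrong = triplet_loss(feature_vector, description_vectors["HAS_PART"],
                          description_vectors["SYNONYM"])
```

Because classes are represented by description embeddings rather than by fixed output units, a relation type or language absent from training can still be scored at inference time, provided its description can be encoded.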

6.3. Training and Evaluation Details

The RE model was evaluated in two learning settings: a standard multi-class classification scenario and a zero-shot evaluation setup. In the first scenario, the model was trained and tested on the Russian-language subset of SciMDIX, which was split into training and test sets with an 80/20 ratio. In the second scenario, the model was trained on the same training set (consisting of texts in Russian) and applied to a cross-lingual relation classification task in Kazakh. The results in Table 10 show the performance of several baseline and advanced models in both fully supervised (Russian → Russian) and cross-lingual zero-shot (Russian → Kazakh) relation classification scenarios, measured by F1-score.

In the supervised setting, the R-BERT [27] model achieves the highest F1-score (0.687), which is consistent with its design for fine-grained token-level relation extraction using explicit entity markers and full supervision. However, its performance drops to 0.564 in the cross-lingual scenario, highlighting a common limitation of traditional supervised models when applied to unseen languages.

By contrast, the proposed model shows balanced performance across both settings. It achieves an F1-score of 0.639 on the Russian test set (second-best overall) and the highest F1-score (0.640) in the zero-shot setting. This finding suggests that the model generalizes well across language boundaries, likely due to its use of textual class descriptions and multimodal representation learning that aligns entity-context information with semantic type definitions in a shared embedding space.

Among alternative approaches, the E5 model performs surprisingly well in cross-lingual adaptation (0.637), exceeding its own score in the supervised setting (0.556) and nearly matching the proposed method. This suggests that encoder-only models pretrained with task-agnostic objectives may yield useful representations for transfer learning. However, the proposed model is the only one that scores consistently high in both scenarios without substantial degradation, making it a robust option for low-resource and cross-lingual scientific information extraction.
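The comparison across settings can be made explicit with a short calculation over the F1-scores in Table 10. The sketch below computes, for each model, the change from the supervised to the zero-shot setting and the lower of its two scores; the helper names are illustrative only.

```python
# F1-scores copied from Table 10: (Rus -> Rus, Rus -> Kaz).
TABLE_10 = {
    "BERT": (0.585, 0.536),
    "E5": (0.556, 0.637),
    "R-BERT": (0.687, 0.564),
    "Relation E5": (0.578, 0.590),
    "Ours": (0.639, 0.640),
}


def transfer_gap(scores):
    """Zero-shot F1 minus supervised F1 for every model (negative means degradation)."""
    return {model: round(kaz - rus, 3) for model, (rus, kaz) in scores.items()}


def worst_case(scores):
    """Lower of the two F1-scores for every model."""
    return {model: min(pair) for model, pair in scores.items()}


gaps = transfer_gap(TABLE_10)
floors = worst_case(TABLE_10)
most_stable = min(gaps, key=lambda model: abs(gaps[model]))
best_floor = max(floors, key=floors.get)

for model in TABLE_10:
    print(f"{model:12s} gap={gaps[model]:+.3f} worst={floors[model]:.3f}")
print("smallest absolute gap:", most_stable)
print("highest worst-case F1:", best_floor)
```

Under both criteria (smallest absolute gap and highest worst-case score) the proposed model ranks first among the five systems in Table 10, which is the sense in which its performance is described as consistent.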

These results underscore the importance of model architecture and training strategy in designing systems for multilingual relation extraction. In particular, leveraging external semantic knowledge through textual descriptions and multimodal integration appears to be a key factor in achieving reliable generalization to unseen relation types and languages.

In addition to its primary function of relation classification, the proposed model demonstrates potential for use in automatic or semi-automatic annotation of relations within newly developed scientific corpora. Its zero-shot architecture, combined with the ability to interpret textual descriptions of relation types, enables the model to identify likely semantic connections between entities in previously unseen domains. This makes it particularly valuable in contexts involving domain adaptation and limited annotated data resources.

Although this annotation scenario has not yet been validated through empirical experimentation, the model’s design and zero-shot performance indicate its potential suitability for such applications. Investigating the use of the proposed model to provide weak or auxiliary supervision for relation classification represents a promising direction for future research in scientific information extraction.

7. Discussion

This study contributes to information extraction from scientific documents, particularly in the context of multi-domain data and low-resource languages. The creation of a novel annotated dataset and the evaluation of entity recognition and relation extraction models offer insights into the challenges and opportunities in this area. However, a closer review of the findings reveals several limitations that warrant further discussion.

Our findings indicate that spaCy, a relatively lightweight model, often outperformed larger, more complex models such as BERT and LLaMA. This counterintuitive result can be attributed to several factors. Firstly, the spaCy library provides full-fledged models and pipelines for several production-oriented NLP tasks (https://spacy.io/models (accessed on 17 July 2025), https://spacy.io/usage/processing-pipelines (accessed on 17 July 2025)). Secondly, we fine-tuned spaCy’s NER model for Russian and Kazakh to further optimize its performance for the domains considered in the study. spaCy’s pre-trained models, though smaller, are often trained on large, general-purpose corpora, providing a strong foundation for transfer learning to scientific domains. In contrast, while LLaMA possesses extensive knowledge, its general-purpose nature may require substantial fine-tuning on domain-specific data to achieve optimal performance. The lack of sufficient domain-specific training data in our low-resource setting may have hindered LLaMA’s ability to adapt effectively. Furthermore, the computational cost of fine-tuning LLaMA is significantly higher than that of spaCy, making it less practical for many real-world applications. While GLiNER generally performs well, its comprehensiveness and accuracy may be limited for low-resource and non-Latin-script languages such as Kazakh. To draw more reliable conclusions and determine whether the observed performance differences between models are statistically significant, appropriate statistical testing should be conducted.
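One possible form of such testing is a paired bootstrap over test documents. The sketch below is an assumption about how this could be done, not a procedure used in the study: it takes per-document scores of two systems on the same documents, resamples documents with replacement, and reports the fraction of resamples in which system A fails to beat system B. The scores are synthetic placeholders, and averaging per-document F1 is a simplification of corpus-level F1.

```python
import numpy as np


def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Return (observed mean difference, share of resamples where A does not beat B)."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.shape != b.shape or a.ndim != 1:
        raise ValueError("scores must be 1-D arrays over the same documents")
    rng = np.random.default_rng(seed)
    n = a.size
    # Each row holds the document indices of one bootstrap resample.
    idx = rng.integers(0, n, size=(n_resamples, n))
    deltas = (a[idx] - b[idx]).mean(axis=1)
    observed = float((a - b).mean())
    p_value = float((deltas <= 0).mean())
    return observed, p_value


# Synthetic per-document scores for two hypothetical systems (not results from the paper).
rng = np.random.default_rng(42)
system_b = rng.uniform(0.5, 0.9, size=40)
system_a = system_b + rng.normal(0.03, 0.05, size=40)

observed, p_value = paired_bootstrap(system_a, system_b)
print(f"mean difference = {observed:.4f}, bootstrap p = {p_value:.4f}")
```

A small value of the second number indicates that the advantage of system A is stable under resampling of the test set; with the modest test sets typical of low-resource settings, this check is inexpensive to run for every pair of models.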

The annotation process itself introduces potential biases that could affect the quality of the dataset and the performance of the models trained on it. Annotators may interpret the annotation guidelines differently, leading to inconsistencies in the annotations. Another significant limitation of this study lies in the uneven distribution of entities and relations across the four domains in our dataset. The Medicine domain contains more entities than IT, Linguistics, or Psychology. Models trained and evaluated on this dataset may therefore favor entities prevalent in the Medicine domain, leading to inflated performance metrics for this domain and potentially underestimated performance in the other domains. Future work should address this issue by employing techniques such as data augmentation, re-sampling, or cost-sensitive learning to mitigate the impact of class imbalance.
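The extent of the imbalance, and one simple cost-sensitive remedy, can be sketched from the Russian entity counts in Table 1. The weighting scheme below (total divided by the number of domains times the domain count) is a common heuristic chosen here for illustration; it is an assumption, not a technique evaluated in this study.

```python
# Total entities per domain, Russian data (Table 1).
ENTITY_COUNTS_RU = {
    "IT": 1192,
    "Linguistics": 1523,
    "Medicine": 2393,
    "Psychology": 1482,
}


def domain_shares(counts):
    """Fraction of all entities that falls into each domain."""
    total = sum(counts.values())
    return {domain: count / total for domain, count in counts.items()}


def balanced_weights(counts):
    """Inverse-frequency weights: total / (number of domains * domain count)."""
    total = sum(counts.values())
    k = len(counts)
    return {domain: total / (k * count) for domain, count in counts.items()}


shares = domain_shares(ENTITY_COUNTS_RU)
weights = balanced_weights(ENTITY_COUNTS_RU)

for domain in ENTITY_COUNTS_RU:
    print(f"{domain:12s} share={shares[domain]:.3f} weight={weights[domain]:.3f}")
```

With these weights, each domain contributes equally to a weighted loss or a weighted re-sampling scheme: Medicine examples are down-weighted and IT examples are up-weighted.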

While our RE model demonstrates promising zero-shot capabilities, its domain transferability remains a key concern. The model’s performance relies heavily on the quality and relevance of the textual definitions of relation classes used during inference. If these definitions are ambiguous, incomplete, or not representative of the target domain, the model’s performance could suffer significantly. Furthermore, its ability to generalize to unseen relations is limited by its reliance on pre-trained language models and the knowledge they encode. If a relation is not well-represented in the pre-training data, our model might struggle to accurately identify it. Future research should focus on enhancing the model’s ability to transfer across domains. This could involve integrating domain-specific knowledge into the model or designing approaches for automatically generating more informative and precise relation definitions.

8. Conclusions

This paper presents a novel multi-domain dataset of scientific documents in Russian and Kazakh, annotated with entities and relations. We analyzed the results of entity recognition experiments with four models (BERT, LLaMA, GLiNER, and spaCy), evaluated on two languages (Kazakh and Russian) and four domains (IT, Linguistics, Medicine, and Psychology).

To solve the relation classification task under zero-shot learning conditions, we proposed a new method for building a language model that leverages a multimodal representation of the input data by combining the sentence context, the entity mentions, and the textual definitions of the relation classes. Exploring this method for generating weak or auxiliary supervision for relation classification is a promising avenue for future research in scientific information extraction.

Author Contributions

Conceptualization, methodology, project administration, and supervision, T.B.; software and investigation, N.M., B.S., N.S. and T.B.; data curation, N.M., A.Y., T.B. and M.T.; writing—original draft preparation, T.B., A.Y., N.M. and N.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science Committee of the Ministry of Science and Higher Education of the Republic of Kazakhstan (Grant No. AP22787186).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available on Zenodo https://doi.org/10.5281/zenodo.15882221 (accessed on 17 July 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Prompt for Entity Annotation

System prompt:

You are a scientific expert. Be concise and accurate.

User prompt:

You have a text provided by the user. Mark important scientific terms (with their IDs and labels) in the text. For labeling, use only the labels you’ve seen in the example. Do not change the original text, even commas.

“Introduction. Cardiac arrhythmias are the most common among the group of cardiovascular diseases (CVD), and have a risk of sudden cardiac death (SCD). Long QT syndrome (LQTS) is a heritable disease characterized by prolongation of the QT interval on an electrocardiogram (ECG), which often leads to syncope and SCD. Currently, identification of mutations in cardiac ion channel genes in patients with LQTS and recognition of genetic causes of the syndrome are actual in cardiology. Aim. To identify cardiac ion channel mutations in genes associated with long QT syndrome in Kazakhstani patients. Materials and methods. This study was designed as a cohort study. At present, our study has identified three patients with LQTS. Nevertheless, the recruitment of additional patients with LQTS for the study is ongoing. The Illumina TruSight Cardio panel was used for genetic screening. The cardiopanel consists of 174 genes associated with cardiac disorders, including LQTS. After a targeted sequencing, data analysis was carried out using the programs SureCall version 2.0.7.0 (Agilent Technologies, Santa Clara, CA, USA), ANNOVAR, GTK, bwa, bowtie, bow tie 2, VarScan, etc. Results. Clinically significant variants were found in patients with LQTS. Namely, in genes SCN5A (c.G5296A:p.E1766K) and KCNH2 (c.C662T:p.A221V). Both variants are pathogenic and cause CVDs, specifically LQTS. In addition, c.G3785A mutation (p.R1262Q), a variant of uncertain significance in SCN5A gene was detected in one patient. Although there is insufficient data to determine the role of the variant in the development of the disease. Conclusions. Screening for mutations in cardiac ion channel genes in patients with LQTS revealed clinically significant mutations. This research will be useful for Kazakhstani patients with LQTS in evaluation of required genetic testing and reliable genetic guidance to prevent SCD and distinguish between various arrhythmias.”

Assistant message:

“Introduction. [Cardiac arrhythmias|id1|TERM] are the most common among the group of [cardiovascular diseases|id2|TERM] ([CVD|id3|TERM]), and have a risk of [sudden cardiac death|id4|TERM] ([SCD|id5|TERM]). [Long QT syndrome|id6|TERM] ([LQTS|id7|TERM]) is a [heritable disease|id8|TERM] characterized by [prolongation of the QT interval|id9|TERM] on an [electrocardiogram|id10|TERM] ([ECG|id11|TERM]), which often leads to [syncope|id12|TERM] and [SCD|id13|TERM]. Currently, identification of mutations in [cardiac ion channel|id14|TERM] genes in patients with [LQTS|id15|TERM] and recognition of genetic causes of the syndrome are actual in [cardiology|id16|TERM]. Aim. To identify [cardiac ion channel|id17|TERM] [mutations in genes|id18|TERM] associated with [long QT syndrome|id19|TERM] in Kazakhstani patients. Materials and methods. This study was designed as a cohort study. At present, our study has identified [three patients|id20|VALUE] with [LQTS|id21|TERM]. Nevertheless, the recruitment of additional patients with [LQTS|id22|TERM] for the study is ongoing. [Illumina TruSight Cardio panel|id23|TERM] was used for [genetic screening|id24|TERM]. The [cardiopanel|id25|TERM] consists of [174 genes|id26|VALUE] associated with [cardiac disorders|id27|TERM] including [LQTS|id28|TERM]. After a [targeted sequencing|id29|TERM], [data analysis|id30|TERM] was carried out using the programs [SureCall|id31|TERM] [version 2.0.7.0|id32|VALUE] (Agilent Technologies, Santa Clara, California, USA), [ANNOVAR|id33|TERM], [GTK|id34|TERM], [bwa|id35|TERM], [bowtie|id36|TERM], [bow tie 2|id37|TERM], [VarScan|id38|TERM], etc. Results. Clinically significant variants were found in patients with [LQTS|id39|TERM]. Namely, in genes [SCN5A|id40|TERM] ([c.G5296A:p.E1766K|id41|TERM]) and [KCNH2|id42|TERM] ([c.C662T:p.A221V|id43|TERM]). Both variants are pathogenic and cause [CVDs|id44| TERM], specifically [LQTS|id45|TERM]. 
In addition, [c.G3785A mutation|id46|TERM] ([p.R1262Q|id47|TERM]), a variant of uncertain significance in [SCN5A |id48|TERM] gene was detected in [one patient|id49|VALUE]. Although there is insufficient data to determine the role of the variant in the development of the disease. Conclusions. Screening for mutations in [cardiac ion channel|id50|TERM] genes in patients with [LQTS|id51|TERM] revealed clinically significant mutations. This research will be useful for Kazakhstani patients with [LQTS|id52|TERM] in evaluation of required [genetic testing|id53|TERM] and [reliable genetic guidance|id54|TERM] to prevent [SCD|id55|TERM] and distinguish between various [arrhythmias|id56|TERM].”
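Output in this bracket format has to be parsed back into entity spans, and the instruction not to change the original text can be checked automatically. The following is a minimal sketch, assuming the `[text|idN|LABEL]` pattern shown in the example; the regular expression and function names are illustrative and are not taken from the authors' code.

```python
import re

# Matches [surface text|idN|LABEL], tolerating stray spaces around the id and label.
ENTITY_RE = re.compile(r"\[([^\[\]|]+)\|\s*(id\d+)\s*\|\s*([A-Z_]+)\s*\]")


def parse_entities(marked_text):
    """Return (surface text, id, label) triples in order of appearance."""
    return [(m.group(1), m.group(2), m.group(3))
            for m in ENTITY_RE.finditer(marked_text)]


def strip_markup(marked_text):
    """Remove the brackets, ids, and labels, keeping only the surface text."""
    return ENTITY_RE.sub(lambda m: m.group(1), marked_text)


def text_preserved(original_text, marked_text):
    """Check that the annotation did not alter the original text."""
    return strip_markup(marked_text) == original_text


# A fragment of the example above.
original = "Long QT syndrome (LQTS) is a heritable disease"
marked = ("[Long QT syndrome|id6|TERM] ([LQTS|id7|TERM]) "
          "is a [heritable disease|id8|TERM]")

entities = parse_entities(marked)
print(entities)
print(text_preserved(original, marked))
```

A check of this kind makes it possible to discard or re-request model responses in which the text was paraphrased or a label outside the allowed set was produced.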

Appendix B. Prompt for Relation Annotation

User message:

Consider this text with entity markup. “[INPUT]”

Separate the text into sentences with IDs: S1, S2, S3, etc.
For each sentence, identify relations. Each relation must connect entities with different IDs. Choose only one relation type per pair, if you have several variants.
Use only the following types of relations:
SUBCLASS_OF: Indicates a hierarchical relation where one entity is a specific type of another.
SYNONYM: Connects the full name of a term with its abbreviation or translation to another language.
HAS_CHARACTERISTIC: Indicates a property or attribute of an entity.
HAS_USE: Indicates that one entity has a specific application or usage of an entity.
HAS_PART: Indicates that one entity consists of another as its component.
HAS_VALUE: Indicates a quantitative value associated with an entity.
Collect relations from all sentences to a single list in a format like in the following example.

Input text example:

[Сердечные аритмии|T1|TERM] являются наиболее распространенным заболеванием в группе [сердечно-сосудистых заболеваний|T2|TERM] ([ССЗ|T3|TERM]). На данный момент наше исследование выявило [трех пациентов|T4|VALUE] с [СУУQT|T5|TERM]. Для [генетического скрининга|T6|TERM] использовалась панель – [Illumina TruSight Cardio panel|T7|TERM]. [Кардиопанель|T8|TERM] состоит из [174 генов|T9|VALUE], связанных с [сердечно-сосудистыми нарушениями|T10|TERM]. Кроме того, у [одного пациента|T11|VALUE] была обнаружена [мутация c.G3785A|T12|TERM]. [Расстройства аутистического спектра|T13|TERM] ([PAC471|T14|TERM]) – это [Неврологические расстройства|T15|TERM], характеризующиеся [проблемами в социальной коммуникации|T16|TERM]. [Дефицит 21-гидроксилазы|T17|TERM], вызванный [мутациями|T19|TERM] гена [CYP21A2|T20|TERM], составляет более [90-95% случаев|T21|VALUE] [врожденной дисфункции коры надпочечников|T22|TERM].

Output relations example:

R1 SUBCLASS_OF Arg1:T1 Arg2:T2
R2 SYNONYM Arg1:T2 Arg2:T3
R3 HAS_CHARACTERISTIC Arg1:T4 Arg2:T5
R4 HAS_USE Arg1:T6 Arg2:T7
R5 HAS_PART Arg1:T8 Arg2:T9
R6 HAS_CHARACTERISTIC Arg1:T9 Arg2:T10
R7 HAS_VALUE Arg1:T4 Arg2:T11
R8 HAS_CHARACTERISTIC Arg1:T13 Arg2:T14
R9 SUBCLASS_OF Arg1:T13 Arg2:T15
R10 HAS_CHARACTERISTIC Arg1:T15 Arg2:T16
R11 HAS_CHARACTERISTIC Arg1:T17 Arg2:T19
R12 HAS_CHARACTERISTIC Arg1:T19 Arg2:T20
R13 HAS_VALUE Arg1:T17 Arg2:T21
R14 HAS_CHARACTERISTIC Arg1:T21 Arg2:T22
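Relation lines in this format can be parsed and checked against the constraints stated in the prompt (allowed relation types only, different entity IDs, one relation type per entity pair). The following is a minimal sketch under those assumptions; the regular expression, the `Relation` tuple, and the error messages are illustrative and not part of the original pipeline.

```python
import re
from typing import List, NamedTuple

ALLOWED_TYPES = {"SUBCLASS_OF", "SYNONYM", "HAS_CHARACTERISTIC",
                 "HAS_USE", "HAS_PART", "HAS_VALUE"}

RELATION_RE = re.compile(r"^(R\d+)\s+([A-Z_]+)\s+Arg1:(T\d+)\s+Arg2:(T\d+)$")


class Relation(NamedTuple):
    rel_id: str
    rel_type: str
    arg1: str
    arg2: str


def parse_relations(output: str) -> List[Relation]:
    """Parse relation lines and enforce the constraints given in the prompt."""
    relations = []
    seen_pairs = set()
    for raw_line in output.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank lines between relations are ignored
        match = RELATION_RE.match(line)
        if match is None:
            raise ValueError(f"malformed relation line: {line!r}")
        relation = Relation(*match.groups())
        if relation.rel_type not in ALLOWED_TYPES:
            raise ValueError(f"unknown relation type: {relation.rel_type}")
        if relation.arg1 == relation.arg2:
            raise ValueError(f"relation {relation.rel_id} links an entity to itself")
        pair = (relation.arg1, relation.arg2)
        if pair in seen_pairs:
            raise ValueError(f"more than one relation for pair {pair}")
        seen_pairs.add(pair)
        relations.append(relation)
    return relations


sample = """R1 SUBCLASS_OF Arg1:T1 Arg2:T2
R2 SYNONYM Arg1:T2 Arg2:T3
R3 HAS_CHARACTERISTIC Arg1:T4 Arg2:T5"""

parsed = parse_relations(sample)
for relation in parsed:
    print(relation)
```

Rejecting malformed or constraint-violating lines at parse time keeps errors in generated annotations from propagating silently into the corpus.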

References

1. Liu, F.; Vulić, I.; Korhonen, A.; Collier, N. Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021), Online, 1–6 August 2021; pp. 565–574.
2. Zhang, Q.; Chen, Z.; Pan, H.; Caragea, C.; Latecki, L.J.; Dragut, E.C. SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024.
3. Piskorski, J.; Babych, B.; Kancheva, Z.; Kanishcheva, O.; Lebedeva, M.; Marcinczuk, M.; Nakov, P.; Osenova, P.; Pivovarova, L.; Pollak, S.; et al. Slav-NER: The 3rd Cross-lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across Slavic languages. In Proceedings of the Workshop on Balto-Slavic Natural Language Processing 2021, Kiev, Ukraine, 19–20 April 2021; pp. 122–133.
4. Linguistic Data Consortium. ACE 2005 Multilingual Training Corpus, 2006. LDC2006T06. Available online: https://catalog.ldc.upenn.edu/LDC2006T06 (accessed on 10 June 2025).
5. Zhang, Y.; Zhong, V.; Chen, D.; Angeli, G.; Manning, C.D. Position-aware Attention and Supervised Data for Slot Filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark, 9–11 September 2017; pp. 35–45.
6. Luan, Y.; He, L.; Ostendorf, M.; Hajishirzi, H. Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 3219–3232.
7. Bruches, E.; Tikhobaeva, O.; Dementyeva, Y.; Batura, T. TERMinator: A System for Scientific Texts Processing. In Proceedings of the 29th International Conference on Computational Linguistics. International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; pp. 3420–3426.
8. Loukachevitch, N.; Artemova, E.; Batura, T.; Braslavski, P.; Ivanov, V.; Manandhar, S.; Pugachev, A.; Rozhkov, I.; Shelmanov, A.; Tutubalina, E.; et al. NEREL: A Russian information extraction dataset with rich annotation for nested entities, relations, and wikidata entity links. Lang. Resour. Eval. 2024, 58, 547–583.
9. Yeshpanov, R.; Khassanov, Y.; Varol, H.A. KazNERD: Kazakh Named Entity Recognition Dataset. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022; pp. 417–426.
10. Loukachevitch, N.; Manandhar, S.; Baral, E.; Rozhkov, I.; Braslavski, P.; Ivanov, V.; Batura, T.; Tutubalina, E. NEREL-BIO: A Dataset of Biomedical Abstracts Annotated with Nested Named Entities. Bioinformatics 2023, 39, btad161.
11. Haisa, G.; Altenbek, G. Deep Learning with Word Embedding Improves Kazakh Named-Entity Recognition. Information 2022, 13, 180.
12. Tolegen, G.; Toleu, A.; Mussabayev, R. Voted-Perceptron Approach for Kazakh Morphological Disambiguation. In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), Marseille, France, 11–12 May 2020; pp. 258–264.
13. Lafferty, J.; McCallum, A.; Pereira, F.C. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), Williamstown, MA, USA, 28 June–1 July 2001; pp. 282–289.
14. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 3–5 June 2019; Volume 1, pp. 4171–4186.
15. Akhmetov, I.; Aubakirov, S.; Saparov, T.; Mussabayev, R.; Toleu, A.; Krassovitskiy, A. Machine Learning Methods for Kazakh Morphology: A Comprehensive Overview. In Proceedings of the 2024 IEEE 3rd International Conference on Problems of Informatics, Electronics and Radio Engineering (PIERE), Novosibirsk, Russia, 15–17 November 2024; pp. 1880–1884.
16. Mansurova, M.; Rakhimova, D. Morphological parsing of Kazakh texts with deep learning approaches. J. Math. Mech. Comput. Sci. 2024, 124, 48–58.
17. Kozhirbayev, Z.; Yessenbayev, Z. Named entity recognition for the Kazakh language. KazNU Bull. Math. Mech. Comput. Sci. Ser. 2020, 107, 57–66.
18. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 technical report. arXiv 2023, arXiv:2303.08774.
19. McHugh, M.L. Interrater reliability: The kappa statistic. Biochem. Medica 2012, 22, 276–282.
20. Kuratov, Y.; Arkhipov, M. Adaptation of deep bidirectional multilingual transformers for Russian language. arXiv 2019, arXiv:1905.07213.
21. Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The Llama 3 herd of models. arXiv 2024, arXiv:2407.21783.
22. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. ICLR 2022, 1, 3.
23. Zaratiana, U.; Tomeh, N.; Holat, P.; Charnois, T. GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Mexico City, Mexico, 16–21 June 2024; pp. 5364–5376.
24. Miranda, L.J.; Kádár, Á.; Boyd, A.; Van Landeghem, S.; Søgaard, A.; Honnibal, M. Multi hash embeddings in spaCy. arXiv 2022, arXiv:2212.09255.
25. Wang, L.; Yang, N.; Huang, X.; Jiao, B.; Yang, L.; Jiang, D.; Majumder, R.; Wei, F. Text embeddings by weakly-supervised contrastive pre-training. arXiv 2022, arXiv:2212.03533.
26. Wang, S.; Zhang, B.; Xu, Y.; Wu, Y.; Xiao, B. RCL: Relation Contrastive Learning for Zero-Shot Relation Extraction. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, WA, USA, 10–15 July 2022; pp. 2456–2468.
27. Wu, S.; He, Y. Enriching Pre-trained Language Model with Entity Information for Relation Classification. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM’19. New York, NY, USA, 3–7 November 2019; pp. 2361–2364.

Figure 1. Number of unique terms per domain by language.

Table 1. Entity statistics (data in Russian). Total values across all domains are highlighted in bold.

Domain	Texts	Sentences	Words	Average Words per Text	TERM	VALUE	Total Entities
IT	53	467	9771	183	1144	48	1192
Linguistics	49	511	12,456	254	1496	27	1523
Medicine	52	790	15,548	299	2003	390	2393
Psychology	53	480	11,641	220	1351	131	1482
Overall	207	2248	49,416	239	5994	596	6590

Table 2. Entity statistics (data in Kazakh). Total values across all domains are highlighted in bold.

Domain	Texts	Sentences	Words	Average Words per Text	TERM	VALUE	Total Entities
IT	53	475	8945	169	1173	43	1216
Linguistics	49	515	11,392	233	1711	28	1739
Medicine	52	781	14,314	275	2435	418	2853
Psychology	53	500	10,780	203	1530	122	1652
Overall	207	2271	45,431	220	6849	611	7460

Table 3. Relation descriptions.

Relation	Description	Wikidata Property
HAS_CHARACTERISTIC	Indicates a property, feature, or attribute of an entity	https://www.wikidata.org/wiki/Property:P1552 (accessed on 17 July 2025)
HAS_PART	Indicates that one entity includes another entity as a component	https://www.wikidata.org/wiki/Property:P527 (accessed on 17 July 2025)
HAS_USE	Indicates the purpose or application of an entity	https://www.wikidata.org/wiki/Property:P366 (accessed on 17 July 2025)
HAS_VALUE	Indicates a quantitative value associated with an entity	none
SUBCLASS_OF	Indicates a hierarchical relation in which one entity is a specific type of another	https://www.wikidata.org/wiki/Property:P279 (accessed on 17 July 2025)
SYNONYM	Connects the full name of a term with its abbreviation or translation	https://www.wikidata.org/wiki/Property:P5973 (accessed on 17 July 2025)

Table 4. Relation statistics (data in Russian). Total values across all relations are highlighted in bold.

Relation	IT	Linguistics	Medicine	Psychology	Total
HAS_CHAR	76	129	137	98	440
HAS_PART	79	85	24	41	450
HAS_USE	187	129	73	99	488
HAS_VALUE	29	19	243	65	356
SUBCLASS_OF	99	96	65	104	364
SYNONYM	35	7	69	16	127
Total	505	475	837	428	2246

Table 5. Relation statistics (data in Kazakh). Total values across all relations are highlighted in bold.

Relation	IT	Linguistics	Medicine	Psychology	Total
HAS_CHAR	25	154	224	156	559
HAS_PART	126	257	313	157	853
HAS_USE	135	108	100	90	433
HAS_VALUE	25	18	262	46	351
SUBCLASS_OF	68	67	95	31	261
SYNONYM	27	21	45	9	102
Total	406	625	1039	489	2559

Table 6. BERT results. The best values are highlighted in bold.

Language	Domain	Precision, %	Recall, %	F1 Score, %
Kazakh	IT	30.52	25.00	27.49
	Linguistics	43.84	39.16	41.37
	Linguistics+IT	37.97	32.26	34.89
	Medicine	29.82	33.84	31.70
	Psychology	39.62	35.11	37.23
	All domains	34.89	34.12	34.50
Russian	IT	72.93	76.59	74.71
	Linguistics	72.70	75.17	73.91
	Linguistics+IT	72.11	75.31	73.68
	Medicine	66.45	69.78	68.08
	Psychology	65.73	68.28	66.98
	All domains	67.22	70.62	68.88
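The F1 scores in this table are consistent with the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short check below recomputes two rows of Table 6 under that assumption; the function name and the tolerance of 0.01 percentage points (to absorb rounding of the published values) are illustrative choices.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (language, domain, precision %, recall %, reported F1 %) taken from Table 6.
ROWS = [
    ("Kazakh", "IT", 30.52, 25.00, 27.49),
    ("Kazakh", "Linguistics", 43.84, 39.16, 41.37),
]

for language, domain, precision, recall, reported in ROWS:
    computed = f1_from_pr(precision, recall)
    print(f"{language} {domain}: computed {computed:.2f}, reported {reported:.2f}")
```

The same relation need not hold exactly for scores that are averaged over classes or documents before being reported, since the harmonic mean of averaged precision and recall generally differs from the average of per-class F1 values.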

Table 7. LLaMA results. The best values are highlighted in bold.

Language	Domain	Precision, %	Recall, %	F1 Score, %
Kazakh	IT	39.37	33.75	35.53
	Linguistics	38.29	40.47	38.46
	Linguistics+IT	40.49	37.19	37.30
	Medicine	38.89	40.20	39.32
	Psychology	40.86	47.96	42.85
	All domains	54.83	54.84	53.36
Russian	IT	42.34	47.23	43.70
	Linguistics	49.75	55.93	51.95
	Linguistics+IT	49.01	50.43	48.41
	Medicine	57.00	45.25	50.42
	Psychology	35.64	53.06	41.94
	All domains	57.67	52.19	54.44

Table 8. GLiNER results. The best values are highlighted in bold.

Language	Domain	Precision, %	Recall, %	F1 Score, %
Kazakh	IT	55.98	42.92	48.33
	Linguistics	68.59	54.15	59.52
	Linguistics+IT	65.96	52.03	57.05
	Medicine	68.98	44.89	54.18
	Psychology	62.43	41.86	50.06
	All domains	80.99	73.20	75.37
Russian	IT	51.04	50.20	50.51
	Linguistics	70.56	62.17	64.24
	Linguistics+IT	58.94	56.28	56.45
	Medicine	80.33	54.28	64.58
	Psychology	62.13	57.90	59.26
	All domains	83.13	78.67	79.65

Table 9. SpaCy results. The best values are highlighted in bold.

Language	Domain	Precision, %	Recall, %	F1 Score, %
Kazakh	IT	90.96	76.26	82.96
	Linguistics	93.69	83.10	88.08
	Linguistics+IT	98.82	97.07	97.94
	Medicine	94.22	89.13	91.61
	Psychology	92.71	81.79	86.91
	All domains	97.75	95.95	96.84
Russian	IT	89.79	88.58	85.49
	Linguistics	92.48	84.24	88.17
	Linguistics+IT	98.47	98.13	98.30
	Medicine	88.69	83.72	86.13
	Psychology	92.01	84.62	88.16
	All domains	97.76	96.97	97.36

Table 10. Relation classification results in classic and zero-shot settings. The best values are highlighted in bold.

Model	SciMDIX (Rus → Rus), F1	SciMDIX (Rus → Kaz), F1
BERT	0.585	0.536
E5	0.556	0.637
R-BERT	0.687	0.564
Relation E5	0.578	0.590
Ours	0.639	0.640

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
