Integrating Thesaurus-Based Knowledge into Transformer Models for Semantic Understanding of Domain-Specific Texts

Abdygalym, Bayangali; Tazhibayeva, Saule; Sambetbayeva, Madina; Yerimbetova, Aigerim; Taberkhan, Roman; Abjalova, Manzura; Sabdenov, Aidos; Daiyrbayeva, Elmira

doi:10.3390/computers15050297


by Bayangali Abdygalym ¹,²,³, Saule Tazhibayeva ²,*, Madina Sambetbayeva ¹,²,³,*, Aigerim Yerimbetova ¹,⁴,*, Roman Taberkhan ², Manzura Abjalova ⁵, Aidos Sabdenov ⁶ and Elmira Daiyrbayeva ¹,⁷

¹ Institute of Information and Computational Technologies of the Committee Science of the Ministry of Science and Higher Education of the Republic of Kazakhstan, Almaty 050010, Kazakhstan

² Department of Information Systems, L.N. Gumilyov Eurasian National University, Astana 010000, Kazakhstan

³ School of Information Technology and Engineering, Astana International University, Astana 010000, Kazakhstan

⁴ School of Engineering and Information Technology, META University, Almaty 050012, Kazakhstan

⁵ Department of Theoretical and Applied Linguistics, Tashkent State University of Uzbek Language and Literature, Tashkent 100060, Uzbekistan

⁶ Department of Computer Engineering, International University of Information Technologies, Almaty 050000, Kazakhstan

⁷ Department of Software Engineering, Satbayev University, Almaty 050010, Kazakhstan

* Authors to whom correspondence should be addressed.

Computers 2026, 15(5), 297; https://doi.org/10.3390/computers15050297

Submission received: 17 March 2026 / Revised: 28 April 2026 / Accepted: 4 May 2026 / Published: 7 May 2026


Abstract

Integrating structured linguistic resources into deep learning architectures is a key challenge in domain-oriented NLP. This study proposes a framework for incorporating knowledge from a military thesaurus of the Ground Forces, structured according to the Zthes XML standard, into pre-trained transformer language models, including KazBERT, multilingual BERT, and XLM-RoBERTa. The approach addresses two interrelated tasks in specialized terminology processing: concept linking and semantic search. Unlike existing knowledge-injection methods designed primarily for general-domain applications, this framework formalizes the mapping of Zthes elements, such as Term, Broader Term, Narrower Term, Related Term, ScopeNote, Language, and Source, into structured textual representations that can be directly processed by transformer architectures. Fine-tuning is conducted on a dataset of 18,400 training instances automatically generated from the thesaurus, including synonym pairs, hierarchical relations (hypernymy and hyponymy), associative links, and definitional descriptions. Experimental evaluation shows that thesaurus-enriched models outperform baseline architectures across all major metrics. The XLM-RoBERTa model achieves F1 = 0.84 and Top-5 accuracy = 0.94 in the concept linking task, a five-point improvement over the baseline. The model reaches Macro-F1 = 0.84 across four relation types. Results obtained on a specialized test set derived from terminology databases of Kazakhstan’s Armed Forces confirm robust cross-lingual generalization across Kazakh, Russian and English military discourse.

Keywords:

thesaurus integration; transformer models; concept linking; semantic search; KazBERT; XLM-RoBERTa; Zthes; knowledge-enriched NLP; domain adaptation


1. Introduction

Domain-oriented natural language processing in security-critical, high-technology fields such as defense, aerospace and military operations raises a number of specific methodological problems. The military domain combines several properties that make it a particularly significant and complex context for Natural Language Processing (NLP) research [1]. First, in military discourse, terminological inaccuracy is not just an academic inconvenience; it can have operational consequences [2]. The same term, such as “strike asset”, “tactical unit” or “zone of responsibility”, can refer to different concepts depending on the doctrinal context, chain of command or level of operations (strategic, operational, or tactical). Such polysemy, deeply rooted in institutional use, creates a level of semantic complexity that general-purpose NLP models handle poorly [3]. This sharply distinguishes the military domain from, for example, legal or medical NLP [4,5], where terminological standards, although strict, are documented more uniformly and have been studied in corpus linguistics for decades. Second, the trilingual operating environment of the Armed Forces of the Republic of Kazakhstan (Kazakh, Russian and English) places high demands on cross-lingual consistency, which is especially difficult given the low-resource status of the Kazakh language [6,7] in the NLP community and the almost complete absence of public military corpora [8,9,10]. Third, unlike many specialized fields that either lack structured terminological resources or have long been the subject of intensive computational research, the military field occupies an unusual position: it has well-designed, institutionally supported thesauri and classification standards, but these resources have not yet been systematically integrated with modern neural architectures.
This combination of formalized terminological infrastructure, limited prior research, multilingual complexity and high accuracy requirements makes the military domain not just an application example, but a rigorous and meaningful test environment for the methodology proposed in this paper. However, the systematic integration of these terminological resources with modern neural language models remains insufficiently studied.

This limitation is multidimensional; the main factors are as follows:

Although the biomedical NLP community has made significant advances in integrating structured knowledge using resources such as UMLS and MeSH (e.g., KnowledgeBERT [11]; BioALBERT [12]), the integration of thesauri with transformer models in the military, defense and security fields has received little systematic study. Existing research in military NLP focuses mainly on named entity recognition and information extraction [13,14], while concept linking and semantic search based on structured terminological resources remain largely unexplored.
Structured terminological standards such as Zthes [15] or SKOS (Simple Knowledge Organization System) encode relational knowledge (hierarchical, associative, and definitional) in formal XML schemas that are not directly compatible with transformer architectures, which operate on linear text input. At present, there is no standardized pipeline for systematically transforming thesaurus elements, such as BT (Broader Term), NT (Narrower Term), RT (Related Term), and ScopeNote, into linguistically coherent text sequences suitable for fine-tuning models. This is a specific technical gap, distinct from the problem of data scarcity.
The lack of publicly available labeled datasets built from institutional terminological resources in multilingual specialized fields (for example, Kazakh–Russian–English military terminology) further limits reproducible research in this field. Most existing benchmarks for knowledge-enriched NLP are in English and focus on the general domain.

The core research contributions of this work are as follows:

A formalized mapping of the elements of the Zthes standard [15] into text representations compatible with transformer input is proposed, which allows structured knowledge to be injected systematically without changing the architecture of the base model.
A training dataset of 18,400 relational examples was built, automatically extracted from the thesaurus of the Ground Forces of the Armed Forces of the Republic of Kazakhstan and organized into six task types.
A comparative experimental study of three multilingual transformer models (KazBERT, mBERT and XLM-RoBERTa) was carried out under baseline and thesaurus-enriched conditions, using precision, recall, F1 and top-k accuracy as metrics.
A reproducible test protocol for concept linking in the military domain is proposed, demonstrating stable cross-lingual generalization across Russian, Kazakh and English.

2. Review of Literature

The emergence of large pre-trained language models based on the Transformer architecture [16] fundamentally changed the paradigm of natural language processing. The BERT [1] and RoBERTa [17] models have demonstrated high effectiveness across a wide range of tasks thanks to self-supervised pre-training on large text corpora. However, when applied to highly specialized domains, general-domain pre-training is insufficient, since it does not provide adequate coverage of professional terminology and the conceptual specificity of domain texts.

In response to this problem, domain adaptation through additional training of models on specialized corpora was proposed. The BioBERT [18] and PubMedBERT [19] models established this paradigm for the biomedical domain, demonstrating consistent improvements over general-domain models in named entity recognition, relation extraction, and question answering. SciBERT [20] extended this approach to scientific texts, while LegalBERT [21] adapted the transformer architecture to the legal domain. In the military context, it was shown in [8] that domain pre-training increases F1 by 8–12 points for entities of the “technics” and “compounds” types in Russian-language military texts. Nevertheless, these models share a fundamental limitation: they internalize the statistical patterns of texts but do not form explicit, structured representations of domain knowledge, such as the hierarchical, associative, and definitional relationships that are systematically recorded in controlled vocabularies and thesauri.

Existing approaches to enriching neural language models with external knowledge can be broadly divided into two directions: knowledge injection at the pre-training stage and knowledge injection at the fine-tuning stage. Within the first direction, the ERNIE model [22] was one of the first systems to inject knowledge at the entity level by masking knowledge-base entities during pre-training. The K-BERT architecture proposed including knowledge triples directly in the input sentence graph, whereas the KEPLER model [23] combined masked language modeling and knowledge representation learning in a single multi-task objective, aligning distributed and structured representations.

Among the methods of the second direction, retrieval-augmented architectures have become an important line of development. The RAG model [24] dynamically retrieves relevant knowledge during inference, and SapBERT [25] applies self-alignment objectives to synonym pairs from UMLS to train representations of biomedical entities that are robust to surface-form variability. Conceptually, this approach is closest to the strategy implemented in this work.

However, a significant limitation of these methods is that they are designed primarily for large general-purpose knowledge graphs, such as Wikidata, Freebase, or UMLS. They do not address the task of integrating a specialized, domain-controlled thesaurus with the complex hierarchical and associative structure characteristic of low-resource professional fields.

The use of thesauri and ontologies to enrich NLP systems has a solid theoretical basis in the science of knowledge organization. ISO 25964 [26] defines a thesaurus as a structured controlled indexing language that formalizes equivalence (USE/UF (Use-For)), hierarchical (BT/NT) and associative (RT (Related Term)) relations between concepts. The SKOS model [27], developed by the W3C consortium, serializes these relations in RDF format, which enables machine-readable integration of thesaurus structures into the semantic web infrastructure.

Recent literature shows a steady transition from static alphabetical dictionaries to concept-oriented, machine-processable knowledge structures. Salatino et al. [28] showed that the use of thesauri and ontologies significantly increases the suitability of information resources for artificial intelligence and semantic analytics systems. Kraus et al. [29] proposed an LLM-based pipeline for automated translation of SKOS thesauri between languages and noted a systematic decline in ontology matching quality for non-English resources, pointing to the need for specialized tools for working with multilingual thesauri.

In the applied military context, we developed a hybrid architecture that combines POS tagging and neural methods for automatic term extraction from military text corpora [9]. Teze and Nazaruka [30] proposed a semantic intersection method for cross-lingual alignment of thesauri in the defense field using transformer-based refinement. Ref. [31] described the Zthes-compliant schema of the trilingual terminological database of the Ministry of Defense of the Republic of Kazakhstan, which provides the immediate context for the present study.

Despite this progress, most existing approaches use thesauri primarily as auxiliary resources for pre-processing or as static reference systems. Hierarchical and associative thesaurus signals (BT/NT/RT) are almost never integrated directly into the representation learning of neural models.

The paradigm shift from sparse retrieval to dense retrieval, in which queries and documents are encoded in a shared vector space and ranked by cosine similarity, was proposed in [32] and further developed in the DPR [33] and ColBERT [34] architectures. The BEIR benchmark [35] showed that dense retrieval models trained on general-domain corpora lose significant quality when transferred to specialized domain collections, which confirms the need for domain-oriented representation learning.

In the military NLP context, the Multi_mil corpus [2], a multilingual resource in Kazakh, Russian and English, provides an empirical basis for evaluating such systems. A review published in IEEE Access [8] documents a steady increase in interest in named entity recognition, topic modeling and other NLP approaches to the analysis of military terminology. At the same time, its authors note the lack of structured thesaurus resources as one of the key factors limiting semantic interoperability and retrieval quality.

The problem of cross-language terminological alignment is particularly relevant in multilingual professional domains, where concepts do not have direct correspondences between languages due to doctrinal, organizational and cultural differences. In the military field, terms such as Squad or Platoon do not always have direct structural equivalents in various national military systems, which requires contextual and hierarchical alignment of concepts instead of direct lexical translation. NATO’s terminological infrastructure, including NATOTerm [36], demonstrates an institutional approach to multilingual terminology standardization, in which the thesaurus is not an auxiliary dictionary, but an element of the infrastructure of doctrinal and information compatibility.

At the computational level, multilingual models such as mBERT [37], LaBSE [38] and XLM-RoBERTa [39] provide zero-shot cross-lingual transfer for classification and retrieval tasks. However, like their monolingual counterparts, they do not use the structured equivalence and hierarchy signals characteristic of a controlled thesaurus. Thus, integrating SKOS-encoded multilingual thesaurus knowledge into the fine-tuning of cross-lingual transformer models remains a poorly researched direction.

The analysis of the literature reveals a multidimensional research gap. First, domain adaptation methods for transformer models, despite their empirical effectiveness, do not provide access to explicit structured terminological knowledge. Second, existing knowledge integration frameworks (ERNIE, K-BERT, KEPLER, and SapBERT) are focused mainly on large general-purpose knowledge graphs and do not solve the problem of integrating a controlled thesaurus with a full relational BT/NT/RT/UF structure. These relation types are not semantically interchangeable. BT and NT relations are inherently directional and asymmetric; they encode inclusion hierarchies in which the direction of the relation determines taxonomic depth and conceptual scope. A model that cannot distinguish between BT and NT will draw structurally inverted conclusions, treating the subordinate concept as superordinate, which directly affects the accuracy of concept linking. RT relations, by contrast, encode a non-hierarchical associative affinity that subject experts have explicitly recognized as semantically significant; such associations cannot be reliably recovered from distributional co-occurrence in low-resource corpora, where functionally related terms may rarely appear in shared contexts. Existing integration frameworks, including SapBERT and K-BERT, reduce all relational knowledge to a single undifferentiated similarity signal and therefore can teach a model only that two concepts are connected, not how they are connected. The proposed work removes this limitation by preserving the typed relational structure of the thesaurus throughout fine-tuning and treating BT, NT, and RT as distinct learning objectives rather than reducing them to a common proximity criterion.
Third, military NLP, despite the growing number of publications, still lacks structured thesaurus resources capable of simultaneously supporting semantic representation learning, information retrieval and cross-lingual interoperability. Fourth, the specific combination of trilingual coverage (Russian–English–Kazakh), a standardized thesaurus structure and integration with transformer architectures has received almost no attention in the existing literature.

This work aims to close this research gap. The article proposes and experimentally tests a methodology for integrating the structured trilingual domain thesaurus of Ground Forces terminology into the fine-tuning of a multilingual transformer model. The thesaurus provides relational BT/NT/RT signals and synonym pairs that are incorporated into the learning process through a specialized objective, making the model sensitive to the conceptual hierarchy and associative structure of military terminology. The proposed approach differs from previous work in the completeness with which the main types of thesaurus relations are integrated, its trilingual coverage and its systematic experimental evaluation on semantic search and concept linking in the military domain.

3. Ground Forces Thesaurus: Structure and Knowledge Extraction

The Ground Forces thesaurus used in the present study covers about 4500 concepts relating to various aspects of ground military activities, including Ground Forces equipment, organizational structures, operational concepts, terrain classification, fire support systems, communications and electronic warfare, as well as logistics and control elements.

Although this figure looks modest compared to large general-purpose knowledge bases such as Wikidata or UMLS, it is representative of institutionally maintained, operationally oriented domain thesauri. In particular, in the field of defense and military terminology, the reported size of national thesauri created within a single institution usually varies from about 300 to 1500 concepts [30], which places the resource in question in the middle range of this category. NATOTerm, the most comprehensive multilateral system of defense terminology, covers about 14,000 terms, a scale achievable only through decades of multinational editorial coordination and therefore not a realistic reference point for national-level resources. The 18,400 training examples derived from the thesaurus reflect a deliberately chosen low-resource experimental setting; their structured and typed relational nature distinguishes them from free-text annotations and makes each example a more informative signal than is usually the case in general-purpose fine-tuning datasets.

The Ground Forces Thesaurus was developed by a group of military subject matter specialists and terminologists affiliated with the Ministry of Defense of the Republic of Kazakhstan, as described in [31]. The construction process followed a structured, expert-oriented methodology in which candidate terms were first selected from authoritative doctrinal sources, including official military regulations, operational manuals and standardized catalogs of weapons and equipment published by the Armed Forces of the Republic of Kazakhstan. The selected terms were then verified and validated by subject experts with up-to-date military expertise to ensure doctrinal accuracy and operational relevance. Hierarchical relationships (BT/NT) were set on the basis of established doctrinal classification schemes and further checked for consistency with the official taxonomies of the organizational structure and armament of the Ground Forces. Associative relationships (RT) were determined by expert consensus among terminologists and reflect functionally and operationally significant conceptual relationships that cannot be reduced to hierarchical inclusion. Synonymous records (UF) were formed from variant designations found in official documents in Kazakh, Russian and English. Trilingual coverage was provided by parallel expert annotation rather than automatic translation, in order to preserve doctrinal equivalence and not just lexical correspondence. The resulting resource was encoded in Zthes XML format and validated for schema compliance prior to use in this study. No automated term extraction tools were used to construct the relational structure; BT, NT, and RT are expert terminological judgments, not statistically derived associations.

Table 1 summarizes the semantic role of each element of the Zthes standard and the corresponding text representation used within the proposed framework. The mapping is designed to preserve as much of the thesaurus's relational information as possible when constructing text sequences compatible with the tokenizers of all three transformer architectures under consideration.

The structured input sequence template for the thesaurus concept is as follows:

[CLS] [term] <basic_term> [SYN] <synonym_list> [BT] [NONE] <broader_term> [NT] <narrower_term> [REL] <related_term> [DEF] <definition> [SEP]

This template is a flat linear representation that is fully tokenized by the tokenizers of the transformer models used and is generated automatically from any Zthes-compatible XML export without manual data annotation. Special tokens (for example, [term], [SYN], [BT], and [NT]) are added to the model vocabulary at the fine-tuning step, and each is initialized as the average of the embeddings of the subwords that make up the corresponding token.
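As an illustration of how a Zthes-compatible XML export might be read before serialization, the following minimal sketch parses one term into a plain record. The element names (`term`, `termName`, `termLanguage`, `termNote`, `relation`, `relationType`) reflect our understanding of the Zthes XML schema and should be checked against the actual export; the sample entry and the `parse_zthes` helper are hypothetical and are not taken from the Ground Forces thesaurus.

```python
import xml.etree.ElementTree as ET

# Hypothetical Zthes-style export with a single term (illustrative content only).
SAMPLE_XML = """
<Zthes>
  <term>
    <termId>101</termId>
    <termName>main battle tank</termName>
    <termType>PT</termType>
    <termLanguage>en</termLanguage>
    <termNote>Heavily armoured tracked combat vehicle.</termNote>
    <relation>
      <relationType>BT</relationType>
      <termId>100</termId>
      <termName>armoured fighting vehicle</termName>
    </relation>
    <relation>
      <relationType>UF</relationType>
      <termId>102</termId>
      <termName>MBT</termName>
    </relation>
    <relation>
      <relationType>RT</relationType>
      <termId>150</termId>
      <termName>tank battalion</termName>
    </relation>
  </term>
</Zthes>
"""

# Relation types collected per record (assumed subset of the Zthes relation codes).
RELATION_KEYS = ("UF", "BT", "NT", "RT")


def parse_zthes(xml_text):
    """Return one dict per <term>, grouping related term names by relation type."""
    records = []
    root = ET.fromstring(xml_text)
    for term in root.iter("term"):
        record = {
            "term": (term.findtext("termName") or "").strip(),
            "language": (term.findtext("termLanguage") or "").strip(),
            "definition": (term.findtext("termNote") or "").strip(),
        }
        for key in RELATION_KEYS:
            record[key] = []
        for rel in term.findall("relation"):
            rel_type = (rel.findtext("relationType") or "").strip()
            name = (rel.findtext("termName") or "").strip()
            if rel_type in RELATION_KEYS and name:
                record[rel_type].append(name)
        records.append(record)
    return records


records = parse_zthes(SAMPLE_XML)
print(records[0])
```

Keeping each relation type in its own list preserves the typed structure (BT, NT, RT, UF) that the template later exposes through separate structural tokens.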

If a Zthes element is absent, the corresponding structural token is retained and the content slot is filled with a [NONE] placeholder. The truncation strategy removes values from the end of multi-valued fields, prioritized in reverse order of importance (RT first, followed by NT), and guarantees that at least one value is preserved for each non-empty field. Truncation was required for less than 3% of records (n = 15). The maximum input sequence length was 512 tokens; the 95th percentile of sequence length was 187 tokens and the 99th percentile was 412 tokens. It should be noted that, while the [NONE] placeholder mechanism functions correctly at the data-representation level, the downstream impact of systematically incomplete entries on model performance has not been empirically quantified in the present study. A controlled missing-field robustness evaluation is identified as a priority for future work (see Section 6).
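The placeholder and truncation rules can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the record layout, the `serialize` and `truncate` helpers, and the whitespace token counter (a stand-in for the real subword tokenizer) are hypothetical, and the position of UF after RT and NT in the truncation order is an assumption, since the text names only RT and NT.

```python
NONE = "[NONE]"

# Fields are trimmed in this order; only "RT first, then NT" comes from the text,
# the position of UF is an assumption made for this sketch.
TRUNCATION_ORDER = ("RT", "NT", "UF")


def serialize(record):
    """Render a record with the structural tokens of the template, using [NONE] for empty slots."""
    def slot(values):
        return ", ".join(values) if values else NONE

    parts = [
        "[CLS]", "[term]", record["term"] or NONE,
        "[SYN]", slot(record["UF"]),
        "[BT]", slot(record["BT"]),
        "[NT]", slot(record["NT"]),
        "[REL]", slot(record["RT"]),
        "[DEF]", record["definition"] or NONE,
        "[SEP]",
    ]
    return " ".join(parts)


def count_tokens(text):
    """Whitespace token count; a stand-in for the model's subword tokenizer."""
    return len(text.split())


def truncate(record, max_tokens=512, count=count_tokens):
    """Drop trailing values field by field until the sequence fits, keeping at least one value."""
    rec = {k: (list(v) if isinstance(v, list) else v) for k, v in record.items()}
    for field in TRUNCATION_ORDER:
        while count(serialize(rec)) > max_tokens and len(rec[field]) > 1:
            rec[field].pop()
    return rec


# Hypothetical record with an empty synonym list and several NT/RT values.
example = {
    "term": "tank", "UF": [], "BT": ["vehicle"],
    "NT": ["a", "b"], "RT": ["x", "y", "z"],
    "definition": "armoured vehicle",
}
short = truncate(example, max_tokens=15)
print(serialize(short))
```

With the artificially small budget of 15 whitespace tokens, the RT list is cut to one value before any NT value is removed, and the empty synonym slot is rendered as [NONE].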

Ablation experiments were conducted to quantify the contribution of each Zthes element to the structured input representation, using the XLM-RoBERTa model on the concept linking task (see Table 1) as the reference configuration. They show that the complete representation template consistently outperforms partial variants of the input sequence on all evaluated tasks. Starting from the term-only variant (F1 = 0.80), which corresponds to the baseline fine-tuning condition, adding synonymy information (SYN) yielded a moderate improvement to F1 = 0.81, which confirms that surface-form variability is a stable but limited source of gains in representation quality. Adding definitional context (DEF) raised F1 to 0.82, reflecting the contribution of semantic scope to concept disambiguation. Further incorporation of associative links (REL) yielded no measurable improvement at this level of precision, which indicates a marginal contribution of non-hierarchical associative signals to concept linking when definitional context and synonymy information are already present. The most pronounced improvement was observed with the addition of hierarchical signals: the combined inclusion of BT and NT tokens increased F1 to 0.84, an absolute increase of 0.02 over the previous configuration and the largest contribution among all individual elements. This is consistent with the relation classification analysis, where adding the [BT] and [NT] tokens raises hyponymy F1 from 0.68 to 0.79 for KazBERT. The complete template, combining all elements, reaches F1 = 0.84, which confirms that the hierarchical metadata of the thesaurus encode information that cannot be recovered from surface forms, synonyms, or definition text alone.

4. Methodology: Thesaurus BERT-MIL

The proposed framework is implemented as a three-stage pipeline, the architecture of which is shown in Figure 1. In the first stage, data preparation, the Ground Forces thesaurus in Zthes XML format serves as the only source of training material: the terms and structured descriptions of relations are extracted from it and used to form training pairs; class balance is ensured by adding random negative examples that serve as a contrastive signal. In the second stage, training, two models with fundamentally different architectures are optimized in parallel: a cross-encoder based on XLM-RoBERTa is trained as a six-class classifier of relation types on alternating term-description pairs, while a bi-encoder based on BGE-M3 is optimized on relation pairs using a soft similarity function. BGE-M3 serves as the bi-encoder component and is fixed across all experiments. KazBERT, mBERT and XLM-RoBERTa are evaluated as alternative cross-encoder backbones within the same pipeline. Upon completion of training, both models are used to construct dense vector embeddings of all thesaurus concepts together with their definitions. The third stage, indexing and reranking, is implemented as a two-stage retrieve-and-rerank pipeline: the BGE-M3 bi-encoder performs an efficient FAISS index search that returns the top-k = 300 candidates by cosine similarity of embeddings, after which the XLM-RoBERTa cross-encoder rescores the resulting candidate set by precise pairwise comparison, additionally using the four most likely candidates per relation type. The deliberate architectural separation of bi-encoders and cross-encoders is motivated by the efficiency–accuracy trade-off inherent in transformer-based retrieval.
A cross-encoder requires a full joint forward pass for every query–candidate pair, which is computationally prohibitive when scoring against all |T| ≈ 4500 concept entries at inference time. The bi-encoder, by pre-computing concept embeddings offline, reduces first-stage retrieval to a single query encoding followed by a FAISS nearest-neighbor search, completed in milliseconds. Reranking is then restricted to the top-k = 300 shortlist, where the cross-encoder’s higher-capacity pairwise scoring is tractable. The output of the pipeline is a ranked list of concept pairs labeled with five types of thesaurus relations: BT, NT, SYN, RT and LE. The key architectural decision is the combination of a bi-encoder, which provides computationally efficient search across the entire concept space, and a cross-encoder, which performs high-precision reranking of a narrow set of candidates; this balances the scalability of the system against the accuracy of concept linking in a limited, specialized domain.
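The control flow of the retrieve-and-rerank stage can be sketched in a few lines. This is a toy illustration under stated assumptions: the exhaustive cosine ranking stands in for the FAISS index search, `toy_pair_score` is a hypothetical substitute for the cross-encoder, and the two-dimensional vectors and concept names are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, concept_vecs, top_k):
    """First stage: rank every concept by cosine similarity (stand-in for the FAISS search)."""
    ranked = sorted(
        concept_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [concept_id for concept_id, _ in ranked[:top_k]]


def rerank(query_text, candidates, pair_score):
    """Second stage: rescore only the shortlist with a pairwise scorer."""
    return sorted(candidates, key=lambda cid: pair_score(query_text, cid), reverse=True)


def toy_pair_score(query_text, concept_id):
    """Hypothetical stand-in for the cross-encoder: 1.0 if the concept name occurs in the query."""
    return 1.0 if concept_id in query_text else 0.0


# Invented pre-computed concept embeddings (the bi-encoder output in the real pipeline).
concept_vecs = {"tank": [1.0, 0.0], "howitzer": [0.8, 0.6], "radio": [0.0, 1.0]}

shortlist = retrieve([0.9, 0.1], concept_vecs, top_k=2)
ranked = rerank("self-propelled howitzer battery", shortlist, toy_pair_score)
print(shortlist, ranked)
```

The expensive pairwise scorer is called only `top_k` times per query, independent of the size of the thesaurus, which is the efficiency argument made above for the shortlist of 300 candidates.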

4.1. Input Representation and Encoding

Let $T = \{t_1, t_2, \dots, t_n\}$ be the set of all concepts of the thesaurus, where each concept $t_i$ is associated with a structured record $r_i$ containing Zthes elements. The encoder $E_\theta$ is defined as a pre-trained transformer with parameters $\theta$. Let $s_i$ be the structured text sequence generated from the concept record $r_i$ according to the template described in Section 3 (that is, the flat tokenized serialization of the Zthes thesaurus fields: [TERM], [SYN], [BT], [NT], [REL], and [DEF]). Then, the concept embedding is defined as the [CLS] representation of the encoder output layer:

$$e_i = E_\theta(s_i)_{[CLS]} \in \mathbb{R}^d \qquad (1)$$

where $d$ is the dimension of the hidden layer (768 for BERT-base). In the concept linking problem, the model receives a query mention $m$, a text fragment from an operational document containing a terminological expression (for example, an abbreviation, paraphrase or a verbal term), and must identify the corresponding thesaurus entry $t_i \in T$. The mention $m$ may not coincide lexically with any canonical term of the thesaurus: the task is to match semantically close expressions, not exact strings. The embedding $q$ of the query mention is computed in the same way as the concept embedding (Equation (2)), and linking is carried out by nearest-neighbor search in the embedding space of all thesaurus concepts (see Equations (3) and (4)).

$$q = E_\theta(m)_{[CLS]} \in \mathbb{R}^d \qquad (2)$$

Concept linking is performed by calculating the cosine similarity between the query embedding and all precomputed concept embeddings:

$$\mathrm{sim}(q, e_i) = \frac{q \cdot e_i}{|q| \cdot |e_i|} \qquad (3)$$

$$t = \arg\max_{t_i \in T} \mathrm{sim}(q, e(t_i)) \qquad (4)$$

where $e(t_i) = E_\theta(s(t_i))_{[CLS]}$ is the embedding of the concept $t_i$, calculated by the encoder from the structured sequence $s(t_i)$. This makes explicit that $e_i$ is a function of $t_i$ through the intermediate textual representation $s_i$.
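One practical consequence of Equations (3) and (4) is that, if the concept embeddings are L2-normalized once offline, the cosine similarity at query time reduces to a plain dot product. The sketch below illustrates this with invented two-dimensional vectors; `l2_normalize` and `link` are hypothetical helper names, and the vectors merely stand in for the encoder's [CLS] outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def link(query_vec, concept_index):
    """Return the concept id with the highest cosine similarity and that similarity.

    `concept_index` must hold unit-length vectors, so the dot product equals Equation (3).
    """
    q = l2_normalize(query_vec)
    best_id, best_sim = None, float("-inf")
    for concept_id, e in concept_index.items():
        sim = sum(a * b for a, b in zip(q, e))
        if sim > best_sim:
            best_id, best_sim = concept_id, sim
    return best_id, best_sim


# Invented embeddings standing in for e(t_i); normalization happens once, offline.
toy_embeddings = {"t1": [3.0, 4.0], "t2": [1.0, 0.0], "t3": [0.0, 2.0]}
concept_index = {cid: l2_normalize(e) for cid, e in toy_embeddings.items()}

best_id, best_sim = link([6.0, 8.0], concept_index)
print(best_id, round(best_sim, 6))
```

The query vector here is a scaled copy of the embedding of `t1`, so the linked concept is `t1` with similarity 1 up to floating-point error, regardless of vector length.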

4.2. Loss Functions

The model is trained with a combined objective function comprising three losses:

$$L = \lambda_1 L_{syn} + \lambda_2 L_{hier} + \lambda_3 L_{rel} \qquad (5)$$

The synonym alignment loss ($L_{syn}$) uses contrastive learning with in-batch negatives. For each anchor concept embedding $e_i$, the positive example $e_i^{+}$ is the embedding of its synonym, that is, the concept linked to $t_i$ by the UF (use-for) relation in the thesaurus: $e_i^{+} = E_\theta(s(t_i^{UF}))_{[\mathrm{CLS}]}$. All other embeddings in the current minibatch (of size $B$) serve as negatives without additional annotation. The temperature parameter $\tau$ controls the sharpness of the distribution. This approach, also known as multiple negatives ranking loss, is computationally efficient because it removes the need to construct hard-negative pairs explicitly:

$$L_{syn} = -\log \frac{\exp(\mathrm{sim}(e_i, e_i^{+}) / \tau)}{\sum_{j} \exp(\mathrm{sim}(e_i, e_j) / \tau)} \tag{6}$$
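As an illustration of Equation (6), the following NumPy sketch computes the batch-averaged loss with in-batch negatives. It assumes that the denominator ranges over the positives of all B anchors in the minibatch (the usual formulation of multiple negatives ranking loss); the function names and the toy one-hot embeddings are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def l2_normalize(X):
    """Scale every row of X to unit length so that dot products equal cosine similarities."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def in_batch_contrastive_loss(anchors, positives, tau=0.05):
    """Mean of Equation (6) over a batch.
    anchors, positives: arrays of shape (B, d); row i of `positives` is the
    synonym embedding of row i of `anchors`. All other rows act as negatives."""
    A = l2_normalize(anchors)
    P = l2_normalize(positives)
    logits = (A @ P.T) / tau                              # (B, B) similarities / tau
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # diagonal = true pairs

# Toy check (assumed data): perfectly aligned pairs vs. deliberately misaligned pairs.
anchors = np.eye(4)
loss_aligned = in_batch_contrastive_loss(anchors, anchors)
loss_shuffled = in_batch_contrastive_loss(anchors, np.roll(anchors, 1, axis=0))
print(loss_aligned, loss_shuffled)
```

With a small temperature such as τ = 0.05, similarities are multiplied by 20 before the softmax, so even modest differences between the positive and the negatives dominate the loss.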

The hierarchical relation loss ($L_{hier}$) applies to the BT/NT pairs of the thesaurus. For each such pair ($t_{child}$, $t_{parent}$), three embeddings are computed: $e_{child} = E_\theta(s(t_{child}))_{[\mathrm{CLS}]}$, the embedding of the narrower concept; $e_{parent} = E_\theta(s(t_{parent}))_{[\mathrm{CLS}]}$, the embedding of the broader concept; and $e_{neg} = E_\theta(s(t_{neg}))_{[\mathrm{CLS}]}$, the embedding of a randomly selected concept $t_{neg} \in T$ that is not linked to $t_{child}$ by a BT/NT relation. The margin loss penalizes the model if the cosine similarity between the child and parent concepts does not exceed the similarity between the child and the negative example by at least the margin:

$$L_{hier} = \max(0, \mathrm{margin} - \mathrm{sim}(e_{child}, e_{parent}) + \mathrm{sim}(e_{child}, e_{neg})) \tag{7}$$
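A small numerical sketch of Equation (7) may help to see when the penalty is active. The 2-dimensional vectors below are invented for illustration, the function names are hypothetical, and the margin of 0.3 is the value reported later in this section.

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def hierarchical_margin_loss(e_child, e_parent, e_neg, margin=0.3):
    """Equation (7): zero once the parent is closer to the child than the
    negative by at least `margin`, positive otherwise."""
    return max(0.0, margin - cos(e_child, e_parent) + cos(e_child, e_neg))

e_child = np.array([1.0, 0.0])

# Case 1: parent nearly parallel to the child, negative orthogonal -> no penalty.
loss_ok = hierarchical_margin_loss(e_child, np.array([1.0, 0.1]), np.array([0.0, 1.0]))

# Case 2: roles swapped, the negative coincides with the child -> full penalty.
loss_bad = hierarchical_margin_loss(e_child, np.array([0.0, 1.0]), np.array([1.0, 0.0]))

print(loss_ok, loss_bad)
```

In the second case the loss equals margin + 1, the largest value reachable when the parent is orthogonal to the child and the negative is identical to it.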

The relation classification loss ($L_{rel}$) applies cross-entropy to a vector formed by concatenating three components: the embeddings $e_i$ and $e_j$ of the two concepts and their element-wise (Hadamard) product $e_i \odot e_j$, defined as $(e_i \odot e_j)_k = (e_i)_k \cdot (e_j)_k$ for each dimension $k$. The resulting vector $[e_i; e_j; e_i \odot e_j]$ has dimension $3d$ and is fed to a linear classification layer with weight matrix $W$:

$$P(r \mid t_i, t_j) = \mathrm{softmax}(W [e_i; e_j; e_i \odot e_j]) \tag{8}$$

$$L_{rel} = -\sum_{r \in R} y_r \log P(r \mid t_i, t_j) \tag{9}$$
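The feature construction and loss of Equations (8) and (9) can be sketched as follows. This is an illustrative NumPy version under stated assumptions: a toy dimension d = 4, a bias-free linear layer as written in Equation (8), and the relation inventory BT, NT, RT, UF; the function names are hypothetical.

```python
import numpy as np

RELATIONS = ["BT", "NT", "RT", "UF"]  # assumed label order

def pair_features(e_i, e_j):
    """Concatenate [e_i; e_j; e_i * e_j], giving a vector of dimension 3d."""
    return np.concatenate([e_i, e_j, e_i * e_j])

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def relation_loss(W, e_i, e_j, gold_index):
    """Equations (8) and (9) for a one-hot gold label: returns (loss, probabilities)."""
    probs = softmax(W @ pair_features(e_i, e_j))
    return float(-np.log(probs[gold_index])), probs

d = 4
rng = np.random.default_rng(0)
e_i, e_j = rng.normal(size=d), rng.normal(size=d)
W = rng.normal(scale=0.1, size=(len(RELATIONS), 3 * d))  # untrained toy weights

loss, probs = relation_loss(W, e_i, e_j, gold_index=RELATIONS.index("NT"))
print(loss, dict(zip(RELATIONS, probs.round(3))))
```

Note that the concatenation [e_i; e_j] is order-sensitive, so swapping the two concepts yields a different feature vector; this is what allows a classifier of this form to separate BT from NT for the same pair.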

All three loss functions are combined with equal weights ($\lambda_1 = \lambda_2 = \lambda_3 = 1/3$), chosen by tuning on the validation set. The strategy for generating negative examples is as follows. For $L_{syn}$, in-batch negatives are drawn from minibatches of size $B = 32$, and the temperature of the multiple negatives ranking loss is set to $\tau = 0.05$. For $L_{hier}$, one random negative $t_{neg}$ is sampled from $T$ after excluding all concepts linked to the anchor child concept via BT/NT/RT/UF; the margin is set to 0.3. For $L_{rel}$, negative pairs are drawn from non-overlapping thesaurus subtrees, with a 1:1 ratio of positives to negatives for each class. The choice of hyperparameters followed established practice for fine-tuning BERT-like models. The AdamW optimizer was chosen following Loshchilov & Hutter (2019) [40], who showed that it outperforms Adam with L2 regularization. The learning rate of 2 × 10⁻⁵ was selected by grid search over {1 × 10⁻⁵, 2 × 10⁻⁵, 3 × 10⁻⁵, 5 × 10⁻⁵} and lies within the range recommended by Devlin et al. (2019) [41] for fine-tuning BERT. The weight decay of 0.01 corresponds to the default value in standard implementations of transformer fine-tuning and provides sufficient regularization for datasets of this scale. The batch size of 32 balances GPU memory usage against the sampling quality of in-batch negatives for the contrastive synonym alignment objective, following the results of Gao et al. (2021) [42] for SimCSE-style training. Training ran for a maximum of 10 epochs with early stopping (patience = 3) on validation F1; quality plateaued between the 7th and 9th epochs in all model configurations.
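The exclusion rule for hierarchical negatives described above can be sketched with the standard library alone. The concept names, the `linked` dictionary (which here merges BT/NT/RT/UF neighbors into one set per concept) and the function name are hypothetical and serve only to illustrate the filtering step.

```python
import random

def sample_hier_negative(child, concepts, linked, rng):
    """Pick one random negative for `child`, excluding the child itself and
    every concept linked to it through BT, NT, RT or UF."""
    excluded = {child} | linked.get(child, set())
    candidates = [c for c in concepts if c not in excluded]
    if not candidates:
        raise ValueError(f"no admissible negative for {child!r}")
    return rng.choice(candidates)

# Toy thesaurus fragment (assumed): BMP-2 has a broader term and one related term.
concepts = ["BMP-2", "BMP-3", "infantry fighting vehicle", "tank", "howitzer"]
linked = {"BMP-2": {"infantry fighting vehicle", "BMP-3"}}

rng = random.Random(0)
negatives = [sample_hier_negative("BMP-2", concepts, linked, rng) for _ in range(20)]
print(sorted(set(negatives)))
```

Excluding all four relation types, rather than BT/NT alone, avoids penalizing the model for placing a synonym or an associated concept close to the anchor.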

4.3. Dataset Formation

The complete thesaurus dataset of 18,400 examples was split into training, validation, and test subsets by stratified sampling to preserve the proportional representation of all six relational components and all three languages. The split ratio is 80/10/10, which gives 14,720 training, 1840 validation and 1840 test examples. Concept-level separation between subsets was strictly enforced: no thesaurus concept present in the validation or test split was included in the training set. The validation split was used for early stopping (on F1), hyperparameter tuning (learning rate and λ weights) and epoch-wise model selection. In addition to the thesaurus-based evaluation, a separate held-out test set of 250 mention–concept pairs extracted from 200 annotated military operational documents was used as the primary benchmark for evaluating concept linking quality. This document test set was built independently of the thesaurus splits and contains real text mentions from military operational texts rather than thesaurus records, which makes it possible to estimate generalization to authentic in-domain use. Table 2 describes the six types of dataset components, and Table 3 gives the quantitative composition of the corpus (see Table 2, Table 3 and Table 4).
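One simple way to guarantee concept-level separation is to assign whole concepts, rather than individual examples, to the three subsets. The sketch below illustrates that idea only: it assumes each example carries a single anchor concept identifier, it does not reproduce the stratification by relation type and language, and the function name and synthetic data are hypothetical.

```python
import random

def split_by_concept(examples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split (concept_id, payload) examples so that no concept appears in
    more than one of train / val / test."""
    concept_ids = sorted({cid for cid, _ in examples})
    random.Random(seed).shuffle(concept_ids)
    n_train = round(ratios[0] * len(concept_ids))
    n_val = round(ratios[1] * len(concept_ids))
    assignment = {}
    for idx, cid in enumerate(concept_ids):
        if idx < n_train:
            assignment[cid] = "train"
        elif idx < n_train + n_val:
            assignment[cid] = "val"
        else:
            assignment[cid] = "test"
    splits = {"train": [], "val": [], "test": []}
    for cid, payload in examples:
        splits[assignment[cid]].append((cid, payload))
    return splits

# Synthetic data (assumed): 50 concepts with 4 examples each.
examples = [(f"c{c}", f"example-{c}-{k}") for c in range(50) for k in range(4)]
splits = split_by_concept(examples)
print({name: len(items) for name, items in splits.items()})
```

When concepts contribute unequal numbers of examples, a concept-level split only approximates the 80/10/10 ratio at the example level, so the resulting subset sizes should be checked after splitting.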

Table 4 describes the five tasks that make up the multi-task scheme for training and evaluating the Thesaurus BERT-MIL framework. The first four tasks (synonym detection, hyperonym prediction, relation classification and definition understanding) are training tasks: each corresponds to a separate learning signal extracted from the thesaurus structure and is optimized within the multi-task loss function (Equation (5)). Synonym detection makes the embeddings invariant to the lexical variation of terms. Hyperonym prediction shapes the hierarchical geometry of the embedding space. Relation classification teaches the model to distinguish between types of semantic relations (BT, NT, RT, and UF). Definition understanding aligns the embedding of a term with the vector representation of its definition. The fifth row, “Concept binding” (concept linking), describes the final inference task in which the trained encoder is applied to real text mentions from operational documents; the result is measured by ranking metrics (F1, Top-1 and Top-5).

5. Experiments and Results

The experimental design distinguishes two levels of evaluation. At the system level, the full Thesaurus BERT-MIL pipeline (Figure 1) employs BGE-M3 as a fixed bi-encoder for efficient FAISS-based candidate retrieval, and XLM-RoBERTa as a cross-encoder for final reranking. At the encoder level, the cross-encoder component is instantiated with three alternative backbones (KazBERT, mBERT, and XLM-RoBERTa) to assess how the choice of pre-trained architecture affects concept linking and relation classification performance. BGE-M3 is retained as the bi-encoder in all configurations and is not varied across experiments.

Three pre-trained transformer architectures are evaluated under two conditions: standard fine-tuning (basic ft), which uses only mention–concept pairs without structured thesaurus input, and thesaurus fine-tuning (ft + Thesaurus) within the full Thesaurus BERT-MIL pipeline. The base models are KazBERT, a BERT model pre-trained on Kazakh–Russian corpora; mBERT, Google's 12-layer multilingual checkpoint; and XLM-RoBERTa base, a multilingual model covering 100 languages. The test set contains 250 mention–concept pairs from 200 annotated operational documents (see Table 5). Table 5 reports the mean ± standard deviation across five runs. The significance of each thesaurus-enriched result against its baseline is tested using bootstrap resampling (N = 10,000; significance level α = 0.05) [43,44].
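A paired bootstrap test of the kind mentioned above can be sketched as follows. The exact procedure used in the study is not specified, so this is an assumption-laden illustration: it compares two systems on per-mention Top-1 correctness rather than F1, it uses synthetic 0/1 outcomes for 250 mentions that are not the paper's results, and the function name is hypothetical.

```python
import numpy as np

def paired_bootstrap_p_value(correct_base, correct_enriched, n_resamples=10_000, seed=0):
    """One-sided paired bootstrap: the fraction of resampled test sets on which
    the enriched system does NOT score higher than the baseline."""
    base = np.asarray(correct_base, dtype=float)
    enriched = np.asarray(correct_enriched, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(base)
    idx = rng.integers(0, n, size=(n_resamples, n))   # resample mentions with replacement
    deltas = enriched[idx].mean(axis=1) - base[idx].mean(axis=1)
    return float(np.mean(deltas <= 0.0))

# Synthetic outcomes (assumed): baseline correct on 170 mentions, enriched on 200.
base = [1] * 170 + [0] * 80
enriched = [1] * 200 + [0] * 50

p_value = paired_bootstrap_p_value(base, enriched)
print(p_value < 0.05)
```

Resampling the same mention indices for both systems keeps the comparison paired, so the test reflects per-mention differences rather than the overall difficulty of a particular resample.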

It is noteworthy that the absolute gain in F1 from thesaurus enrichment is stable across all three architectures (+0.04 F1, +0.05 Top-1 accuracy). We interpret this pattern as evidence that the relational information encoded in the thesaurus (BT/NT/RT) forms a task-specific signal that is orthogonal to overall model capacity: since none of the three pre-trained models had access to the terminological hierarchy during pre-training, the structured input template provides approximately the same additional information regardless of encoder size. This conclusion is consistent with the theoretical justification of knowledge-injection approaches, which holds that relational domain knowledge cannot be approximated by distributional pre-training and therefore yields a stable, architecture-neutral improvement when injected directly into the input representation.

Table 5 presents the complete results of the concept linking evaluation. Figure 2 visualizes precision, recall and F1 across all configurations. Thesaurus enrichment consistently improves all metrics for all three architectures.

The best configuration, XLM-RoBERTa with thesaurus enrichment, reaches F1 = 0.84 and Top-5 = 0.94. KazBERT shows the largest absolute increase (+4 points F1), which we attribute to its limited pre-training vocabulary and, accordingly, the greater informational value of the structured thesaurus context.

Figure 3 presents Top-1 and Top-5 accuracy for concept retrieval. All thesaurus-enriched models exceed a Top-5 accuracy of 0.90, and XLM-RoBERTa reaches 0.94.

The gap between Top-1 and Top-5 (~0.14–0.17 for all models) reflects genuine terminological ambiguity in the test set and motivates the candidate reranking step in the production pipeline.
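For reference, Top-k accuracy counts a mention as correct when its gold concept appears among the first k ranked candidates. The sketch below uses invented rankings for four mentions; the function name and data are hypothetical.

```python
def top_k_accuracy(ranked_candidates, gold, k):
    """Share of mentions whose gold concept is among the first k candidates."""
    hits = sum(1 for cands, g in zip(ranked_candidates, gold) if g in cands[:k])
    return hits / len(gold)

# Toy rankings (assumed): one ranked candidate list per mention, best first.
ranked = [["a", "b", "c"],
          ["b", "a", "c"],
          ["c", "b", "a"],
          ["a", "c", "b"]]
gold = ["a", "a", "a", "b"]

top1 = top_k_accuracy(ranked, gold, k=1)
top2 = top_k_accuracy(ranked, gold, k=2)
top3 = top_k_accuracy(ranked, gold, k=3)
print(top1, top2, top3)
```

The difference between the Top-1 and Top-k values is exactly the share of mentions that a perfect reranker over the top k candidates could still recover.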

Table 6 and Figure 4 present the results of the relationship classification.

Table 7 reveals several patterns. First, when used alone, $L_{syn}$ yields the best single-task result (F1 = 0.78); adding $L_{hier}$ improves F1 by +0.03, and the full three-task combination yields a further +0.03. Second, for relation classification, $L_{rel}$ alone achieves a Macro-F1 of 0.79, and the three-task model outperforms it by +0.05. Third, every two-task combination outperforms the best single-task alternative, which supports the hypothesis of cross-task regularization.

Table 8 shows a clear performance gradient across relation types. Synonym search achieves the highest F1 score (0.91). BT search (F1 = 0.85) benefits from the explicit [BT] token, and definition-based queries (F1 = 0.86) benefit from the [DEF] field. The two most difficult categories are NT search (F1 = 0.77) and RT search (F1 = 0.72). NT errors are predominantly NT–NT confusions, in which several hyponyms share a single parent BT concept and similar definition text (e.g., BMP-2 and BMP-3 as NTs of “infantry fighting vehicle”). RT errors are frequently confusions between RT and NT, because operationally related concepts may lie in adjacent subtrees of the hierarchy. These findings point to the need for hard-negative mining strategies that explicitly target confusable pairs within a subtree.

The most pronounced improvement from thesaurus context injection is observed in the distinction between hyponymic (NT) and hyperonymic (BT) relations, whose direction the baseline models often confuse. Adding [BT] and [NT] tokens to the input representation raises hyponymy F1 from 0.68 to 0.79 for KazBERT. This confirms that the hierarchical metadata of the thesaurus carries information that cannot be inferred from the surface forms of terms.

Figure 5 presents a heat map of the absolute gain from thesaurus enrichment for all models and tasks. The gain is remarkably homogeneous (+0.03–0.05 in absolute terms), which indicates that the benefit of structured knowledge injection is robust and independent of the specific task. The largest single gain (+0.05) is observed in the Top-1 accuracy of hyperonym prediction for all three models.

All three models were fine-tuned on thesaurus-derived training pairs covering Russian, Kazakh, and English. For cross-lingual evaluation, XLM-RoBERTa and mBERT, both of which include Kazakh and English in their pre-training corpora, were additionally evaluated on Kazakh and English test mentions without any language-specific adaptation, which tests zero-shot cross-lingual transfer. KazBERT, whose pre-training is limited to Kazakh and Russian, was evaluated only on Russian and Kazakh test mentions; the English score for this model was excluded because zero-shot transfer to English is not supported by its pre-training. The cross-lingual F1 results are presented separately in the Interlingual Generalization Analysis (see Figure 4). The narrow standard deviations (σ ≤ 0.011 across all metrics and configurations) indicate that the framework’s behavior is stable with respect to the random seed. Bootstrap resampling was used to verify the statistical significance of the improvements from thesaurus enrichment at a significance level of α = 0.05.

Computational Efficiency

Although the primary focus of this study is retrieval quality, we report indicative efficiency figures for completeness. The bi-encoder FAISS retrieval stage operates over an offline-indexed corpus of ~4500 concept embeddings (d = 768) and completes in 2–4 ms per query on CPU, making it negligible in the pipeline. The cross-encoder XLM-RoBERTa reranking stage, which performs pairwise scoring of top-k = 300 candidates, requires approximately 1.8–2.2 s per query with single-sample inference on a V100 GPU. Through batch inference (batch size = 32) and fp16 precision, latency can be reduced to 150–250 ms per query, which is operationally acceptable for interactive terminology lookup. FAISS index construction for the full concept corpus takes under 30 s and needs only to be performed once per thesaurus update. These figures confirm that the framework is deployable within domain-specific NLP systems operating on moderate-scale terminological inventories. For larger inventories, approximate FAISS indices (e.g., IVF) and tighter candidate pruning (top-k = 50) are straightforward extensions. We leave a systematic latency–accuracy trade-off study to future work.
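The two-stage structure described above (dense retrieval of top-k candidates followed by batched reranking) can be sketched without any external index library. In this illustration a brute-force inner-product search stands in for the FAISS index, the scoring function is a toy placeholder for the cross-encoder, and all names and the random embeddings are assumptions.

```python
import math
import numpy as np

def retrieve_top_k(q, E, k):
    """Brute-force inner-product search over unit-length rows of E
    (a stand-in for an exact flat index). Returns indices, best first."""
    scores = E @ q
    k = min(k, len(scores))
    top = np.argpartition(-scores, k - 1)[:k]
    return top[np.argsort(-scores[top])]

def rerank(candidates, score_fn, batch_size=32):
    """Score candidates batch by batch with `score_fn` and sort by score."""
    scores = []
    for start in range(0, len(candidates), batch_size):
        scores.extend(score_fn(candidates[start:start + batch_size]))
    order = np.argsort(-np.asarray(scores))
    return [candidates[i] for i in order]

rng = np.random.default_rng(0)
E = rng.normal(size=(1000, 16))
E /= np.linalg.norm(E, axis=1, keepdims=True)   # unit-length concept embeddings
q = E[42].copy()                                # query identical to concept 42

candidates = retrieve_top_k(q, E, k=300).tolist()
toy_scorer = lambda batch: [float(E[i] @ q) for i in batch]  # placeholder scorer
reranked = rerank(candidates, toy_scorer, batch_size=32)
n_batches = math.ceil(len(candidates) / 32)
print(reranked[0], n_batches)
```

With k = 300 and a batch size of 32, reranking takes ten forward passes per query instead of 300, which is the source of the latency reduction reported for batched inference.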

6. Discussion

The claim that the proposed framework applies to any BERT-like architecture rests on three architectural properties common to all models of this family. First, the subword tokenizers of all BERT-like models support vocabulary extension, which allows structural tokens ([BT], [NT], [REL], etc.) to be added without altering the existing model weights. Second, all BERT-like encoders produce a [CLS] representation that serves as a sequence-level embedding; the framework relies solely on this representation for both training (through the three loss functions) and inference (through cosine similarity in search), which makes it independent of the depth or width of the model. Third, the loss functions are standard differentiable objectives (contrastive, margin, and cross-entropy) applicable to any embedding produced by the encoder. This architectural neutrality is consistent with the results of similar input-level injection frameworks: SapBERT [25] and SimCSE [42] have been successfully applied to several architecturally different BERT-like encoders. The stability of the improvement observed in our experiments on three architecturally different models (KazBERT, mBERT, and XLM-RoBERTa) provides direct empirical support for this claim.

Multi-task learning that combines synonym alignment, hierarchical relation prediction, and relation classification objectives achieves better results than single-task fine-tuning on each task individually.

The advantage of multi-task learning stems from cross-task regularization: joint training on synonym alignment ($L_{syn}$) and hierarchical relation prediction ($L_{hier}$) prevents synonymous embeddings from collapsing in ways that violate the hierarchical order, while the relation classification objective ($L_{rel}$) forces the encoder to maintain a discriminative geometry between BT, NT and RT pairs. The present study compared single-task fine-tuning for synonym alignment only ($L_{syn}$), single-task fine-tuning for relation classification only ($L_{rel}$) and full multi-task fine-tuning with the combined objective ($L = L_{syn} + L_{hier} + L_{rel}$, weighted as in Equation (5)). Multi-task fine-tuning consistently surpasses both single-task variants in concept linking F1 and relation classification Macro-F1 across all three model architectures. The gain from multi-task training (about +0.05–0.07 F1 relative to the best single-task variant) reflects this regularization: simultaneous optimization for synonym alignment and hierarchical relation prediction prevents the encoder from adopting representations that are optimal for one objective but suboptimal for another, yielding a geometrically more consistent embedding space that supports both search and classification. This mutual gain is well documented in the literature on multi-task NLP [44,45,46].

The per-class performance gradient reported in Table 8, SYN (F1 = 0.91) > DEF (0.86) > BT (0.85) > NT (0.77) > RT (0.72), merits explicit interpretation. This ordering is not a sign of framework inadequacy; it reflects the intrinsic semantic properties of the respective relation types. Synonym relations are directly and exhaustively specified in the thesaurus UF records, and the synonym alignment loss ($L_{syn}$) trains the model specifically to align their embeddings, producing high absolute performance. NT errors arise principally from co-hyponym confusion (multiple narrower terms sharing the same parent BT node and similar definitional context), a problem that would require explicit hard-negative mining within subtrees to reduce further. RT errors reflect the fundamental semantic underspecification of associative relations under ISO 25964; RT links are expert-asserted and non-hierarchical, and operationally related concepts frequently occur in adjacent subtrees, making them genuinely difficult to distinguish from NT links using surface and hierarchical cues alone. Crucially, the framework’s central claim is evaluated on the improvement relative to standard fine-tuning baselines, not on absolute per-class F1. Thesaurus enrichment improves NT F1 from 0.68 to 0.79 (KazBERT) and yields consistent gains for RT across all architectures, confirming that the typed relational signal provides meaningful information even for the hardest relation category. We identify subtree-aware hard-negative mining as the most promising avenue to further close the SYN–RT performance gap.

Despite these results, the study has a number of limitations. First, the test set was built from a single institutional thesaurus and therefore may not fully reflect the terminological diversity of the various branches of the military. Second, the current version of the thesaurus has relatively limited coverage of rapidly developing areas such as electronic warfare, cyber operations and space systems, whose terminology is still evolving. Third, standard precision and recall metrics computed against a fixed gold standard do not fully reflect the practical value of the system in semantic search tasks, where the relevance of results is often graded. Future work should therefore develop an evaluation protocol based on graded relevance that involves expert assessment of search results.

Fourth, the training dataset, although comprising 18,400 examples, was generated automatically from a single thesaurus of approximately 500 concepts. The relatively small number of distinct concepts means that the model’s generalization to novel, out-of-thesaurus terminology has not been empirically tested. While the cross-lingual evaluation provides partial evidence of generalization, extension to new concept types would require retraining or incremental fine-tuning. Fifth, the framework assumes that the input thesaurus is correctly structured and validated according to the Zthes schema. In practice, real-world terminological resources may contain inconsistencies in BT/NT assignments or redundant RT links, and the sensitivity of the model’s relation classification performance to such annotation noise has not been evaluated. Sixth, the reranking step relies on a fixed candidate set size (top-k = 300 from the bi-encoder). The sensitivity of final performance to the choice of k—and whether this threshold remains appropriate as the thesaurus scales—was not systematically studied. Finally, although the framework is described as applicable to any BERT-like architecture, the empirical evaluation is limited to three models (KazBERT, mBERT, and XLM-RoBERTa). Larger-scale multilingual models such as XLM-RoBERTa-large or mDeBERTa, which may offer higher representational capacity, were not evaluated due to computational constraints.

In addition, the framework’s [NONE] placeholder mechanism for absent Zthes fields was validated descriptively (fewer than 3% of records required truncation), but no controlled experiment was conducted to assess the degradation in concept linking or relation classification performance as a function of field completeness. In practice, newly created or partially curated thesaurus entries may lack ScopeNotes, associative links, or narrower terms. Future work should include a systematic ablation in which fields are progressively masked at inference time (e.g., removing RT, then NT + RT, then all fields except Term and SYN) to characterize the model’s graceful degradation profile and identify which missing fields are most detrimental to each task.
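The masking schedule proposed for such an ablation can be sketched on the serialization side. The exact template of Section 3 is not reproduced here, so the layout below (field token, values joined by "; ", and [NONE] for an empty field) is an assumption, as are the function name, the sample record and its definition text.

```python
FIELD_ORDER = ["TERM", "SYN", "BT", "NT", "REL", "DEF"]

def serialize(record, masked=()):
    """Flatten a thesaurus record into one tagged string.
    Fields that are absent or listed in `masked` are rendered as [NONE]."""
    parts = []
    for field in FIELD_ORDER:
        values = [] if field in masked else record.get(field, [])
        text = "; ".join(values) if values else "[NONE]"
        parts.append(f"[{field}] {text}")
    return " ".join(parts)

# Hypothetical record: no synonyms, narrower terms or related terms are present.
record = {
    "TERM": ["BMP-2"],
    "BT": ["infantry fighting vehicle"],
    "DEF": ["hypothetical definition text"],
}

# Progressive masking: nothing, RT only, NT + RT, everything except Term and SYN.
schedule = [(), ("REL",), ("NT", "REL"), ("BT", "NT", "REL", "DEF")]
variants = [serialize(record, masked=m) for m in schedule]
for variant in variants:
    print(variant)
```

Evaluating the same trained encoder on each variant would give one point per masking level, from which a degradation curve per task could be read off.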

In a broader perspective, the proposed methodology can be applied to any structured terminological resource that complies with the international standard ISO 25964, which formalizes the organization of thesauri and includes the Zthes format. Potential applications also include the terminology databases that North Atlantic Treaty Organization member states use to implement STANAG standardization agreements. In such systems, thesauri play an important role in ensuring doctrinal and information interoperability.

In addition, the proposed framework can be extended to solve terminology management problems in multinational coalition contexts, where the alignment of concepts between different institutional thesauri remains a constant problem of semantic interoperability. The integration of structured terminological resources with neural language models opens up opportunities for creating semantic search and analysis systems that can take into account both distributed text representations and formalized knowledge of the subject area.

7. Conclusions

This paper presents a framework for integrating the structured knowledge of a military thesaurus into pre-trained transformer language models in order to address concept linking and semantic search over the terminology of the Ground Forces. The proposed framework implements a systematic mapping of Zthes thesaurus elements into text representations compatible with the input of transformer models, uses multi-task fine-tuning with three complementary learning objectives, and includes a dense retrieval pipeline based on a FAISS index for efficient concept linking at inference time.

The experiments conducted on the test set built from the terminological database of the Ground Forces of the Armed Forces of the Republic of Kazakhstan demonstrate consistent improvements over standard fine-tuning baselines. The best configuration, the XLM-RoBERTa model with full thesaurus enrichment, reaches F1 = 0.84 and Top-5 = 0.94 in the concept linking task and Macro-F1 = 0.84 in the relation classification task. The figures presented in Section 5 further confirm the consistent and architecture-independent nature of the improvements achieved through the integration of thesaurus knowledge.

The scientific novelty of the research lies in three main contributions. First, we propose a systematic mapping of the elements of the Zthes standard into representations suitable for transformer models, which makes the approach applicable to any terminological resource that meets the ISO 25964 standard. Second, we perform one of the first empirical evaluations of transformer-based concept linking and semantic search for Ground Forces military terminology in a trilingual setting (Russian, Kazakh and English). Third, we develop a reproducible multi-task learning protocol that surpasses single-task baselines on all the metrics considered without modifying the architecture of the base model.

Promising directions for further research include extending the thesaurus to rapidly developing military technologies, developing a protocol for evaluating semantic search based on graded relevance, and studying continual learning methods that allow the concept embedding space to be updated incrementally as the institutional thesaurus evolves. In a broader context, the proposed approach opens up opportunities for integrating knowledge organization systems and neural language models in semantic search, terminology management and cross-lingual interoperability in specialized subject areas.

Author Contributions

Conceptualization, B.A. and M.S.; methodology, B.A. and M.S.; software, R.T. and B.A.; validation, S.T., A.Y. and M.S.; formal analysis, E.D. and M.A.; investigation, M.S. and S.T.; resources, M.A., E.D. and A.Y.; data curation, B.A. and R.T.; writing—original draft preparation, B.A. and A.Y.; writing—review and editing, M.S.; visualization, A.S., E.D. and B.A.; supervision, A.Y.; project administration, A.Y. and M.S.; funding acquisition, A.Y. and A.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research has been funded by the Committee of Science of the Ministry of Science and Higher Education of the Republic of Kazakhstan, Grant AP23484329.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Gardazi, N.M.; Daud, A.; Malik, M.K.; Bukhari, A.; Alsahfi, T.; Alshemaimri, B. BERT applications in natural language processing: A review. Artif. Intell. Rev. 2025, 58, 166.
Abdygalym, B.; Sambetbayeva, M.; Yerimbetova, A.; Nekessova, A.; Tasbolatuly, N.; Smailov, N.; Nazymkhan, A. NLP Models for Military Terminology Analysis and Detection of Information Operations on Social Media. Computers 2025, 14, 485.
Mathur, V.; Dadu, T.; Aggarwal, S. Evaluating neural networks’ ability to generalize against adversarial attacks in cross-lingual settings. Appl. Sci. 2024, 14, 5440.
Mannion, A.; Schwab, D.; Goeuriot, L. UMLS-KGI-BERT: Data-centric knowledge integration in transformers for biomedical Entity recognition. In Proceedings of the 5th Clinical Natural Language Processing Workshop, Toronto, ON, Canada, 14 July 2023; pp. 312–322.
Lai, T.M.; Zhai, C.; Ji, H. KEBLM: Knowledge-enhanced biomedical Language models. J. Biomed. Inform. 2023, 143, 104392.
Rakhimova, D.; Turarbek, A.; Karyukin, V.; Sarsenbayeva, A.; Alieyev, R. Legal AI in Low-Resource Languages: Building and Evaluating QA Systems for the Kazakh Legislation. Computers 2025, 14, 354.
Ullah, F.; Gelbukh, A.; Zamir, M.T.; Riverón, E.M.F.; Sidorov, G. Enhancement of named entity recognition in low-resource languages with data augmentation and BERT models: A case study on Urdu. Computers 2024, 13, 258.
Zhukabayeva, T.; Ahmad, Z.; Yerimbetova, A.; Sambetbayeva, M.; Telman, D.; Bayangali, A.; Daiyrbayeva, E. A comprehensive Review of NLP techniques for Military Terminologies and Information Operations on Social Media. IEEE Access 2025, 13, 154930–154947.
Liu, X.; Yu, Z.; Liu, X.; Miao, L.; Yang, T. Military equipment entity extraction based on large language model. Appl. Sci. 2024, 14, 9063.
Zabala-López, A.; Linares-Vásquez, M.; Haiduc, S.; Donoso, Y. An analytical approach to named entity recognition for military aerospace intelligence. Decis. Anal. J. 2025, 16, 100613.
He, Y.; Zhu, Z.; Zhang, Y.; Chen, Q.; Caverlee, J. Infusing disease knowledge into BERT for health question answering, medical inference and disease name recognition. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 4604–4614.
Naseem, U.; Khushi, M.; Reddy, V.; Rajendran, S.; Razzak, I.; Kim, J. Bioalbert: A simple and effective pre-trained language model for biomedical named entity recognition. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN); IEEE: New York, NY, USA, 2021; pp. 1–7.
Li, Y.; Liu, L.; Zheng, J. A method for named entity recognition in military intelligence domain using large language models. J. Electron. Inf. Technol. 2026, 48, 662–672.
Adha, R.I.; Mardamsyah, A.; Phatoni, K.I. Big data analytics framework for defense strategic intelligence and decision support systems. J. Def. Technol. Eng. 2026, 1, 113–128.
Zthes. The Zthes Specifications for Thesaurus Representation, Access and Navigation. Available online: https://zthes.z3950.org/ (accessed on 5 February 2026).
Sajun, A.R.; Zualkernan, I.; Sankalpa, D. A historical survey of advances in transformer architectures. Appl. Sci. 2024, 14, 4316.
Zhang, H.; Shafiq, M.O. Survey of transformers and toward ensemble learning using transformers for natural language processing. J. Big Data 2024, 11, 25.
Luo, X.; Deng, Z.; Yang, B.; Luo, M.Y. Pre-trained language models in medicine: A survey. Artif. Intell. Med. 2024, 154, 102904.
Carmona, V.S.; Jiang, S.; Dong, B. A multilevel analysis of PubMed-only Bert-based biomedical models. In Proceedings of the 6th Clinical Natural Language Processing Workshop, Mexico City, Mexico, 21 June 2024; pp. 105–110.
Rostam, Z.R.K.; Kertész, G. Fine-tuning large language models for scientific text classification: A comparative study. In Proceedings of the 2024 IEEE 6th International Symposium on Logistics and Industrial Informatics (LINDI); IEEE: New York, NY, USA, 2024; pp. 233–238.
Licari, D.; Comandè, G. ITALIAN-LEGAL-Bert models for improving natural language processing tasks in the Italian legal domain. Comput. Law Secur. Rev. 2024, 52, 105908.
Zhang, Z.; Han, X.; Liu, Z.; Jiang, X.; Sun, M.; Liu, Q. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 1441–1451.
Cadeddu, A.; Chessa, A.; De Leo, V.; Fenu, G.; Motta, E.; Osborne, F.; Recupero, D.R.; Salatino, A.; Secchi, L. A comparative analysis of knowledge injection strategies for large language models in the scholarly domain. Eng. Appl. Artif. Intell. 2024, 133, 108166.
Gupta, S.; Ranjan, R.; Singh, S.N. A comprehensive survey of retrieval-augmented generation (rag): Evolution, current landscape and future directions. arXiv 2024, arXiv:2410.12837.
Gnecco, D.P.; Serrano, J.; Puertas, E.; Martinez-Santos, J.C. Hybrid re-ranking for biomedical entity linking using SapBERT embeddings: A high-performance system for BioNNE-L 2025-1. In Proceedings of the Conference and Labs of the Evaluation Forum (CLEF 2025), Madrid, Spain, 9–12 September 2025.
ISO 25964; Focus Group (NKOS Workshop). National Information Standards Organization (NISO): Baltimore, MD, USA, 2013. Available online: https://www.niso.org/schemas/iso25964 (accessed on 26 January 2026).
SKOS Simple Knowledge Organization System Reference. Available online: https://www.w3.org/TR/skos-reference/ (accessed on 13 February 2026).
Aggarwal, T.; Salatino, A.; Osborne, F.; Motta, E. Large language models for scholarly ontology generation: An extensive analysis in the engineering field. Inf. Process. Manag. 2026, 63, 104262.
Kraus, F.; Blumenröhr, N.; Tonne, D.; Streit, A. Mind the Language gap in Digital Humanities: LLM-Aided Translation of SKOS Thesauri. arXiv 2025, arXiv:2507.19537.
Teze, V.; Nazaruka, E. Future directions in Defence NLP: Investigating Research gaps for Low-Resource Languages. In The International Baltic Conference on Digital Business and Intelligent Systems; Springer: Cham, Switzerland, 2024; pp. 93–105.
Abdygalym, B.K.; Adali, E.; Sambetbayeva, M.A.; Sadirmekova, Z.B.; Nazymkhan, A.A. A conceptual model for ontology-based detection of information operations in digital media. Bull. Shakarim Univ. Tech. Sci. 2025, 1, 36–44.
Xu, Z.; Mo, F.; Huang, Z.; Zhang, C.; Yu, P.; Wang, B.; Lin, J.; Srikumar, V. A survey of model architectures in information retrieval. arXiv 2025, arXiv:2502.14822. [Google Scholar] [CrossRef]
Reichman, B.; Heck, L. Dense passage retrieval: Is it retrieving? In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; pp. 13540–13553. [Google Scholar]
Nguyen, H.; Le, T.H. Enhancing Colbert: A method for reducing Space complexity and accelerating Retrieval Speed. In Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation, Tokyo, Japan, 7–9 December 2024; pp. 820–829. [Google Scholar]
Banar, N.; Lotfi, E.; Daelemans, W. BEIR-nl: Zero-shot information retrieval benchmark for the dutch language. arXiv 2024, arXiv:2412.08329. [Google Scholar]
NATOTerm. Available online: https://nso.nato.int/natoterm/content/nato/pages/home.html?lg=en (accessed on 15 January 2026).
Li, Z.; Shimada, K. Semantic meaning or Script Shape? A Comparative Study of Cross-lingual Transfer in mBERT and PIXEL. In Proceedings of the 39th Pacific Asia Conference on Language, Information and Computation, Hanoi, Vietnam, 5–7 December 2025; pp. 556–564. [Google Scholar]
Li, H.; Cai, D.; Qu, Z.; Cui, Q.; Kamigaito, H.; Liu, L.; Watanabe, T. Cross-lingual contextualized phrase retrieval. J. Nat. Lang. Process. 2025, 32, 886–917. [Google Scholar] [CrossRef]
Elmahdy, A.; Lin, S.C.; Ahmad, A. Synergistic approach for simultaneous optimization of monolingual, cross-lingual, and multilingual information retrieval. arXiv 2024, arXiv:2408.10536. [Google Scholar] [CrossRef]
Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2019, arXiv:1711.05101. [Google Scholar] [CrossRef]
Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805. [Google Scholar] [CrossRef]
Gao, T.; Yao, X.; Chen, D. Simcse: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 6894–6910. [Google Scholar]
Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; Chapman and Hall/CRC: Boca Raton, FL, USA, 1994. [Google Scholar]
Dror, R.; Baumer, G.; Shlomov, S.; Reichart, R. The hitchhiker’s guide to testing statistical significance in natural language processing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 1383–1392. [Google Scholar]
Chen, S.; Zhang, Y.; Yang, Q. Multi-task learning in natural language processing: An overview. ACM Comput. Surv. 2024, 56, 1–32. [Google Scholar] [CrossRef]
Zhang, C.; Tu, L.; Yang, Z.; Du, B.; Zhou, Z.; Wu, J.; Chen, L. A CMMOG-based lithium-battery SOH estimation method using multi-task learning framework. J. Energy Storage 2025, 107, 114884. [Google Scholar] [CrossRef]

Figure 1. Architecture of the proposed thesaurus relationship classification pipeline. Arrows denote the direction of data flow between the main stages of the framework, including data preparation, model training, indexing, and final relation ranking.

Figure 2. Concept linking: precision, recall, and F1 for baseline vs. thesaurus-enriched models.

Figure 3. Top-1 and Top-5 accuracy for baseline vs. thesaurus-enriched models.

Figure 4. Relation classification: macro precision, recall, and F1 without context vs. with thesaurus context.

Figure 5. Absolute metric gains from thesaurus enrichment (thesaurus-enriched minus baseline), by model and task.

Table 1. Mapping of Zthes thesaurus elements to transformer input representations.

Zthes Element	Semantic Role	Transformer Input Representation	Example (Army)
Term	Basic concept	[TERM] <text>	Infantry fighting vehicle
TermID	Unique identifier	[ID_<code>]	[ID_MIL001]
UF (Used for)	Synonyms	[SYN] <list>	[SYN] infantry fighting vehicle
BT (Broader term)	Hyperonym	[BT] <parent term>	[BT] combat equipment
NT (Narrower term)	Hyponym	[NT] <child term>	[NT] BMP-2; BMP-3
RT (Related term)	Associative relationship	[REL] <term>	[REL] Motorized rifle unit
ScopeNote	Definition	[DEF] <text>	[DEF] Armored tracked vehicle…
Language	Language tag	[LANG_XX]	[LANG_RU]; [LANG_KK]; [LANG_EN]
Source	Source of knowledge	[SRC_<code>]	[SRC_MIL_DICT]
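
As a rough illustration of the mapping in Table 1, the Python sketch below serializes one thesaurus record into a marker-tagged string. The `ThesaurusEntry` class, its field names, and `serialize_entry` are hypothetical names introduced here; the marker order and the "; " list separator are assumptions, not the authors' exact encoding.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ThesaurusEntry:
    """Hypothetical container for one Zthes term record."""
    term: str
    term_id: str
    language: str                                        # e.g. "EN", "RU", "KK"
    synonyms: List[str] = field(default_factory=list)    # UF
    broader: List[str] = field(default_factory=list)     # BT
    narrower: List[str] = field(default_factory=list)    # NT
    related: List[str] = field(default_factory=list)     # RT
    scope_note: str = ""                                 # ScopeNote
    source: str = ""                                     # Source code


def serialize_entry(e: ThesaurusEntry) -> str:
    """Join the non-empty fields into one string using the Table 1 markers."""
    parts = [f"[LANG_{e.language}]", f"[ID_{e.term_id}]", f"[TERM] {e.term}"]
    for marker, values in (("[SYN]", e.synonyms), ("[BT]", e.broader),
                           ("[NT]", e.narrower), ("[REL]", e.related)):
        if values:  # skip relations the record does not define
            parts.append(marker + " " + "; ".join(values))
    if e.scope_note:
        parts.append(f"[DEF] {e.scope_note}")
    if e.source:
        parts.append(f"[SRC_{e.source}]")
    return " ".join(parts)


entry = ThesaurusEntry(
    term="Infantry fighting vehicle", term_id="MIL001", language="EN",
    broader=["combat equipment"], narrower=["BMP-2", "BMP-3"],
    related=["Motorized rifle unit"], source="MIL_DICT",
)
print(serialize_entry(entry))
```

In practice the markers would also need to be registered as special tokens in the tokenizer so that they are not split into subwords; that step is omitted from this sketch.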

Table 2. Dataset components extracted from the Ground Forces thesaurus.

Dataset Component	Source in Thesaurus	Training Objective	Example
Synonym pair	UF (Used for)	Synonym alignment	BMP ↔ Infantry fighting vehicle
Hyperonymy relation	BT (Broader term)	Hyperonym prediction	BMP → combat equipment
Hyponymy relation	NT (Narrower term)	Hyponym ranking	Military equipment → BMP-2
Associative relation	RT (Related term)	Relation classification	BMP ↔ motorized rifle unit
Definition text	ScopeNote	Contextual encoding	BMP is an armored vehicle for transporting infantry
Negative pair	Random sampling	Contrastive negative	BMP is a radar station
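
The extraction summarized in Table 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the toy `THESAURUS` dictionary, the function names, and the rule that a negative is any pair of distinct terms with no direct link are hypothetical, and the authors' sampling procedure may differ.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical toy thesaurus: term -> relation code -> linked terms.
THESAURUS: Dict[str, Dict[str, List[str]]] = {
    "BMP": {"UF": ["Infantry fighting vehicle"],
            "BT": ["combat equipment"],
            "RT": ["motorized rifle unit"]},
    "combat equipment": {"NT": ["BMP"]},
    "motorized rifle unit": {"RT": ["BMP"]},
    "radar station": {},
}

Example = Tuple[str, str, str]  # (term_a, term_b, label)


def positive_pairs(thesaurus: Dict[str, Dict[str, List[str]]]) -> List[Example]:
    """Emit one labelled pair per UF/BT/NT/RT link."""
    return [(term, other, rel)
            for term, rels in thesaurus.items()
            for rel, others in rels.items()
            for other in others]


def negative_pairs(thesaurus: Dict[str, Dict[str, List[str]]],
                   n: int, seed: int = 0) -> List[Example]:
    """Randomly sample pairs of distinct terms that share no direct link."""
    rng = random.Random(seed)
    terms = sorted(thesaurus)
    linked = {(a, b) for a, b, _ in positive_pairs(thesaurus)}
    linked |= {(b, a) for a, b in linked}  # treat links as symmetric here
    candidates = [(a, b) for a in terms for b in terms
                  if a != b and (a, b) not in linked]
    chosen = rng.sample(candidates, min(n, len(candidates)))
    return [(a, b, "NEG") for a, b in chosen]


for example in positive_pairs(THESAURUS) + negative_pairs(THESAURUS, n=2):
    print(example)
```

Fixing the random seed keeps the sampled negatives reproducible across runs, which matters when results are averaged over several training runs.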

Table 3. Size of the training corpus.

Dataset Component	Number of Samples	Example
Terms (thesaurus concepts)	4500	military installations
Synonym pairs (UF)	3200	APC ↔ armored personnel carrier
Hyperonymy relationships (BT)	2100	tank → armored vehicles
Hyponymy relationships (NT)	2300	T-72 tank
Associative relationships (RT)	1800	artillery ↔ fire support
Definition texts (ScopeNote)	4500	text descriptions of concepts
Negative pairs (random)	4500	random unrelated pairs
Total training samples	~18,400	-
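
A quick arithmetic check of Table 3: the reported total of ~18,400 equals the sum of the six pair, definition, and negative components, which suggests that the 4500 thesaurus terms are listed as the concept inventory rather than counted as additional training instances. This reading is an inference from the numbers, not something the table states. The short keys below are labels chosen for this sketch.

```python
# Sample counts copied from Table 3, keyed by short labels.
counts = {
    "Terms": 4500,
    "UF": 3200,
    "BT": 2100,
    "NT": 2300,
    "RT": 1800,
    "ScopeNote": 4500,
    "Negative": 4500,
}

total_all = sum(counts.values())
total_without_terms = total_all - counts["Terms"]

print(total_all)            # 22900
print(total_without_terms)  # 18400
```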

Table 4. Multi-task training objectives and evaluation tasks.

Training Objective	Input Format	Model Goal	Evaluation Metric
Synonym detection	term₁, term₂	Determine equivalence	Accuracy/F1
Hyperonym prediction	child term + context	Predict the parent concept	Top-k Accuracy
Relation classification	term₁ + term₂	Classify the relation type	Precision/Recall/F1
Definition understanding	Term + ScopeNote	Semantic alignment	Cosine similarity
Concept linking (output)	mention from the text	Retrieve the thesaurus concept	F1, Top-1, Top-5
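
Two of the metrics named in Table 4, cosine similarity and Top-k accuracy, can be illustrated with a short Python sketch. The function names and the toy concept IDs are hypothetical and serve only to show how each metric is computed; this is not the authors' evaluation code.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_accuracy(ranked: List[List[str]], gold: List[str], k: int) -> float:
    """Share of queries whose gold concept appears among the first k candidates."""
    hits = sum(1 for candidates, g in zip(ranked, gold) if g in candidates[:k])
    return hits / len(gold)


# Toy ranked candidate lists for three mentions (IDs are made up).
ranked = [
    ["MIL001", "MIL007", "MIL003"],
    ["MIL010", "MIL002", "MIL004"],
    ["MIL005", "MIL006", "MIL008"],
]
gold = ["MIL001", "MIL002", "MIL009"]

print(top_k_accuracy(ranked, gold, k=1))
print(top_k_accuracy(ranked, gold, k=3))
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
```

Top-k accuracy can only stay equal or rise as k grows, which is why the Top-5 scores reported for concept linking are consistently higher than the Top-1 scores.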

Table 5. Concept linking results (test set, N = 250 instances).

Model	Training Condition	Precision	Recall	F1	Top-1	Top-5
KazBERT	Baseline (FT)	0.81 ± 0.008	0.77 ± 0.009	0.79 ± 0.007	0.74 ± 0.010	0.90 ± 0.006
KazBERT	FT + Thesaurus	0.84 ± 0.006	0.81 ± 0.007	0.83 ± 0.005	0.79 ± 0.008	0.93 ± 0.005
mBERT	Baseline (FT)	0.79 ± 0.009	0.75 ± 0.010	0.77 ± 0.008	0.71 ± 0.011	0.88 ± 0.007
mBERT	FT + Thesaurus	0.82 ± 0.007	0.79 ± 0.008	0.80 ± 0.006	0.76 ± 0.009	0.91 ± 0.006
XLM-RoBERTa	Baseline (FT)	0.83 ± 0.007	0.78 ± 0.008	0.80 ± 0.006	0.75 ± 0.009	0.92 ± 0.005
XLM-RoBERTa	FT + Thesaurus	0.86 ± 0.005	0.82 ± 0.006	0.84 ± 0.005	0.80 ± 0.007	0.94 ± 0.004
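
The absolute gains plotted in Figure 5 follow directly from the means in Table 5. The sketch below recomputes them as the thesaurus-enriched score minus the baseline score; the dictionary layout and variable names are choices made for this illustration, and the standard deviations are left out.

```python
# Mean F1, Top-1, and Top-5 values copied from Table 5.
results = {
    "KazBERT": {
        "baseline": {"F1": 0.79, "Top-1": 0.74, "Top-5": 0.90},
        "thesaurus": {"F1": 0.83, "Top-1": 0.79, "Top-5": 0.93},
    },
    "mBERT": {
        "baseline": {"F1": 0.77, "Top-1": 0.71, "Top-5": 0.88},
        "thesaurus": {"F1": 0.80, "Top-1": 0.76, "Top-5": 0.91},
    },
    "XLM-RoBERTa": {
        "baseline": {"F1": 0.80, "Top-1": 0.75, "Top-5": 0.92},
        "thesaurus": {"F1": 0.84, "Top-1": 0.80, "Top-5": 0.94},
    },
}

# Absolute gain per model and metric, rounded to the table's precision.
gains = {
    model: {
        metric: round(runs["thesaurus"][metric] - runs["baseline"][metric], 2)
        for metric in runs["baseline"]
    }
    for model, runs in results.items()
}

for model, g in gains.items():
    print(model, g)
```

On these figures, Top-1 accuracy rises by 0.05 for every model, while the F1 gains are 0.03 to 0.04 and the Top-5 gains 0.02 to 0.03.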

Table 6. Relation classification results (macro-averaged metrics).

Model	Input Format	Macro Precision	Macro Recall	Macro F1
KazBERT	term₁ [SEP] term₂ (no context)	0.78	0.74	0.76
KazBERT	term₁ + (UF/DEF) [SEP] term₂ + (BT/RT)	0.82	0.80	0.81
mBERT	term₁ [SEP] term₂ (no context)	0.76	0.72	0.74
mBERT	term₁ + (UF/DEF) [SEP] term₂ + (BT/RT)	0.79	0.77	0.78
XLM-RoBERTa	term₁ [SEP] term₂ (no context)	0.83	0.79	0.81
XLM-RoBERTa	term₁ + full thesaurus context	0.86	0.83	0.84

Table 7. Comparison of single-objective and multi-objective training using XLM-RoBERTa (mean ± standard deviation across five runs).

Learning Configuration	F1 Concept Linking	Top-1 Concept Linking	Macro-F1 Relationship Classification
Single-objective: $L_{syn}$ only	0.78 ± 0.008	0.73 ± 0.010	0.71 ± 0.010
Single-objective: $L_{hier}$ only	0.74 ± 0.009	0.69 ± 0.011	0.75 ± 0.009
Single-objective: $L_{rel}$ only	0.71 ± 0.010	0.65 ± 0.012	0.79 ± 0.008
Two-objective: $L_{syn}$ + $L_{hier}$	0.81 ± 0.006	0.76 ± 0.008	0.78 ± 0.008
Two-objective: $L_{syn}$ + $L_{rel}$	0.79 ± 0.007	0.74 ± 0.009	0.80 ± 0.007
Two-objective: $L_{hier}$ + $L_{rel}$	0.78 ± 0.007	0.73 ± 0.009	0.81 ± 0.007
Multi-objective: $L_{syn}$ + $L_{hier}$ + $L_{rel}$ (proposed)	0.84 ± 0.005	0.80 ± 0.007	0.84 ± 0.005

Table 8. Error analysis by relation type.

Query Relation Type	N (Test)	Precision	Recall	F1	Primary Error Type
SYN (synonym)	68	0.92	0.90	0.91	Variant surface form missing from the UF list
BT (hypernym search)	47	0.87	0.84	0.85	Confusion with a related BT concept
NT (hyponym search)	52	0.79	0.76	0.77	NT–NT confusion within a single parent concept
RT (associative search)	44	0.74	0.70	0.72	RT confused with NT within a single subtree
DEF (definition-based search)	39	0.88	0.85	0.86	Ambiguous terms with similar definitions
Total (macro avg)	250	0.84	0.81	0.84
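
For reference, the Total row of Table 8 can be recomputed from the per-type rows with an unweighted (macro) mean. The sketch below assumes that this is how the total was obtained, which the table does not state. Under that assumption the precision and recall totals are reproduced (0.84 and 0.81), while the mean of the five per-type F1 values comes to about 0.82 rather than the reported 0.84; the difference may reflect averaging over runs or another aggregation.

```python
# (N, precision, recall, F1) per query relation type, copied from Table 8.
rows = {
    "SYN": (68, 0.92, 0.90, 0.91),
    "BT": (47, 0.87, 0.84, 0.85),
    "NT": (52, 0.79, 0.76, 0.77),
    "RT": (44, 0.74, 0.70, 0.72),
    "DEF": (39, 0.88, 0.85, 0.86),
}

n_total = sum(r[0] for r in rows.values())


def macro_mean(column: int) -> float:
    """Unweighted mean of one metric column across relation types."""
    return sum(r[column] for r in rows.values()) / len(rows)


macro_precision = macro_mean(1)
macro_recall = macro_mean(2)
macro_f1 = macro_mean(3)

print(n_total)
print(round(macro_precision, 3), round(macro_recall, 3), round(macro_f1, 3))
```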

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Abdygalym, B.; Tazhibayeva, S.; Sambetbayeva, M.; Yerimbetova, A.; Taberkhan, R.; Abjalova, M.; Sabdenov, A.; Daiyrbayeva, E. Integrating Thesaurus-Based Knowledge into Transformer Models for Semantic Understanding of Domain-Specific Texts. Computers 2026, 15, 297. https://doi.org/10.3390/computers15050297

