1. Introduction
Domain-specific natural language processing in security-critical, high-technology areas such as defense, aerospace and military operations raises a number of specific methodological problems. The military domain in particular combines properties that make it a significant and complex context for Natural Language Processing (NLP) research [1]. First, in military discourse, terminological inaccuracy is not just an academic inconvenience; it can have operational consequences [2]. The same term, such as “strike asset”, “tactical unit” or “zone of responsibility”, can refer to different concepts depending on the doctrinal context, chain of command or level of operations (strategic, operational, or tactical). Such polysemy, deeply rooted in institutional use, creates a level of semantic complexity that general-purpose NLP models handle poorly [3]. This sharply distinguishes the military domain from, for example, legal or medical NLP [4,5], where terminological standards, although strict, are documented more uniformly and have been studied in corpus linguistics for decades. Second, the trilingual operating environment of the Armed Forces of the Republic of Kazakhstan (Kazakh, Russian and English) places high demands on cross-lingual consistency, which is especially difficult given the low-resource status of the Kazakh language [6,7] in the NLP community and the almost complete absence of public military corpora [8,9,10]. Third, unlike many specialized fields that either lack structured terminological resources or have long been the subject of intensive computational research, the military field occupies an unusual position: it has well-designed, institutionally supported thesauri and classification standards, but these resources have not yet been systematically integrated with modern neural architectures. This combination of formalized terminological infrastructure, insufficient research, multilingual complexity and high accuracy requirements makes the military field not just an application example, but a rigorous and meaningful test environment for the methodology proposed in this paper. The systematic integration of these resources with modern neural language models nevertheless remains an insufficiently studied task.
In fact, this limitation is multidimensional; the main factors are as follows:
Although the biomedical NLP community has made significant advances in integrating structured knowledge using resources such as UMLS and MeSH (e.g., KnowledgeBERT [11]; BioALBERT [12]), the integration of thesauri with transformer models in the military, defense and security fields has received little systematic study. Existing research in military NLP mainly focuses on named entity recognition and information extraction [13,14], while concept linking and semantic search based on structured terminological resources remain largely unexplored.
Structured terminological standards such as Zthes [15] or SKOS (Simple Knowledge Organization System) encode relational knowledge (hierarchical, associative, and definitional) in formal XML schemas that are not directly compatible with transformer architectures, which operate on linear text input. At present, there is no standardized pipeline for systematically transforming thesaurus elements, such as BT (Broader Term), NT (Narrower Term), RT (Related Term), and ScopeNote, into linguistically coherent text sequences suitable for fine-tuning models. This is a specific technical gap, distinct from the problem of data scarcity.
The lack of publicly available labeled datasets built from institutional terminological resources in multilingual specialized fields (for example, Kazakh–Russian–English military terminology) further limits reproducible research in this field. Most existing benchmarks for knowledge-enriched NLP are English-language and focused on the general domain.
The core research contributions of the work are as follows:
A formalized mapping of elements of the Zthes standard [
15] into text representations compatible with the input of transformer models is proposed, which allows for the systematic injection of structured knowledge without changing the architecture of the base model.
A training dataset of 18,400 relational examples was built, automatically extracted from the thesaurus of the Ground Forces of the Armed Forces of the Republic of Kazakhstan and organized into six types of tasks.
A comparative experimental study of three multilingual transformer models (KazBERT, mBERT and XLM-RoBERTa) was carried out under baseline and thesaurus-enriched conditions, with evaluation by precision, recall, F1-measure and top-k accuracy.
A reproducible evaluation protocol for concept linking in the military domain is proposed, demonstrating stable cross-lingual generalization across Russian, Kazakh and English.
2. Review of Literature
The emergence of large pre-trained language models based on the Transformer architecture [16] fundamentally changed the paradigm of natural language processing. The BERT [1] and RoBERTa [17] models have demonstrated high effectiveness across a wide range of tasks thanks to self-supervised pre-training on large text corpora. However, when applied to highly specialized domains, general pre-training is insufficient, since it does not provide adequate coverage of professional terminology and the conceptual specificity of domain texts.
In response to this problem, domain adaptation through additional pre-training of models on specialized corpora was proposed. The BioBERT [18] and PubMedBERT [19] models established this paradigm for the biomedical domain, demonstrating consistent improvements over general-domain models in named entity recognition, relation extraction, and question answering. SciBERT [20] extended this approach to scientific texts, while LegalBERT [21] adapted the transformer architecture to the legal domain. In the military context, [8] showed that domain pre-training increases F1 by 8–12 points for entities of the “technics” and “compounds” types in Russian-language military texts. Nevertheless, these models share a fundamental limitation: they internalize statistical patterns of texts but do not form explicit, structured representations of domain knowledge, such as the hierarchical, associative, and definitional relationships that are systematically recorded in controlled vocabularies and thesauri.
Existing approaches to enriching neural language models with external knowledge can be broadly divided into two directions: knowledge injection at the pre-training stage and knowledge injection at the fine-tuning stage. Within the first direction, the ERNIE model [22] was one of the first systems to inject knowledge at the entity level by masking knowledge base entities during pre-training. The K-BERT architecture proposed injecting knowledge triples directly into the input sentence structure, whereas the KEPLER model [23] combined masked language modeling and knowledge representation learning in a single multi-task objective, aligning distributed and structured representations.
Among the methods of the second direction, retrieval-augmented architectures became an important stage of development. The RAG model [24] dynamically retrieves relevant knowledge during inference, and SapBERT [25] applies self-alignment objectives to pairs of synonyms from UMLS to learn representations of biomedical entities that are robust to surface-form variability. Conceptually, this approach is closest to the strategy implemented in this work.
However, a significant limitation of these methods is that they are designed primarily for large universal knowledge graphs, such as Wikidata, Freebase, or UMLS. These systems do not address the task of integrating a specialized domain-controlled thesaurus with a complex hierarchical and associative structure characteristic of professional areas with limited resources.
The use of thesauri and ontologies to enrich NLP systems has a solid theoretical basis in the science of knowledge organization. ISO 25964 [
26] defines a thesaurus as a structured system of controlled indexing language formalizing equivalence relations (USE/UF (Use-For)), hierarchy (BT/NT) and association (RT (Related Term)) between concepts. The SKOS model [
27], developed by the W3C consortium, provides the serialization of these relationships in RDF format, which makes possible machine-readable integration of thesaurus structures into the semantic web infrastructure.
Recent literature shows a steady transition from static alphabetical dictionaries to concept-oriented, machine-processable knowledge structures. Salatino et al. [28] showed that the use of thesauri and ontologies significantly increases the suitability of information resources for artificial intelligence and semantic analytics systems. Kraus et al. [29] proposed an LLM-oriented pipeline for automated translation of SKOS thesauri between languages and noted a systematic decline in ontology matching quality for non-English resources, pointing to the need for specialized tools for working with multilingual thesauri.
In the applied military context, we developed [9] a hybrid architecture that combines POS tagging and neural methods for automatic term extraction from military text corpora. Teze and Nazaruka [30] proposed a semantic intersection method for cross-lingual alignment of thesauri in the defense field using transformer-based refinement. Ref. [31] described the Zthes-compliant schema of the trilingual terminological database of the Ministry of Defense of the Republic of Kazakhstan, which provides the immediate context for the present study.
Despite this progress, most existing approaches use thesauri primarily as auxiliary resources for pre-processing or as static reference systems. Hierarchical and associative thesaurus signals (BT/NT/RT) are rarely integrated directly into the representation learning of neural models.
The paradigm shift from sparse retrieval to dense retrieval, in which queries and documents are encoded in a shared embedding space and ranked by cosine similarity, was proposed in [32] and further developed in the DPR [33] and ColBERT [34] architectures. The BEIR benchmark [35] showed that dense retrieval models trained on general-domain corpora lose significant quality when transferred to specialized domain collections, which confirms the need for domain-oriented representation learning.
In the military NLP context, the Multi_mil [2] corpus, a multilingual resource in Kazakh, Russian and English, provides an empirical basis for evaluating such systems. The review published in IEEE Access [8] records a steady increase in interest in named entity recognition, topic modeling and other NLP approaches for the analysis of military terminology. At the same time, the authors note the lack of structured thesaurus resources as one of the key factors limiting the development of semantic interoperability and the quality of information retrieval.
The problem of cross-language terminological alignment is particularly relevant in multilingual professional domains, where concepts do not have direct correspondences between languages due to doctrinal, organizational and cultural differences. In the military field, terms such as Squad or Platoon do not always have direct structural equivalents in various national military systems, which requires contextual and hierarchical alignment of concepts instead of direct lexical translation. NATO’s terminological infrastructure, including NATOTerm [
36], demonstrates an institutional approach to multilingual terminology standardization, in which the thesaurus is not an auxiliary dictionary, but an element of the infrastructure of doctrinal and information compatibility.
At the computational level, multilingual models such as mBERT [37], LaBSE [38] and XLM-RoBERTa [39] provide zero-shot cross-lingual transfer for classification and retrieval tasks. However, like their monolingual counterparts, they do not use the structured equivalence and hierarchy signals characteristic of a controlled thesaurus. Thus, the integration of SKOS-encoded multilingual thesaurus knowledge into the fine-tuning of cross-lingual transformer models remains a poorly researched direction.
The analysis of the literature reveals a multidimensional research gap. First, domain adaptation methods for transformer models, despite their empirical effectiveness, do not provide access to explicit structured terminological knowledge. Second, existing knowledge integration frameworks (ERNIE, K-BERT, KEPLER, and SapBERT) focus mainly on large universal knowledge graphs and do not address the integration of a domain-controlled thesaurus with a full relational BT/NT/RT/UF structure. These relation types are not semantically interchangeable. BT and NT relations are inherently directional and asymmetric; they encode inclusion hierarchies in which the direction of the relation determines taxonomic depth and conceptual scope. A model that cannot distinguish BT from NT will draw structurally inverted conclusions, treating the subordinate concept as superordinate, which directly reduces concept linking accuracy. RT relations, by contrast, encode a non-hierarchical associative affinity that subject experts have explicitly recognized as semantically significant; such associations cannot be reliably recovered from distributional co-occurrence in low-resource corpora, where functionally related terms may rarely appear in shared contexts. Existing integration frameworks, including SapBERT and K-BERT, reduce all relational knowledge to a single undifferentiated similarity signal and can therefore teach a model only that two concepts are connected, not how they are connected. The proposed work removes this limitation by preserving the typed relational structure of the thesaurus throughout fine-tuning and treating BT, NT, and RT as distinct learning objectives rather than reducing them to a common similarity criterion.
Third, military NLP, despite the growing number of publications, still lacks structured thesaurus resources capable of simultaneously supporting semantic representation learning, information retrieval and cross-lingual interoperability. Fourth, the specific combination of trilingual coverage (Russian–English–Kazakh), a standardized thesaurus structure and integration with transformer architectures has hardly been considered in the existing literature.
This work aims to close this research gap. The article proposes and experimentally tests a methodology for integrating a structured trilingual domain thesaurus of Ground Forces terminology into the fine-tuning of a multilingual transformer model. The thesaurus provides relational BT/NT/RT signals and synonym pairs that are incorporated into the learning process through a specialized training objective, making the model sensitive to the conceptual hierarchy and associative structure of military terminology. The proposed approach differs from previous work in the completeness with which it integrates the main types of thesaurus relations, in its trilingual coverage, and in its systematic experimental evaluation on semantic search and concept linking tasks in the military domain.
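The directional nature of BT/NT and the symmetric nature of RT can be made concrete with a small sketch of how typed training triples might be derived from thesaurus records. This is a minimal illustration, not the authors' extraction procedure: the dictionary layout, the function name `typed_pairs`, and the example entries other than the BMP-2/BMP-3 hyponyms are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy record: concept -> {relation type -> related terms}.
# Only the BMP-2/BMP-3 hyponyms come from the paper; the rest is illustrative.
THESAURUS: Dict[str, Dict[str, List[str]]] = {
    "infantry fighting vehicle": {
        "BT": ["armored vehicle"],
        "NT": ["BMP-2", "BMP-3"],
        "RT": ["mechanized infantry"],
        "UF": ["IFV"],
    },
}

# Reciprocal relation types: BT/NT and USE/UF are inverses, RT is symmetric.
INVERSE = {"BT": "NT", "NT": "BT", "RT": "RT", "UF": "USE"}


def typed_pairs(thesaurus: Dict[str, Dict[str, List[str]]]) -> List[Tuple[str, str, str]]:
    """Return (head, tail, relation) triples, adding the reciprocal of each one."""
    triples = set()
    for head, relations in thesaurus.items():
        for rel, tails in relations.items():
            for tail in tails:
                triples.add((head, tail, rel))
                # The label changes with direction, so BT and NT stay distinct.
                triples.add((tail, head, INVERSE[rel]))
    return sorted(triples)


if __name__ == "__main__":
    for triple in typed_pairs(THESAURUS):
        print(triple)
```

Keeping the relation label on every triple is what allows a model to be trained on how two concepts are connected, rather than only on whether they are connected.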
3. Ground Forces Thesaurus: Structure and Knowledge Extraction
The Ground Forces thesaurus used in the present study covers about 4500 concepts relating to various aspects of ground military activities, including Ground Forces equipment, organizational structures, operational concepts, terrain classification, fire support systems, communications and electronic warfare, as well as logistics and control elements.
Although this figure looks modest compared to large general-purpose knowledge bases such as Wikidata or UMLS, it is representative of institutionally maintained, operationally oriented subject thesauri. In particular, in the field of defense and military terminology, the reported size of national thesauri created within a single institution usually varies from about 300 to 1500 concepts [30], which places the resource in question in the middle range of this category. NATOTerm, the most comprehensive multilateral system of defense terminology, covers about 14,000 terms, a scale achievable only through decades of multinational editorial coordination and therefore not a realistic reference point for national-level resources. The 18,400 training examples derived from the thesaurus reflect a deliberately chosen low-resource experimental setting; their structured and typed relational nature distinguishes them from free-text annotations and makes each example a more informative signal than is usually the case in general-purpose fine-tuning datasets.
The Ground Forces Thesaurus was developed by a group of military subject matter specialists and terminologists affiliated with the Ministry of Defense of the Republic of Kazakhstan, as described in [
31]. The construction process was based on a structured expert-oriented methodology, in which candidates for terms were first selected from authoritative doctrinal sources, including official military regulations, operational manuals and standardized catalogs of weapons and equipment published by the Armed Forces of the Republic of Kazakhstan. The selected terms were then verified and validated by subject experts with up-to-date military expertise to ensure doctrinal accuracy and operational relevance. Hierarchical relationships (BT/NT) were set on the basis of established doctrinal classification schemes and further checked for consistency with the official taxonomies of the organizational structure and armament of the Ground Forces. Associative relationships (RT) were determined on the basis of expert consensus of terminologists and reflected functionally and operationally significant conceptual relationships that are not reduced to hierarchical inclusion. Synonymous records (UF) were formed on the basis of variant designations found in official documents in Kazakh, Russian and English. The tri-lingual coverage was provided by parallel expert annotation, rather than automatic translation, to preserve doctrinal equivalence, not just lexical correspondence. The resulting resource was encoded in Zthes XML format and validated for schema compliance prior to use in this study. No automated term extraction tools were used to construct the relational structure; BT, NT, and RT are expert terminological judgments, not statistically derived associations.
Table 1 summarizes the semantic role of each element of the Zthes standard and the corresponding text representation used within the proposed framework. The mapping is designed to preserve as much of the relational information of the thesaurus as possible when constructing text sequences compatible with the tokenizers of all three transformer architectures under consideration.
The structured input sequence template for the thesaurus concept is as follows:
This template is a flat linear representation that can be fully tokenized by the tokenizers of the transformer models used and generated automatically from any Zthes-compatible XML export without manual data annotation. Special tokens (for example, [term], [SYN], [BT], and [NT]) are added to the model vocabulary at the fine-tuning step and initialized as the average of the embeddings of the subwords that make up the corresponding token.
When a Zthes element is absent, the corresponding structural token is retained and its content slot is filled with a [NONE] placeholder. Truncation removes values from the end of multi-valued fields, in reverse order of importance (RT first, followed by NT), and at least one value is always preserved for each non-empty field. Truncation was required for less than 3% of records (n = 15). The maximum input sequence length was 512 tokens; the 95th percentile of sequence length was 187 tokens and the 99th percentile was 412 tokens. While the [NONE] placeholder mechanism functions correctly at the data-representation level, the downstream impact of systematically incomplete entries on model performance has not been empirically quantified in the present study. A controlled missing-field robustness evaluation is identified as a priority for future work (see Section 6).
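The placeholder and truncation rules described above can be sketched in a few lines. The sketch rests on several assumptions that are not taken from the paper: the field order, the " ; " value separator, the dictionary keys, and the use of whitespace splitting as a stand-in for a subword tokenizer are all illustrative choices.

```python
from typing import Dict, List

# Assumed field order and token spellings (illustrative only).
FIELDS = [("[term]", "term"), ("[SYN]", "syn"), ("[DEF]", "def"),
          ("[REL]", "rt"), ("[BT]", "bt"), ("[NT]", "nt")]
# Values are dropped from RT first, then from NT.
TRUNCATION_ORDER = ["rt", "nt"]


def serialize(record: Dict[str, List[str]]) -> str:
    """Flatten a record; absent or empty fields keep their token plus [NONE]."""
    parts = []
    for token, key in FIELDS:
        values = record.get(key) or ["[NONE]"]
        parts.append(f"{token} {' ; '.join(values)}")
    return " ".join(parts)


def count_tokens(text: str) -> int:
    """Whitespace count, used here in place of a real subword tokenizer."""
    return len(text.split())


def serialize_with_budget(record: Dict[str, List[str]], max_tokens: int = 512) -> str:
    """Drop trailing RT, then NT, values until the sequence fits the budget.

    At least one value is kept in every non-empty field, so the result can
    still exceed the budget when nothing more may be removed.
    """
    record = {key: list(values) for key, values in record.items()}
    for key in TRUNCATION_ORDER:
        while (count_tokens(serialize(record)) > max_tokens
               and len(record.get(key, [])) > 1):
            record[key].pop()  # remove from the end of the multi-valued field
    return serialize(record)


if __name__ == "__main__":
    example = {"term": ["BMP-2"], "bt": ["infantry fighting vehicle"],
               "rt": ["a", "b", "c"]}
    print(serialize_with_budget(example, max_tokens=16))
```

In a real pipeline the token count would come from the tokenizer of the target model, since the number of subwords per value differs between KazBERT, mBERT and XLM-RoBERTa.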
The ablation experiments show that the complete representation template consistently outperforms partial variants of the input sequence on all evaluated tasks. The ablations were conducted to quantify the contribution of each Zthes element to the structured input representation, using the XLM-RoBERTa model on the concept linking task (see Table 1) as the reference configuration. Starting from the term-only variant (F1 = 0.80), which corresponds to the baseline fine-tuning condition, adding synonymy information (SYN) yielded a moderate improvement to F1 = 0.81, which confirms that surface-form variability is a stable but limited source of improvement in representation quality. Adding the definition context (DEF) raised F1 to 0.82, reflecting the contribution of semantic scope to concept disambiguation. Further incorporation of associative links (REL) yielded no measurable improvement at this reporting precision, which indicates that non-hierarchical associative signals contribute only marginally to concept linking once definition context and synonymy information are present. The most pronounced improvement was observed with the addition of hierarchical signals: including the BT and NT tokens together increased F1 to 0.84, an absolute gain of 0.02 over the previous configuration and the largest contribution of any single step. This is consistent with the relation classification analysis, where adding the [BT] and [NT] tokens raises hyponymy F1 from 0.68 to 0.79 for KazBERT. The complete template, which combines all elements, reaches F1 = 0.84, confirming that the hierarchical metadata of the thesaurus encodes information that cannot be recovered from surface forms, synonyms, or definition text alone.
5. Experiments and Results
The experimental design distinguishes two levels of evaluation. At the system level, the full Thesaurus BERT-MIL pipeline (
Figure 1) employs BGE-M3 as a fixed bi-encoder for efficient FAISS-based candidate retrieval, and XLM-Roberta as a cross-encoder for final reranking. At the encoder level, the cross-encoder component is instantiated with three alternative backbones—KazBERT, mBERT, and XLM-Roberta—to assess how the choice of pre-trained architecture affects concept linking and relation classification performance. BGE-M3 is retained as the bi-encoder in all configurations and is not varied across experiments.
Three pre-trained transformer architectures are evaluated under two conditions: standard fine-tuning (basic ft), which uses only mention–concept pairs without structured thesaurus input, and thesaurus fine-tuning (ft + Thesaurus) within the full Thesaurus BERT-MIL pipeline. The base models are KazBERT, a BERT model pre-trained on Kazakh–Russian corpora; mBERT, Google's multilingual BERT checkpoint covering 104 languages; and XLM-RoBERTa base, a multilingual model covering 100 languages. The test set contains 250 mention–concept pairs drawn from 200 annotated operational documents (see Table 5). Table 5 reports the mean ± standard deviation across five runs. The significance of each thesaurus-enriched result against its baseline is tested using bootstrap resampling (N = 10,000; significance level α = 0.05) [43,44].
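A paired bootstrap test of the kind referred to above can be sketched as follows. The sketch assumes per-example correctness indicators (1 for a correctly linked mention, 0 otherwise) and compares accuracy rather than F1 for brevity; the function name and the synthetic inputs in the usage example are hypothetical and do not reproduce the reported results.

```python
import random
from typing import List


def paired_bootstrap_p(baseline: List[int], enriched: List[int],
                       n_resamples: int = 10_000, seed: int = 0) -> float:
    """Estimate how often the enriched system fails to beat the baseline.

    Both lists hold 0/1 correctness for the same test items in the same order.
    The returned fraction plays the role of a one-sided p-value.
    """
    if len(baseline) != len(enriched):
        raise ValueError("systems must be scored on the same test items")
    rng = random.Random(seed)
    n = len(baseline)
    not_better = 0
    for _ in range(n_resamples):
        # Resample test items with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(enriched[i] - baseline[i] for i in idx)
        if delta <= 0:
            not_better += 1
    return not_better / n_resamples


if __name__ == "__main__":
    # Synthetic correctness vectors for 250 items (not the paper's data).
    base = [1] * 150 + [0] * 100
    enr = [1] * 200 + [0] * 50
    p = paired_bootstrap_p(base, enr, n_resamples=2000)
    print("significant at alpha = 0.05:", p < 0.05)
```

Resampling items rather than scores keeps each mention paired with both system outputs, which is what makes the comparison sensitive to consistent small gains.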
It is noteworthy that the absolute gain from thesaurus enrichment is stable across all three architectures (+0.04 F1, +0.05 Top-1 accuracy). We interpret this pattern as evidence that the relational information encoded in the thesaurus (BT/NT/RT) forms a task-specific signal that is orthogonal to overall model capacity: since none of the three pre-trained models had access to the terminological hierarchy during pre-training, the structured input template provides approximately the same additional information regardless of encoder size. This conclusion is consistent with the theoretical justification of knowledge injection approaches, which holds that relational domain knowledge cannot be approximated by distributional pre-training and therefore provides a stable, architecture-neutral improvement when injected directly into the input representation.
Table 5 presents the complete results of the concept linking assessment.
Figure 2 visualizes precision, recall and F1-measure across all configurations. Thesaurus enrichment consistently improves all metrics for all three architectures.
The best configuration, XLM-RoBERTa with thesaurus enrichment, reaches F1 = 0.84 and Top-5 = 0.94. KazBERT shows the largest absolute increase (+0.04 F1), which is attributed to its limited pre-training vocabulary and, accordingly, the greater information value of the structured thesaurus context.
Figure 3 presents Top-1 and Top-5 accuracy for concept retrieval. All thesaurus-enriched models exceed a Top-5 accuracy of 0.90, and XLM-RoBERTa reaches 0.94.
The gap between Top-1 and Top-5 (~0.14–0.17 for all models) reflects the genuine terminological ambiguity of the test set and motivates the candidate reranking step in the production pipeline.
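For reference, Top-k accuracy as used here can be computed as the share of mentions whose gold concept appears among the first k ranked candidates. The sketch below uses toy identifiers; the function name and data are illustrative only.

```python
from typing import List, Sequence


def top_k_accuracy(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Share of queries whose gold concept is within the first k candidates."""
    if len(ranked) != len(gold):
        raise ValueError("one candidate list is required per gold label")
    hits = sum(1 for cands, g in zip(ranked, gold) if g in cands[:k])
    return hits / len(gold)


if __name__ == "__main__":
    # Toy rankings for three mentions that all refer to concept "a".
    ranked: List[List[str]] = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
    gold = ["a", "a", "a"]
    for k in (1, 2, 3):
        print(k, top_k_accuracy(ranked, gold, k))
```

Because the candidate set for k = 5 contains the candidate for k = 1, Top-5 can never fall below Top-1; the size of the gap indicates how often the correct concept is retrieved but not ranked first.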
Table 6 and
Figure 4 present the results of the relationship classification.
Table 7 reveals several patterns. First, the best single-objective configuration reaches F1 = 0.78; adding a second objective improves F1 by +0.03, and the full three-task combination yields a further +0.03. Second, for relation classification, a single objective alone achieves a Macro-F1 of 0.79, and the three-task model outperforms it by +0.05. Third, every combination of two tasks outperforms the best single-task alternative, which supports the hypothesis of cross-task regularization.
Table 8 shows a clear performance gradient across relation types. Synonym search achieves the highest F1 score (0.91). BT search (F1 = 0.85) benefits substantially from the explicit [BT] token, and definition-based queries (F1 = 0.86) are supported by the [DEF] field. The two most difficult categories are NT search (F1 = 0.77) and RT search (F1 = 0.72). NT errors are predominantly NT–NT confusions, which occur when several hyponyms share a single parent BT concept and similar definition text (e.g., BMP-2 and BMP-3 as NTs of “infantry fighting vehicle”). RT errors are frequently confusions between RT and NT, since operationally related concepts may be located in adjacent subtrees of the hierarchy. These findings underscore the need for hard-negative mining strategies that explicitly target confusable pairs within a subtree.
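One simple way to target such confusable pairs is to treat siblings, that is, concepts sharing the same broader term, as hard negatives for each other. The sketch below illustrates this idea only; it is not a strategy evaluated in the paper, and the mapping layout, the function name and the T-72 entry are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def sibling_hard_negatives(broader: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each concept to the other concepts that share its broader term."""
    children = defaultdict(list)
    for child, parent in broader.items():
        children[parent].append(child)
    return {
        child: sorted(c for c in children[parent] if c != child)
        for child, parent in broader.items()
    }


if __name__ == "__main__":
    # child -> BT; the BMP entries follow the example above, T-72 is illustrative.
    broader = {
        "BMP-2": "infantry fighting vehicle",
        "BMP-3": "infantry fighting vehicle",
        "T-72": "main battle tank",
    }
    print(sibling_hard_negatives(broader))
```

Siblings are natural hard negatives because they share a parent and often near-identical definition text, which matches the NT–NT confusion pattern described above.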
The most pronounced improvement from thesaurus context injection is observed in the distinction between hyponymic (NT) and hypernymic (BT) relationships, whose direction the baseline models often confuse. Adding the [BT] and [NT] tokens to the input representation raises hyponymy F1 from 0.68 to 0.79 for KazBERT. This confirms that the hierarchical metadata of the thesaurus carries information that cannot be inferred from the surface forms of terms.
Figure 5 presents a heat map of the absolute gain from thesaurus enrichment for all models and tasks. The gain is remarkably homogeneous (+0.03–0.05 in absolute terms), which indicates that the benefit of structured knowledge injection is robust and independent of the specific task. The largest single gain (+0.05) is observed in Top-1 accuracy of hypernym prediction for all three models.
All three models were fine-tuned on thesaurus-derived training pairs covering Russian, Kazakh, and English. For cross-lingual evaluation, XLM-RoBERTa and mBERT, both of which include Kazakh and English in their pre-training corpora, were additionally evaluated on Kazakh and English test mentions without any language-specific adaptation, which allows zero-shot cross-lingual transfer to be assessed. KazBERT, whose pre-training is limited to Kazakh and Russian, was evaluated only on Russian and Kazakh test mentions; the English score for this model was excluded because its pre-training does not support zero-shot transfer to English. The cross-lingual F1 results are presented separately in the Interlingual Generalization Analysis (see Figure 4). The narrow standard deviations (σ ≤ 0.011 across all metrics and configurations) indicate that the framework’s behavior is stable with respect to the choice of random seed. Bootstrap resampling is used to test the statistical significance of the improvements obtained through thesaurus enrichment at a significance level of α = 0.05.
Computational Efficiency
Although the primary focus of this study is retrieval quality, we report indicative efficiency figures for completeness. The bi-encoder FAISS retrieval stage operates over an offline-indexed corpus of ~4500 concept embeddings (d = 768) and completes in 2–4 ms per query on CPU, making it negligible in the pipeline. The cross-encoder XLM-RoBERTa reranking stage, which performs pairwise scoring of top-k = 300 candidates, requires approximately 1.8–2.2 s per query with single-sample inference on a V100 GPU. Through batch inference (batch size = 32) and fp16 precision, latency can be reduced to 150–250 ms per query, which is operationally acceptable for interactive terminology lookup. FAISS index construction for the full concept corpus takes under 30 s and needs only to be performed once per thesaurus update. These figures confirm that the framework is deployable within domain-specific NLP systems operating on moderate-scale terminological inventories. For larger inventories, approximate FAISS indices (e.g., IVF) and tighter candidate pruning (top-k = 50) are straightforward extensions. We leave a systematic latency–accuracy trade-off study to future work.
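The two-stage flow behind these figures, dense retrieval followed by batched reranking, can be sketched without any external library. The sketch is illustrative: a brute-force cosine search stands in for the FAISS index, `score_batch` is a hypothetical placeholder for the cross-encoder, and all names and toy vectors are assumptions.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Sequence[float],
             concept_vecs: Dict[str, Sequence[float]], top_k: int) -> List[str]:
    """Stage 1: brute-force cosine search (stand-in for the FAISS lookup)."""
    scored = sorted(((cosine(query_vec, vec), cid)
                     for cid, vec in concept_vecs.items()), reverse=True)
    return [cid for _, cid in scored[:top_k]]


def rerank(query: str, candidates: List[str],
           score_batch: Callable[[str, List[str]], List[float]],
           batch_size: int = 32) -> List[str]:
    """Stage 2: score candidates in batches and sort by descending score."""
    scores: List[float] = []
    for start in range(0, len(candidates), batch_size):
        scores.extend(score_batch(query, candidates[start:start + batch_size]))
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [cid for _, cid in ranked]


def num_batches(n_candidates: int, batch_size: int) -> int:
    """Number of scorer calls needed for one query."""
    return math.ceil(n_candidates / batch_size)


if __name__ == "__main__":
    vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
    shortlist = retrieve([1.0, 0.1], vecs, top_k=2)
    print(shortlist, num_batches(300, 32))
```

With 300 candidates and a batch size of 32, the reranker is called 10 times per query instead of 300, which is the main source of the latency reduction from batching; tightening the shortlist to 50 candidates would reduce this to 2 calls.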
6. Discussion
The claim that the proposed framework applies to any BERT-like architecture rests on three properties shared by all models of this family. First, the subword tokenizers of all BERT-like models support vocabulary extension, so structural tokens ([BT], [NT], [REL], etc.) can be added without altering the pretrained weights; only new embedding rows are appended. Second, all BERT-like encoders produce a [CLS] representation that serves as a sequence-level embedding; the framework relies solely on this representation for both training (through the three loss functions) and inference (through cosine similarity in retrieval), which makes it independent of model depth or width. Third, the loss functions are standard differentiable criteria (contrastive, margin, and cross-entropy) applicable to the embeddings produced by any encoder. This architectural neutrality is consistent with results reported for similar embedding-level injection frameworks: SapBERT [25] and SimCSE [42] have been successfully applied to several architecturally different BERT-like encoders. The stability of the improvement delta observed in our experiments on three architecturally different models (KazBERT, mBERT, and XLM-RoBERTa) provides direct empirical support for this claim.
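The first property can be illustrated with a toy model of vocabulary extension. The sketch below is not the authors' code and uses no real tokenizer library: the vocabulary is a plain dictionary, the embedding matrix is a list of rows, and the function name, the initialization scale (0.02), and the example entries are assumptions made for illustration. The point it shows is that new structural tokens only append rows, while existing rows stay untouched.

```python
import random


def extend_vocab(vocab, embeddings, new_tokens, dim, seed=0):
    """Append new tokens and matching embedding rows; existing rows are not modified."""
    rng = random.Random(seed)
    for token in new_tokens:
        if token in vocab:
            continue  # token already known, nothing to add
        vocab[token] = len(vocab)
        # Hypothetical initialization: small Gaussian noise for the new row.
        embeddings.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return vocab, embeddings


# Toy vocabulary and 2-dimensional embedding matrix (invented values).
vocab = {"[CLS]": 0, "[SEP]": 1, "tank": 2}
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
original_rows = [row[:] for row in embeddings]

extend_vocab(vocab, embeddings, ["[BT]", "[NT]", "[REL]"], dim=2)

print("vocabulary size:", len(vocab))
print("pretrained rows unchanged:", embeddings[:3] == original_rows)
```

The new rows are trained during fine-tuning, whereas the pretrained rows only change through ordinary gradient updates.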
Multi-task learning, which combines the objectives of synonym alignment, hierarchical relation prediction, and relation classification, is more effective than single-task fine-tuning on each task individually.
The advantage of multi-task learning stems from cross-task regularization: joint training on synonym alignment and hierarchical relation prediction prevents synonym embeddings from collapsing in ways that violate the hierarchical order, while the relation classification criterion forces the encoder to maintain a discriminative geometry between BT, NT, and RT pairs. The present study compared three settings: single-task fine-tuning for synonym alignment only, single-task fine-tuning for relation classification only, and full multi-task fine-tuning with the combined criteria. Multi-task fine-tuning consistently outperforms both single-task variants in concept linking F1 and relation classification Macro-F1 across all three model architectures. The gain of about +0.05–0.07 F1 over the best single-task variant indicates that simultaneous optimization prevents the encoder from adopting representations that are optimal for one criterion but suboptimal for another, yielding a geometrically more consistent embedding space that supports both retrieval and classification. Such mutual gains are well documented in the multi-task NLP literature [44,45,46].
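To make the combined objective concrete, the following sketch writes the three criteria named in the text (contrastive, margin, and cross-entropy) as scalar functions and sums them with weights. It is a minimal illustration under stated assumptions, not the authors' implementation: the assignment of each criterion to a task, the temperature, the margin, and the equal weights are all hypothetical choices.

```python
import math


def _log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def contrastive_loss(pos_sim, neg_sims, temperature=0.05):
    """InfoNCE-style loss: the synonym pair should outscore the negatives."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    return _log_sum_exp(logits) - logits[0]


def margin_loss(pos_sim, neg_sim, margin=0.2):
    """Hinge loss: a true hierarchical pair should beat a corrupted pair by `margin`."""
    return max(0.0, margin - pos_sim + neg_sim)


def cross_entropy(logits, label):
    """Softmax cross-entropy for relation classification."""
    return _log_sum_exp(logits) - logits[label]


def multitask_loss(l_syn, l_hier, l_rel, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three task losses (equal weights are an assumption)."""
    return weights[0] * l_syn + weights[1] * l_hier + weights[2] * l_rel


# Invented similarity scores and logits for a single training example.
l_syn = contrastive_loss(pos_sim=0.8, neg_sims=[0.3, 0.1])
l_hier = margin_loss(pos_sim=0.6, neg_sim=0.5)
l_rel = cross_entropy(logits=[2.0, 0.5, -1.0], label=0)
print("combined loss:", multitask_loss(l_syn, l_hier, l_rel))
```

Because all three terms are computed from the same encoder output, a gradient step that improves one term at the expense of another is penalized by the sum, which is the regularization effect described above.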
The per-class performance gradient reported in Table 8, SYN (F1 = 0.91) > DEF (0.86) > BT (0.85) > NT (0.77) > RT (0.72), merits explicit interpretation. This ordering is not a sign of framework inadequacy; it reflects the intrinsic semantic properties of the respective relation types. Synonym relations are directly and exhaustively specified in the thesaurus UF records, and the synonym alignment loss trains the model specifically to align their embeddings, producing high absolute performance. NT errors arise principally from co-hyponym confusion (multiple narrower terms sharing the same parent BT node and similar definitional context), a problem that would require explicit hard-negative mining within subtrees to reduce further. RT errors reflect the fundamental semantic underspecification of associative relations under ISO 25964; RT links are expert-asserted and non-hierarchical, and operationally related concepts frequently occur in adjacent subtrees, making them genuinely difficult to distinguish from NT links using surface and hierarchical cues alone. Crucially, the framework’s central claim is evaluated on the improvement relative to standard fine-tuning baselines, not on absolute per-class F1. Thesaurus enrichment improves NT F1 from 0.68 to 0.79 (KazBERT) and yields consistent gains for RT across all architectures, confirming that the typed relational signal provides meaningful information even for the hardest relation category. We identify subtree-aware hard-negative mining as the most promising avenue to further close the SYN–RT performance gap.
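One possible reading of subtree-aware hard-negative mining is sketched below: for a given narrower term, the co-hyponyms under the same BT parent are selected as hard negatives. This is a hypothetical illustration of the proposed future direction, not a method evaluated in this study; the function name, the data layout, and the example concepts are invented.

```python
def cohyponym_negatives(anchor, parent_of, children_of, max_negatives=3):
    """Return siblings of `anchor` (same BT parent) to serve as hard negatives."""
    parent = parent_of.get(anchor)
    if parent is None:
        return []  # top-level concept: no siblings to confuse it with
    siblings = [c for c in children_of.get(parent, []) if c != anchor]
    return sorted(siblings)[:max_negatives]  # sorted for reproducibility


# Invented toy hierarchy: one BT node with three narrower terms.
children_of = {
    "armored vehicle": [
        "tank",
        "infantry fighting vehicle",
        "armored personnel carrier",
    ],
}
parent_of = {
    child: parent for parent, kids in children_of.items() for child in kids
}

print(cohyponym_negatives("tank", parent_of, children_of))
```

Training against such siblings targets exactly the pairs that share a parent and a similar definitional context, which is where the NT errors described above concentrate.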
Despite these results, the study has a number of limitations. First, the test set was built from a single institutional thesaurus and therefore may not fully reflect the terminological diversity of the various branches of the military. Second, the current version of the thesaurus has relatively limited coverage of rapidly developing areas such as electronic warfare, cyber operations, and space systems, whose terminology is still evolving. Third, standard precision and recall metrics computed against a fixed gold standard do not fully reflect the practical value of the system in semantic search tasks, where the relevance of results is often graded. In future work, it would be advisable to develop an evaluation protocol based on graded relevance that involves expert assessment of search results.
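As an indication of what a graded-relevance protocol could compute, the sketch below implements normalized discounted cumulative gain (nDCG), a standard metric for graded judgments. It was not used in this study; the 0–3 relevance scale and the example judgments are assumptions made for illustration.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2 rank discount and linear gains."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_gains, k):
    """nDCG@k, normalized by the ideal ordering of the judged results."""
    ideal_dcg = dcg(sorted(ranked_gains, reverse=True)[:k])
    if ideal_dcg == 0:
        return 0.0  # no relevant results were judged
    return dcg(ranked_gains[:k]) / ideal_dcg


# Hypothetical expert judgments (0 = irrelevant, 3 = exact concept) for one query,
# listed in the order in which the system returned the results.
system_ranking = [2, 3, 0, 1]
print(f"nDCG@4 = {ndcg_at_k(system_ranking, 4):.3f}")
```

Unlike binary precision and recall, this score rewards a system for placing a closely related concept above an unrelated one even when neither is the exact gold concept.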
Fourth, the training dataset, although comprising 18,400 examples, was generated automatically from a single thesaurus of approximately 500 concepts. The relatively small number of distinct concepts means that the model’s generalization to novel, out-of-thesaurus terminology has not been empirically tested. While the cross-lingual evaluation provides partial evidence of generalization, extension to new concept types would require retraining or incremental fine-tuning. Fifth, the framework assumes that the input thesaurus is correctly structured and validated according to the Zthes schema. In practice, real-world terminological resources may contain inconsistencies in BT/NT assignments or redundant RT links, and the sensitivity of the model’s relation classification performance to such annotation noise has not been evaluated. Sixth, the reranking step relies on a fixed candidate set size (top-k = 300 from the bi-encoder). The sensitivity of final performance to the choice of k—and whether this threshold remains appropriate as the thesaurus scales—was not systematically studied. Finally, although the framework is described as applicable to any BERT-like architecture, the empirical evaluation is limited to three models (KazBERT, mBERT, and XLM-RoBERTa). Larger-scale multilingual models such as XLM-RoBERTa-large or mDeBERTa, which may offer higher representational capacity, were not evaluated due to computational constraints.
In addition, the framework’s [NONE] placeholder mechanism for absent Zthes fields was validated only descriptively (fewer than 3% of records required truncation); no controlled experiment was conducted to assess how concept linking or relation classification performance degrades as a function of field completeness. In practice, newly created or partially curated thesaurus entries may lack ScopeNotes, associative links, or narrower terms. Future work should include a systematic ablation in which fields are progressively masked at inference time (e.g., removing RT, then NT + RT, then all fields except Term and SYN) to characterize the model’s degradation profile and identify which missing fields are most detrimental to each task.
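The masking schedule proposed above can be sketched as follows. The serialization format is an assumption: the tokens [BT], [NT], [REL], and [NONE] appear in the text, whereas [SYN], [DEF], the mapping of RT to [REL], the field order, and the example record are hypothetical and serve only to show how fields would be replaced by the placeholder at inference time.

```python
# Structural token per field; [SYN] and [DEF] are hypothetical names.
FIELD_TOKENS = {
    "SYN": "[SYN]",
    "DEF": "[DEF]",
    "BT": "[BT]",
    "NT": "[NT]",
    "RT": "[REL]",
}

# Progressive masking: none, RT, NT + RT, everything except Term and SYN.
MASKING_SCHEDULE = [(), ("RT",), ("NT", "RT"), ("DEF", "BT", "NT", "RT")]


def serialize(record, masked=()):
    """Flatten a thesaurus record; masked or empty fields become [NONE]."""
    parts = [record["Term"]]
    for field, token in FIELD_TOKENS.items():
        values = [] if field in masked else record.get(field, [])
        parts.append(f"{token} " + ("; ".join(values) if values else "[NONE]"))
    return " ".join(parts)


# Invented example record.
record = {
    "Term": "tank",
    "SYN": ["main battle tank"],
    "DEF": ["tracked armored combat vehicle"],
    "BT": ["armored vehicle"],
    "NT": [],
    "RT": ["anti-tank weapon"],
}

for masked in MASKING_SCHEDULE:
    print(masked, "->", serialize(record, masked))
```

Evaluating the same test queries against each masked serialization would show how quickly linking and classification quality fall as fields disappear.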
In a broader perspective, the proposed methodology can be applied to any structured terminological resource that complies with the international standard ISO 25964, which formalizes thesaurus organization and is compatible with the Zthes format. Potential applications also include the terminology databases that NATO member states use to implement STANAG standardization agreements. In such systems, thesauri play an important role in ensuring doctrinal and information interoperability.
In addition, the proposed framework can be extended to terminology management in multinational coalition contexts, where aligning concepts between different institutional thesauri remains a persistent semantic interoperability problem. Integrating structured terminological resources with neural language models makes it possible to build semantic search and analysis systems that take into account both distributed text representations and formalized domain knowledge.
7. Conclusions
This paper presents a framework for integrating the structured knowledge of a military thesaurus into pre-trained transformer language models in order to address concept linking and semantic search over the terminology of the Ground Forces. The framework systematically maps the elements of a Zthes thesaurus into text representations compatible with transformer input, uses multi-task fine-tuning with three complementary learning objectives, and includes a dense retrieval pipeline based on a FAISS index for efficient concept linking at inference time.
Experiments on a test set built from the terminological database of the Ground Forces of the Armed Forces of the Republic of Kazakhstan demonstrate consistent improvements over standard fine-tuning baselines. The best configuration, XLM-RoBERTa with full thesaurus enrichment, reaches F1 = 0.84 and Top-5 = 0.94 in concept linking and Macro-F1 = 0.84 in relation classification. The figures presented in Section 5 further confirm the consistent and architecture-independent nature of the improvements achieved through the integration of thesaurus knowledge.
The scientific novelty of the research lies in three main contributions. First, a systematic mapping of the elements of the Zthes standard into representations suitable for transformer models is proposed, which makes the approach applicable to any terminological resource that conforms to ISO 25964. Second, one of the first empirical evaluations of transformer-based concept linking and semantic search for Ground Forces military terminology in a trilingual environment (Russian, Kazakh, and English) was performed. Third, a reproducible multi-task learning protocol was developed that outperforms single-task baselines on all metrics considered without requiring modifications to the architecture of the base model.
Promising directions for further research include extending the thesaurus to rapidly developing areas of military technology, developing a protocol for evaluating semantic search based on graded relevance, and studying continual learning methods that allow the concept embedding space to be updated incrementally as the institutional thesaurus evolves. In a broader context, the proposed approach opens up opportunities for integrating knowledge organization systems with neural language models in semantic search, terminology management, and cross-lingual interoperability in specialized subject areas.