Article

An Analysis of the Training Data Impact for Domain-Adapted Tokenizer Performances—The Case of Serbian Legal Domain Adaptation

by Miloš Bogdanović *, Milena Frtunić Gligorijević, Jelena Kocić and Leonid Stoimenov
Faculty of Electronic Engineering, University of Nis, 18000 Nis, Serbia
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(13), 7491; https://doi.org/10.3390/app15137491
Submission received: 29 May 2025 / Revised: 27 June 2025 / Accepted: 1 July 2025 / Published: 3 July 2025

Abstract

Various areas of natural language processing (NLP) have greatly benefited from the development of large language models in recent years. This research addresses the challenge of developing efficient tokenizers for transformer-based domain-specific language models. Tokenization efficiency within transformer-based models is directly related to model efficiency, which motivated the research we present in this paper. Our goal in this research was to demonstrate that the appropriate selection of data used for tokenizer training has a significant impact on tokenizer performance. We will further demonstrate that efficient tokenizers and models can be developed even if language resources are limited. To do so, we will present a domain-adapted large language model tokenizer developed for masked language modeling of the Serbian legal domain. In this paper, we will compare the tokenization performance of the domain-adapted tokenizer in version 2 of the SrBERTa language model we developed against the performance of five other tokenizers belonging to state-of-the-art multilingual, Slavic or Serbian-specific models—XLM-RoBERTa (base-sized), BERTić, Jerteh-81, SrBERTa v1, and NER4Legal_SRB. The comparison is performed using a test dataset consisting of 275,660 samples of legal texts written in the Cyrillic alphabet, gathered from the Official Gazette of the Republic of Serbia. This dataset contains 197,134 distinct words, while the overall word count is 5,265,352. We will show that our tokenizer, trained upon a domain-adapted dataset, outperforms the presented tokenizers by at least 4.5% and up to 54.62% with respect to the number of tokens generated for the whole test dataset. In terms of tokenizer fertility, we will show that our tokenizer outperforms the compared tokenizers by at least 6.39% and up to 56.8%.

1. Introduction

Recent advances in artificial intelligence and machine learning have produced a number of models and solutions that rely heavily on text. Most of the technologies and products we use today require vast amounts of data to be stored in different types of documents, both during and after their creation. Large language models (LLMs) are possibly the most visible examples of such models to a wider audience. Machines can now comprehend natural languages thanks to architectures like BERT [1], RoBERTa [2], GPT [3], and RWKV [4]. These models allow for specialized processing, which is necessary for a range of natural language processing tasks.
Word embeddings predate transformer architectures, whose creation signaled the start of a new phase in the evolution of language models. One of the earliest neural language models (NLMs), resembling n-gram models, was created by Bengio et al. [5]. Ref. [6] then used NLMs for machine translation with success. NLMs became much more popular after Mikolov [7] released RNNLM, an open-source NLM toolkit. Machine translation, text generation, and text categorization [8] are just a few of the natural language applications that have made extensive use of NLMs based on recurrent neural networks (RNNs) and their variants, such as long short-term memory (LSTM) [9] and gated recurrent unit (GRU) [10]. With the addition of a self-attention mechanism on top of the transformer architecture, BERT [1] marked the next important step in the development of language models. Due to its performance in pre-training tasks on large unlabelled corpora, BERT influenced the development of a number of models and architectures, including GPT-2 [11] and BART [12].
The BERT architecture introduced masked language modeling as its central task. The objective of the masked language modeling task is to predict a masked token inside a sequence of tokens that the model processes bidirectionally. Therefore, when an in-depth contextual understanding of a sequence is required, masked language modeling is the appropriate option. Large language models acquire contextual representations in an unsupervised manner by minimizing a masked language modeling objective across a base corpus. The model training process in most cases includes two phases—pre-training and fine-tuning. Pre-training refers to the phase of unsupervised language model training. The fine-tuning procedure involves training a pre-trained model on a specific downstream task using labeled data for supervised tasks, following the replacement of the output head with a lightweight classifier.
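As a minimal illustration of the masked language modeling objective described above, the sketch below runs mask prediction through the HuggingFace fill-mask pipeline. The checkpoint name is the SrBERTa repository listed in the Data Availability Statement, and the Cyrillic example sentence is an illustrative assumption rather than a sample from the test dataset.

```python
# A minimal sketch of masked token prediction, assuming the publicly listed
# SrBERTa checkpoint; any RoBERTa-style masked language model works the same way.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="JelenaTosic/SRBerta")

# The model predicts the token hidden behind <mask> from bidirectional context.
for prediction in fill_mask("Уговор се закључује у писаној <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))
```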
Masked language modeling (MLM) models are frequently used in recent language analysis approaches to assist in learning natural language features. However, domain-specific texts, such as those using formal terminology, present a unique training problem for these networks. Since the pre-training phase usually relies on a substantial amount of textual material, often retrieved from the internet, such a corpus does not adequately represent domain-specific terms, such as legal terms. Consequently, the text used in the language model pre-training phase frequently excludes domain-specific terms from the vocabulary building process. The model therefore becomes partially unable to recommend phrases that are not part of its vocabulary, i.e., that are not directly representable by the tokenizer, due to the nature of the training data and the previously described steps.
The tokenizer, which transforms unprocessed text into a sequence of tokens the model can understand, is one of the essential components supporting language models. Tokenization techniques are essential to LLM performance and comprehension [13] because they directly affect how well models capture context and semantics. In transformer-based models, tokenization efficiency and model efficiency are directly correlated [14]. Additionally, tokenizer performance may be significantly impacted by the choice of data used for training. In this research, we address the problem of creating effective tokenizers for transformer-based domain-specific language models to illustrate the magnitude of this influence, along with the possibilities it offers.
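To make the effect concrete, the hedged sketch below tokenizes the same short legal sentence with several of the tokenizers compared later in the paper. The repository names follow the Data Availability Statement, while the sentence itself is an illustrative assumption.

```python
# A sketch comparing how many tokens different tokenizers need for the same
# Cyrillic legal sentence; fewer tokens generally indicates better domain fit.
from transformers import AutoTokenizer

repositories = {
    "SrBERTa v2": "JelenaTosic/SRBerta",
    "XLM-RoBERTa": "FacebookAI/xlm-roberta-base",
    "BERTić": "classla/bcms-bertic",
}

text = "Министарство доноси решење у управном поступку."  # illustrative sentence

for name, repo in repositories.items():
    tokenizer = AutoTokenizer.from_pretrained(repo)
    tokens = tokenizer.tokenize(text)   # subword pieces, special tokens excluded
    print(f"{name}: {len(tokens)} tokens")
```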
For research purposes, the domain of the Serbian language was selected. In terms of digital infrastructure and digitized language materials, Serbian is one of the languages with few well-organized resources, although it has a large amount of written content. Because of this, developing a high-quality language model for the Serbian language is quite challenging. Furthermore, a significant challenge is the development of a model that can accurately represent specialized domains within a given language, such as the formal legal domain. Therefore, our goal is to demonstrate that efficient tokenizers and models can be developed even if language resources are limited.
In this paper, the performance of five tokenizers from state-of-the-art multilingual, Slavic, or Serbian-specific models—XLM-RoBERTa [15] (base-sized model), BERTić [16], Jerteh-81 [17], SrBERTa v1 [18], and NER4Legal_SRB [19]—will be compared with the tokenization performance of a domain-adapted tokenizer in version 2 of the SrBERTa language model that we developed. A test dataset including 275,660 samples of legal texts in the Cyrillic alphabet collected from the Official Gazette of the Republic of Serbia is used for the comparison. We will demonstrate that our tokenizer, which was trained on a domain-adapted dataset, reduces the total amount of tokens created and performs better in terms of tokenizer fertility compared to other tokenizers.

2. Related Work

2.1. Language Models—From Large to Small and Back

The number of parameters in language models today varies from a few hundred million to several hundred billion. The 11-billion-parameter Flan-T5 was created especially for instruction tuning [20]. With 11 billion parameters, CodeGen is an autoregressive language model for writing code [21]. The CodeGen model family is trained sequentially on three datasets: THEPILE, BIGQUERY, and BIGPYTHON. By applying MTF to the pre-trained multilingual BLOOM and mT5 model families, fine-tuned variants called BLOOMZ and mT0 were produced [22]. PanGu-α is a large-scale autoregressive language model with up to 200 billion parameters [23]. Whether working in few-shot or zero-shot settings, PanGu-α outperforms rival models on a range of tasks. The LLaMA model [24], which contains 65 billion parameters, has attracted a lot of interest from the scientific community due to its open-source nature. This model was modified to include features like instruction following that are comparable to those of ChatGPT (https://openai.com/index/chatgpt/, accessed on 30 June 2025). Falcon-40B was introduced by the Technology Innovation Institute (TII) [25]. A total of 1000 billion RefinedWeb tokens, enhanced with well-chosen corpora, were used to train Falcon-40B.
Small language models (SLMs) are becoming increasingly popular among researchers and industry groups. Notably, since the end of 2023, the number of SLM models has grown dramatically. SLMs primarily use four types of attention mechanisms: Multi-Head Attention (MHA), Multi-Query Attention (MQA), Group-Query Attention (GQA), and Multi-Head Latent Attention. The most common self-attention mechanism in transformer models is Multi-Head Attention, which uses many attention heads to allow the model to focus on different input data segments at the same time. A variation of multi-head attention, known as Group-Query Attention, aims to preserve some degree of diversity in the attention mechanism while employing fewer key-value heads: groups of query heads share key and value representations, thereby reducing processing complexity. Multi-Head Latent Attention outperforms MHA by using low-rank key-value joint compression.
Due to their capabilities and traits, transformer versions such as RWKV [4] and Mamba [26] are currently gaining popularity and are highly anticipated. The most well-known models in this category are Meta OPT [27], Meta Galactica [28], BigScience Bloom [29], Cerebras Cerebras-GPT [30], Microsoft Phi-2 [31], Microsoft Phi-3-mini [32], TinyLlama TinyLlama [33], Meituan MobileLlama [34], Alibaba Qwen 2.5 [35], Google Gemma [36], Apple OpenELM [37], HuggingFace SmolLM [38], DataBricks Dolly-v2 [39], H2O Danube3 [40].

2.2. Domain-Adapted Tokenizers and Tokenization Efficiency Metrics

The legal domain has distinct characteristics unique to its field. As stated by the authors of [41], “LLMs struggle to fully capture the complexities of legal language” and “may fumble with the specialized terminology and citation formats or outright hallucinate domain-specific knowledge, losing the rigor and precision essential in legal contexts”. Moreover, the authors of [42] state that models trained on generic domain corpora “should not be directly applied to a specific domain corpus, as the distributional representation (embeddings) of their lexical units may significantly shift from the nuances and peculiarities expressed in domain-specific texts”. The authors further note that this statement “certainly holds for the legal domain as well, where language understanding is particularly challenging”. The authors of [43] also conclude that domain adaptation is crucial for task-specific models to learn to capture the nuances of legal language and context. Legal language is distinct in vocabulary, semantics, and reasoning [44,45,46]. As in other languages, Serbian legal and legislative texts contain a number of specifics unique to the field and a large number of rare, domain-specific terms.
A significant effort has been put into finding ways to use task data and specialized corpora to improve the pre-training of language models for particular domains, in addition to the substantial amount of work that goes into creating models. Multilingual models are often considered a means to transfer both domain and syntactic knowledge [47], but their capabilities should be further investigated due to peculiar aspects of languages such as gender agreement, relative clauses, and passive transformations. The legal domain stands out as one capable of exposing these aspects. Therefore, significant effort has been put into building models from the ground up for various legal domain tasks [43].
One advantage of building the model from the ground up with in-domain data rather than employing continuous pretraining has been noted by Gu et al. [48]. Their results show that domain-specific language is successfully included in the tokenizer’s vocabulary. The meanings of these specialized terms can therefore be directly transferred through their fixed embeddings, negating the need for the language model to represent them through contextual embeddings of their subwords.
In [49], a different technique for domain adaptation of language models was used. The authors showed notable performance improvements on three IT-related tasks: extractive reading comprehension, document ranking, and duplicate question detection. Their approach is based on adding complete words that are common in the target domain but absent from a language model tokenizer. It was applied incrementally to an already pre-trained RoBERTa Large language model architecture.
Eight classification tasks and four domains (news, reviews, and articles in computer science and biomedicine) are included in the research presented by Gururangan et al. [50]. Domain-adaptive pretraining, a second iteration of in-domain pretraining, has been demonstrated to enhance performance in both high- and low-resource circumstances. Additionally, research has shown that using basic data selection strategies together with task corpus adaptation is a helpful backup, especially when domain-adaptive pretraining resources are unavailable. Furthermore, by altering their tokenizers, Sachidananda et al. [49] propose an alternative approach for integrating pre-trained language models into new domains. They claim that by examining the conditional token distributions of the base and domain-specific datasets, domain-specific sub-word sequences can be successfully identified.
Tokenizers have been evaluated using a variety of measures and techniques. The average number of tokens generated for each word is measured by a metric called subword fertility [51]. A tokenizer’s fertility score should ideally be 1.0, meaning that most words are represented as a single token; a higher value indicates that many words are divided into several tokens [52]. The percentage of words that are separated into multiple tokens is represented by the proportion of continued words [52], another often-used measure. For this measure, zero is the ideal value, indicating that fewer words are split—a smaller percentage is better [52]. Another metric for evaluating tokenizers is the Normalized Sequence Length (NSL), which measures the average length of the tokenized sequences produced by a tokenizer relative to a reference baseline tokenizer [51].
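For concreteness, the sketch below computes the latter two of these metrics—the proportion of continued words and NSL—for an arbitrary HuggingFace tokenizer; fertility is formalized in Section 3.4. Whitespace word splitting and per-example averaging of NSL are our simplifying assumptions.

```python
# A sketch of two tokenizer evaluation metrics described above, assuming
# whitespace-delimited words; `texts` is any list of raw example strings.
from transformers import AutoTokenizer

def proportion_of_continued_words(tokenizer, texts):
    split_words, total_words = 0, 0
    for text in texts:
        for word in text.split():
            total_words += 1
            if len(tokenizer.tokenize(word)) > 1:  # word broken into subwords
                split_words += 1
    return split_words / total_words               # ideal value: 0.0

def normalized_sequence_length(tokenizer, baseline_tokenizer, texts):
    ratios = [
        len(tokenizer.tokenize(t)) / len(baseline_tokenizer.tokenize(t))
        for t in texts
    ]
    return sum(ratios) / len(ratios)               # < 1.0 means shorter sequences

# Example usage with two of the compared tokenizers (repository names as listed
# in the Data Availability Statement):
# srberta = AutoTokenizer.from_pretrained("JelenaTosic/SRBerta")
# xlmr = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")
# print(normalized_sequence_length(srberta, xlmr, texts))
```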

3. Tokenizer Performance—Models, Dataset, and Performance Measure Insights

Regarding digital infrastructure and digitized linguistic resources, Serbian is among the languages with few well-structured materials, although it possesses a substantial volume of written literature. Consequently, creating an efficient tokenizer for a model that can accurately describe specialized domains within the Serbian language, such as the formal legal domain, is a difficult task. To demonstrate the efficiency of the domain-adapted tokenizer in version 2 of the SrBERTa language model that we developed, baseline tokenizers were selected from publicly available state-of-the-art multilingual, Slavic, or Serbian-specific models. At the time this research was conducted, more than 400 models supporting Slavic languages were available on the HuggingFace platform. However, the number of models built with Serbian as the primary language is significantly lower. Baseline models were selected using the following criteria: the architecture (BERT, RoBERTa), the size of the architecture and vocabulary (comparable to the 50,265 tokens within the SrBERTa v2 vocabulary), the primary language used for model development (Serbian), domain adaptation (the legal domain, where possible), and the overall usage of the model reported on HuggingFace.

3.1. State-of-the-Art Multilingual, Slavic or Serbian-Specific Models

Developed primarily to understand the formal language found in Serbian legislation, SrBERTa v1 [18] is a language model built on RoBERTa that can be used for different language modeling applications. Initially, the OSCAR dataset, which included 645,747 texts or roughly 150 million words (with a total size of more than 2 GB when stored), was used to train the model and its tokenizer. The tokenizer was built using the ByteLevelBPETokenizer [53] tokenization technique and its implementation from the HuggingFace library [54], producing a vocabulary of 30,522 tokens.
The second iteration of this model was developed to improve its masked language modeling capabilities. Two distinct datasets, the OSCAR dataset and a dataset devoted exclusively to legal texts, were utilized to train SrBERTa v2 [55]. The OSCAR 23.01 dataset, which required 7.7 GB of storage space and contained 1,677,896 documents with a total word count of 632,781,822, was used for this training. A total of 1.6 GB of legal documents from the Republic of Serbia’s Legal Information System made up the second dataset. Special attention was given to tokenizer development in terms of the structure of the dataset used for tokenizer training. The RoBERTa network and the tokenization methods were used as ready-made implementations provided by the HuggingFace community, all based on the transformers library, version 4.17.0.
The tokenizer for SrBERTa v2 was created using an equal mixture of legislative texts and texts taken from the OSCAR 23.01 dataset to ensure that more legal phrases are processed as entire words rather than subwords. The adaptation strategy we have chosen belongs to a group of explicit data selection methods, which may introduce domain-specific biases. Therefore, our future work envisions research regarding the optimal balance for data selection. Further, each legislative text was preprocessed to minimize data loss due to truncation, with the maximum input sequence size set to 512 tokens. This meant that all texts were divided into smaller units of 512 tokens and then concatenated using the newline character. This preprocessing ensured the generation of inputs that can be optimally utilized by the network. The SrBERTa v2 tokenizer has a vocabulary containing 50,265 tokens; it used a minimum frequency of 2 and the following special tokens: “<s>”, “</s>”, “<pad>”, “<unk>”, and “<mask>”. Using an NVIDIA Quadro RTX 5000 GPU, along with the AdamW optimizer, the SrBERTa tokenizer and network training took place across 45 epochs and 31 days in total.
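A minimal sketch of this tokenizer training step is shown below, using the HuggingFace tokenizers library with the hyperparameters reported above (50,265-token vocabulary, minimum frequency 2, RoBERTa special tokens). The corpus file path is a hypothetical placeholder for the 50/50 OSCAR and legal mixture, not the authors' actual corpus.

```python
# A sketch of byte-level BPE tokenizer training under the hyperparameters stated
# above; the input file name is a placeholder, not the authors' actual corpus.
import os
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["oscar_and_legal_mixture.txt"],   # hypothetical 50/50 mixed corpus
    vocab_size=50_265,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt, which a RoBERTa-style model can load.
os.makedirs("srberta_v2_tokenizer", exist_ok=True)
tokenizer.save_model("srberta_v2_tokenizer")
```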
Eight billion tokens of crawled text from the Croatian, Bosnian, Serbian, and Montenegrin web domains were used to pre-train the transformer model BERTić [16]. The tasks of named entity identification, part-of-speech tagging, geolocation prediction, and commonsense causal reasoning were used to assess this model. The authors have decided to train transformer models using the Electra technique [56]. Further, the authors created a WordPiece vocabulary with a 32,000 token vocabulary size, much like BERT. The HuggingFace tokenizers library was used to train a WordPiece model on a random sample of 10 million paragraphs from the entire dataset.
Jerteh-81 [17] is one of the best and most widely used large language models for the Serbian language. It is based on RoBERTa-base architecture with 81 million parameters. Jerteh-81 was trained on publicly available datasets which, all together, contain over 4 billion tokens. When published, Jerteh-81 demonstrated high-quality results on semantic text similarity tasks. Jerteh-81 vocabulary contains 50,265 tokens.
XLM-RoBERTa [15] is one of the first models to prove that pretraining multilingual language models at scale leads to significant performance gains. This model is a transformer-based masked language model trained on one hundred languages, using more than two terabytes of filtered CommonCrawl data. As shown in [15], XLM-RoBERTa outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks and performs particularly well on low-resource languages, which is the reason it has been selected for this research. Model capabilities were achieved through the trade-offs between positive transfer and capacity dilution and the performance of high and low resource languages at scale. The XLM-RoBERTa vocabulary contains 250,002 tokens.
NER4Legal_SRB [19] is an LLM-based solution for Named Entity Recognition (NER) in legal documents written in the Serbian language. It leverages a pre-trained BERT model, carefully adapted to the specific task of identifying and classifying specific data points in textual content. The model was trained on a novel dataset including court rulings and achieved a mean F1 score of 0.96 in cross-validation tests using a manually labeled dataset. Additionally, the authors presented results on examples of intentionally modified text inputs to confirm the model’s robustness. NER4Legal_SRB uses a WordPiece vocabulary with a 32,000 token vocabulary size.

3.2. Dataset and Performance Measure Insights

For the purpose of this research, we have used 275,660 examples from Cyrillic legal texts gathered from the Official Gazette of the Republic of Serbia. The examples contained 197,134 distinct words and 5,265,352 words in total. Further, the selected examples are of different lengths, with the longest example containing 2750 words. The distribution of the number of examples by the number of words in the examples is presented in Figure 1 and Figure 2. Figure 1 presents an overall distribution of the number of examples by the number of words in the examples. It can be noticed that the number of examples decreases with the increase in the number of words within the examples.
Given that shorter examples occurred most frequently, the distribution of examples up to 100 words long from Figure 1 is highlighted in Figure 2 in order to better show the distribution of the most numerous examples. The overview presented in Figure 2 covers 98.8% of the whole test dataset. From the presented distributions, it can be noticed that the most numerous examples contain two words; other example lengths occur less frequently, and the groups of examples containing 95 words or more each contain fewer than 100 examples. Overall, the average number of words per example is 19.10, while the median value is 13.
The analysis of the tokenizer’s performance was performed using the presented dataset, through the following measures:
  • Number of tokens used to represent a sequence,
  • Fertility measure, and
  • Number of tokens in relation to the word frequency measure.

3.3. Number of Tokens Used to Represent a Sequence

The number of tokens required to represent a sequence is a basic metric that can serve for tokenizer comparison. A more efficient tokenizer can be considered to be one that uses fewer tokens to represent a sequence. Furthermore, the median and average numbers of tokens used over a test dataset can be indicators of tokenizer efficiency and can therefore be used for tokenizer comparison. Moreover, a plot of the number of used tokens in relation to the number of words over the whole dataset, for multiple tokenizers, indicates how different tokenizers perform on different sequence lengths and is therefore suitable for comparing efficiency across sequence lengths.

3.4. Fertility Measure

The fertility measure refers to the number of tokens needed to encode a sequence. Its value decreases as more words in the sequence are represented by a single token rather than being split into subword parts. Consequently, the more efficient the tokenizer is, the lower the fertility measure value is. The fertility measure FM for a specific example can be formally calculated using the following definition:
Definition 1:
Fertility measure $FM$ for an example $E$ is equal to:
$$FM(E) = \frac{\mathrm{len}(\mathrm{tokenize}(E))}{\mathrm{num\_words}(E)}$$
where $\mathrm{tokenize}$ returns a list of tokens representing a sequence $E$, $\mathrm{len}$ returns the number of elements, and $\mathrm{num\_words}$ returns the number of words within the sequence $E$.
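Definition 1 maps directly to code; the short sketch below uses whitespace splitting as a stand-in for num_words, which is our simplifying assumption.

```python
# Fertility measure FM(E) from Definition 1, for any HuggingFace-style tokenizer
# exposing a tokenize() method; words are approximated by whitespace splitting.
def fertility_measure(tokenizer, example: str) -> float:
    num_tokens = len(tokenizer.tokenize(example))   # len(tokenize(E))
    num_words = len(example.split())                # num_words(E)
    return num_tokens / num_words
```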

3.5. Number of Tokens in Relation to the Word Frequency Measure

To better understand word compression across tokenizers, we rely on a measure TFM that represents the number of tokens used per word in relation to the word’s frequency within the test examples. This measure is calculated as an average value of the token-per-word ratio over all words occurring with the same frequency, and the results are plotted for all available frequencies within the test dataset. The TFM measure can be formally calculated using the following definition:
Definition 2:
The $TFM$ measure is defined as:
$$TFM = \{\, TFM_f \mid f \in Fq \,\}$$
$$TFM_f = \mathrm{avg}(WF_f)$$
$$WF_f = \{\, \mathrm{len}(\mathrm{tokenize}(w)) \mid w \in words,\ \mathrm{frequency}(w) = f \,\}$$
where $Fq$ is the set of all available word frequencies, $TFM_f$ is the $TFM$ measure for a particular frequency $f$, $words$ is the set of all available words in the dataset, $\mathrm{tokenize}$ returns a list of tokens representing a word $w$, $\mathrm{len}$ returns the number of elements, $\mathrm{avg}$ returns the average value of all elements, and $\mathrm{frequency}(w)$ returns the frequency of word $w$.
As the legal domain is distinct in vocabulary, semantics, and reasoning [44,45,46], with a large number of domain-specific terms, the TFM measure can reveal a tokenizer’s suitability for downstream tasks like masked language modeling. For such tasks, representing a word with a single token is crucial for task success, and TFM captures this by showing the average number of tokens used for word representation based on how frequently the word appears in the text.
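A sketch of Definition 2 follows; word counting by whitespace splitting is again our simplifying assumption, and any tokenizer with a tokenize() method can be passed in.

```python
# TFM from Definition 2: average tokens per word for every word frequency f
# observed in the dataset; `examples` is a list of raw text strings.
from collections import Counter, defaultdict

def tfm(tokenizer, examples):
    frequencies = Counter(w for ex in examples for w in ex.split())   # frequency(w)
    per_frequency = defaultdict(list)
    for word, freq in frequencies.items():
        per_frequency[freq].append(len(tokenizer.tokenize(word)))     # WF_f
    # TFM_f = avg(WF_f) for each available frequency f in Fq
    return {f: sum(v) / len(v) for f, v in per_frequency.items()}
```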

4. Results and Discussion

All experiments were executed using a single server instance containing two 16-core CPUs, 384 GB of RAM, and an NVIDIA A40 graphics processing unit; depending on the tokenizer, the tokenization of all examples lasted from 46 s up to 1 min. The average number of tokens used for the tokenization of examples in relation to the number of words within examples, for all analyzed tokenizers, is presented in Figure 3 and Figure 4. Figure 3 contains results for the whole dataset, while Figure 4 contains a zoomed preview of the results where the number of words was lower than 500, in order to more clearly present comparison results for the shorter examples that are more numerous. From the presented results, it can be noticed that, on average, the SrBERTa v2 tokenizer generated fewer tokens than all other tokenizers, both for shorter and longer examples, while the BERTić and NER4Legal_SRB tokenizers obtained very similar results.
It was noticed that, on average over all examples, the SrBERTa v2 tokenizer generated 7.75% fewer tokens than the SrBERTa v1 tokenizer, 4.87% fewer than the Jerteh-81 tokenizer, and 4.5% fewer than the XLM-RoBERTa tokenizer. A comparative analysis of the SrBERTa v2 and BERTić tokenizers revealed that SrBERTa v2 produced, on average, 54.62% fewer tokens. A comparison between the SrBERTa v2 and NER4Legal_SRB tokenizers demonstrated the same result, with SrBERTa v2 generating 54.62% fewer tokens on average.
Further, a deeper analysis revealed that in 30.56% of examples, tokenizers SrBERTa v2 and SrBERTa v1 generated the same number of tokens, while for tokenizers SrBERTa v2 and Jerteh-81, this percentage was 29.41%. Moreover, in 6.49% of examples, tokenizers SrBERTa v2 and XLM-RoBERTa generated the same number of tokens. However, this percentage was 0.46% for both comparisons of SrBERTa v2 versus BERTić, and SrBERTa v2 versus NER4Legal_SRB tokenizers.
In the second part of the analysis, we compared the fertility measures for each of the analyzed results. The overall results, containing average and median fertility values based on all test examples, are presented in Table 1. From the presented results, it can be concluded that the tokenizer SrBERTa v2 obtained the best fertility results by having the lowest average and median values, followed by Jerteh-81 and XLM-RoBERTa tokenizers. BERTić and NER4Legal_SRB obtained the highest fertility results. More precisely, the SrBERTa v2 tokenizer, on average, outperformed the previous SrBERTa version and XLM-RoBERTa by 9.03%, Jerteh-81 by 6.39%, and BERTić and NER4Legal_SRB by 56.83%.
For a deeper fertility analysis, we have plotted the results in Figure 5 and Figure 6 to observe the fertility values for different numbers of words. The results presented in Figure 5 are overall results generated on the whole dataset, while the results presented in Figure 6 are a preview for all examples up to 100 words. From the presented results, it was noticed that the XLM-RoBERTa tokenizer has the lowest average fertility values for examples shorter than three words. However, for examples longer than three words, both the SrBERTa v2 and Jerteh-81 tokenizers obtained better results than the XLM-RoBERTa tokenizer. Further, it was also noticed that although the SrBERTa v2 tokenizer had the best average results for most sequence lengths, there were situations where Jerteh-81 obtained slightly better results.
Lastly, we analyzed the number of tokens in relation to word frequency. The words from the test examples were divided into frequency-based groups, with group 1 containing words with frequencies from 1 to 1000, group 2 containing words with frequencies from 1001 to 2000, and so on, in successive intervals of 1000. The obtained results were averaged by group and are presented in Figure 7. From the presented results, it can be concluded that for words with lower frequency, the XLM-RoBERTa tokenizer on average used fewer tokens for word representation than the other tokenizers. Further, for lower-frequency words, the SrBERTa v2 and Jerteh-81 tokenizers obtained similar average results, which were better than the results of the SrBERTa v1, BERTić, and NER4Legal_SRB tokenizers. For words with higher frequency, all tokenizers had similar average performance, with occasional spikes for the Jerteh-81, BERTić, and NER4Legal_SRB tokenizers.
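The grouping step described above can be sketched as follows; the input is assumed to be the per-frequency TFM dictionary from the previous sketch, and the bucket width of 1000 matches the grouping used for Figure 7.

```python
# Averaging per-frequency token counts into groups of width 1000
# (group 1 = frequencies 1–1000, group 2 = 1001–2000, and so on).
from collections import defaultdict

def group_by_frequency(tfm_values, width=1000):
    """tfm_values: dict mapping word frequency -> average tokens per word."""
    buckets = defaultdict(list)
    for freq, avg_tokens in tfm_values.items():
        group = (freq - 1) // width + 1          # 1-based group index
        buckets[group].append(avg_tokens)
    return {g: sum(v) / len(v) for g, v in sorted(buckets.items())}
```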
From the presented results, it can be concluded that the SrBERTa v2 tokenizer outperformed the SrBERTa v1 tokenizer in every performed analysis, showing that the applied tokenizer adaptation has a great impact on tokenizer efficiency for a specific domain. Regarding the other tokenizers, it should be noted that Jerteh-81 and XLM-RoBERTa share the same architecture size as SrBERTa v2 but are trained on larger datasets. Within the observed set, the Jerteh-81 tokenizer and model were trained using the highest quality datasets for the Serbian language, while XLM-RoBERTa used the most extensive datasets and has a larger vocabulary (~250 k tokens). SrBERTa v2 stands out in that it is the only one that uses a data mixture designed to enhance the domain impact in the tokenizer training phase. Despite the advantages that Jerteh-81 and XLM-RoBERTa have over the SrBERTa v2 model in the pre-training phase, the SrBERTa v2 tokenizer outperforms these models in the average number of generated tokens per sequence. These results confirm the starting premise of the research—it is possible to create a domain-efficient tokenizer even in the case of smaller or lower-quality language resources. For sequences consisting predominantly of general terms, however, Jerteh-81 and XLM-RoBERTa were expected to show slightly better performance.
A deeper analysis of the results reveals tokenizer efficiency to be sequence length dependent. As example lengths increase, so does the probability of occurrence of domain-specific terms. In such cases, domain-specific terms emerge and the quality of the SrBERTa v2 tokenizer becomes dominant. A similar conclusion can be made for the Jerteh-81 tokenizer, given its pre-training dataset. Consequently, the quality of the domain-adapted tokenizer is more evident when observed for longer sequences, which is consistent with the nature of the testing domain—legal text.
Low-frequency words are more efficiently represented by the XLM-RoBERTa tokenizer, which is a direct consequence of its vocabulary size. A larger vocabulary allows the XLM-RoBERTa tokenizer to have a larger number of building elements used to efficiently tokenize out-of-distribution words, i.e., words that were not used during the training of the tokenizer. As word frequency increases, so does the influence of domain adaptation, making SrBERTa v2 equivalent in terms of tokenizer performance.
Although developed for a similar group of languages, BERTić and NER4Legal_SRB demonstrated somewhat lower efficiency in the test. The vocabulary of the BERTić and NER4Legal_SRB models is smaller (32,000 tokens), but the size of the architecture is comparable to the Jerteh-81 and SrBERTa v2 models in terms of the number of parameters (~110 M), which directly indicates that the efficiency of the model depends on the quality of the tokenizer.
NER4Legal_SRB was created starting from the BERTić model through a fine-tuning process; for this reason, the results are very similar. In this specific case, the experiment showed that the fine-tuning did not affect the vocabulary structure and that the base model is sensitive to the Cyrillic alphabet, i.e., it generates more tokens for text written in Cyrillic. This is a direct consequence of the BERTić and NER4Legal_SRB tokenizer training process, resulting in lower tokenizer performance on the presented test dataset. The NER4Legal_SRB tokenizer can be seen as less efficient than the base BERTić model, given that NER4Legal_SRB is domain-oriented towards the legal domain written in Cyrillic script.

5. Conclusions

Models trained using vast, diverse corpora produce representations that perform well on a variety of tasks when used with datasets of different sizes collected from heterogeneous sources. However, recent studies show that the corpus domain—that is, a distribution over language that characterizes a certain topic or genre—remains significant for model efficiency. The research we discuss in this article was motivated by our belief that tokenization efficiency and model efficiency are closely related.
Within this paper, we have analyzed the efficiency of the SrBERTa v2 tokenizer, trained on a mixture of general and legal data, in relation to the SrBERTa v1, Jerteh-81, BERTić, XLM-RoBERTa, and NER4Legal_SRB tokenizers—all tokenizers belonging to state-of-the-art multilingual, Slavic, or Serbian-specific models. The analysis we have presented in this paper confirms our assumptions—both the domain of the training data and the quality of the training data have a significant impact on tokenizer performance, and excellent results can be achieved even with limited resources if the resources are appropriately selected. This is verified by contrasting the SrBERTa v2 tokenizer’s performance with that of the other tokenizers developed for models trained on broad Slavic and/or Serbian texts. In contrast to the other tokenizers, the SrBERTa v2 tokenizer was created with a 50% to 50% ratio of legislation and internet-derived texts. Conversely, the largest and best-quality datasets were used to train Jerteh-81, while XLM-RoBERTa used the most extensive datasets and has a larger vocabulary (~250 k tokens). In the case of a masked language modeling task, these facts have demonstrated their impact on tokenizer efficiency.
As previously shown, the SrBERTa v2 tokenizer, trained upon a domain-adapted dataset, outperforms the presented tokenizers by at least 4.5% and up to 54.62% regarding the number of tokens generated for the whole test dataset. In terms of tokenizer fertility, this tokenizer outperforms the compared tokenizers by at least 6.39% and up to 56.8%. It should be noted that the SrBERTa v2 tokenizer could benefit from a vocabulary size increase. As evidence, the XLM-RoBERTa tokenizer outperforms all other tokenizers in the case of low-frequency words, which is a direct consequence of its vocabulary size. Further, the usage of the Cyrillic script had a negative impact on the BERTić and NER4Legal_SRB tokenizer performances, which confirms the significance of appropriate data selection for domain-adapted models.
Within this research, our focus was on the performance analysis of the domain-adapted tokenizer created with a 50% to 50% ratio of legislation and internet-derived texts. In future research, we plan to analyze and compare the performance of domain-adapted tokenizers trained on different mixtures of datasets and to expand the current measures with paired t-tests on the tokenizers created in this way. Further, we aim to analyze the impact of the domain-adapted text selection used for tokenizer training on the overall performance of domain-adapted tokenizers.

Author Contributions

M.B.: Conceptualization, Methodology, Software, Writing; M.F.G.: Software, Data Curation, Methodology, Writing; J.K.: Software; L.S.: Supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by the Ministry of Science, Technological Development and Innovation of the Republic of Serbia [grant number 451-03-137/2025-03/200102].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The SrBERTa model presented in this study is openly available at https://huggingface.co/JelenaTosic/SRBerta (accessed on 21 Jun 2025). Corpus of Legislation texts of Republic of Serbia 1.0 used within the study is available at https://www.clarin.si/repository/xmlui/handle/11356/1754 (accessed on 21 Jun 2025). Documents used during the evaluation can be found at https://pravno-informacioni-sistem.rs/reg-overview (accessed on 21 Jun 2025). All other models can be found on the following Hugging Face repositories: https://huggingface.co/FacebookAI/xlm-roberta-base (accessed on 21 Jun 2025), https://huggingface.co/jerteh/Jerteh-81 (accessed on 21 Jun 2025), https://huggingface.co/classla/bcms-bertic (accessed on 21 Jun 2025), https://huggingface.co/kalusev/NER4Legal_SRB (accessed on 21 Jun 2025).

Acknowledgments

The authors would like to thank the Ministry of Science, Technological Development and Innovation of the Republic of Serbia for funding this research.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Cedarville, OH, USA, 2019; Volume 1, pp. 4171–4186. [Google Scholar]
  2. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  3. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Vancouver, BC, Canada, 6–12 December 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020. ISBN 9781713829546. [Google Scholar]
  4. Peng, B.; Alcaide, E.; Anthony, Q.; Albalak, A.; Arcadinho, S.; Biderman, S.; Cao, H.; Cheng, X.; Chung, M.; Grella, M.; et al. Rwkv: Reinventing rnns for the transformer era. arXiv 2023, arXiv:2305.13048. [Google Scholar]
  5. Bengio, Y.; Ducharme, R.; Vincent, P. A neural probabilistic language model. In Advances in Neural Information Processing Systems, Proceedings of the 2000 Neural Information Processing Systems (NIPS) Conference, Denver, CO, USA, 2000; The MIT Press: Cambridge, MA, USA, 2000. ISBN 9780262526517. [Google Scholar]
  6. Schwenk, H.; Dechelotte, D.; Gauvain, J.-L. Continuous space language models for statistical machine translation. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, Sydney, Australia, 17–21 July 2006; pp. 723–730. [Google Scholar]
  7. Mikolov, T.; Deoras, A.; Povey, D.; Burget, L.; Černocky, J. Strategies for training large scale neural network language models. In Proceedings of the 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, Waikoloa, HI, USA, 11–15 December 2011; pp. 196–201. [Google Scholar]
  8. Minaee, S.; Kalchbrenner, N.; Cambria, E.; Nikzad, N.; Chenaghlu, M.; Gao, J. Deep learning–based text classification: A comprehensive review. ACM Comput. Surv. (CSUR) 2021, 54, 1–40. [Google Scholar] [CrossRef]
  9. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS’14); MIT Press: Cambridge, MA, USA, 2014; Volume 2, pp. 3104–3112. [Google Scholar]
  10. Cho, K.; Van Merrienboer, B.; Bahdanau, D.; Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. arXiv 2014, arXiv:1409.1259. [Google Scholar]
  11. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models are Unsupervised Multitask Learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  12. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, 5–10 July 2020; pp. 7871–7880. [Google Scholar]
  13. Goldman, O.; Caciularu, A.; Eyal, M.; Cao, K.; Szpektor, I.; Tsarfaty, R. Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance. arXiv 2024. [Google Scholar] [CrossRef]
  14. Kong, Z.; Li, Y.; Zeng, F.; Xin, L.; Messica, S.; Lin, X.; Zhao, P.; Kellis, M.; Tang, H.; Zitnik, M. Token Reduction Should Go Beyond Efficiency in Generative Models—From Vision, Language to Multimodality. arXiv 2025. [Google Scholar] [CrossRef]
  15. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 8440–8451. [Google Scholar]
  16. Ljubešić, N.; Lauc, D. BERTić—The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian. In Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing, Online, 20 April 2021; Association for Computational Linguistics: Kiyv, Ukraine, 2021; pp. 37–42. [Google Scholar]
  17. Škorić, M. Novi jezički modeli za srpski jezik. arXiv 2024. (In Serbian) [Google Scholar] [CrossRef]
  18. Bogdanović, M.; Kocić, J.; Stoimenov, L. SRBerta—A Transformer Language Model for Serbian Cyrillic Legal Texts. Information 2024, 15, 74. [Google Scholar] [CrossRef]
  19. Kalušev, V.; Brkljač, B. Named entity recognition for Serbian legal documents: Design, methodology and dataset development. arXiv 2025. [Google Scholar] [CrossRef]
  20. Chung, H.W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, E.; Wang, X.; Dehghani, M.; Brahma, S.; et al. Scaling instruction-finetuned language models. arXiv 2022, arXiv:2210.11416. [Google Scholar]
  21. Nijkamp, E.; Pang, B.; Hayashi, H.; Tu, L.; Wang, H.; Zhou, Y.; Savarese, S.; Xiong, C. Codegen: An open large language model for code with multi-turn program synthesis. arXiv 2022, arXiv:2203.13474. [Google Scholar]
  22. Muennighoff, N.; Wang, T.; Sutawika, L.; Roberts, A.; Biderman, S.; Scao, T.L.; Bari, M.S.; Shen, S.; Yong, Z.X.; Schoelkopf, H.; et al. Crosslingual generalization through multitask finetuning. arXiv 2022, arXiv:2211.01786. [Google Scholar]
  23. Zeng, W.; Ren, X.; Su, T.; Wang, H.; Liao, Y.; Wang, Z.; Jiang, X.; Yang, Z.; Wang, K.; Zhang, X.; et al. Pangu-α: Large-scale autoregressive pretrained chinese language models with auto parallel computation. arXiv 2021, arXiv:2104.12369. [Google Scholar]
  24. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.; Lacroix, T.; Roziere, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2022, arXiv:2302.13971. [Google Scholar]
  25. Almazrouei, E.; Alobeidli, H.; Alshamsi, A.; Cappelli, A.; Cojocaru, R.; Debbah, M.; Goffinet, E.; Heslow, D.; Launay, J.; Malartic, Q.; et al. Falcon-40B: An Open Large Language Model with State-of-the-Art Performance. 2023. Available online: https://huggingface.co/tiiuae/falcon-40b (accessed on 21 May 2025).
  26. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar]
  27. Facebook. facebook/opt-125m. Available online: https://huggingface.co/facebook/opt-125m (accessed on 25 May 2022).
  28. Facebook. facebook/galactica-125m. Available online: https://huggingface.co/facebook/galactica-125m (accessed on 22 November 2022).
  29. BigScience. bigscience/bloom-560m. Available online: https://huggingface.co/bigscience/bloom-560m (accessed on 22 November 2022).
  30. Bisk, Y.; Zellers, R.; Le Bras, R.; Gao, J.; Choi, Y. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020. [Google Scholar]
  31. Microsoft. microsoft/phi-2. Available online: https://huggingface.co/microsoft/phi-2 (accessed on 10 December 2023).
  32. Microsoft. microsoft/phi-3-mini. Available online: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct (accessed on 25 June 2025).
  33. Tinyllama. Available online: https://huggingface.co/tinyllama (accessed on 10 December 2023).
  34. Meituan. Mobilellama. Available online: https://huggingface.co/mtgv/MobileLLaMA-1.4B-Base (accessed on 25 June 2025).
  35. Alibaba. Qwen 2.5. Available online: https://qwenlm.github.io/blog/qwen2.5/ (accessed on 14 September 2024).
  36. Google. Gemma. Available online: https://huggingface.co/google/gemma-3-4b-it (accessed on 25 June 2025).
  37. Apple. Openelm. Available online: https://huggingface.co/apple/OpenELM (accessed on 28 April 2024).
  38. HuggingFace. Smollm. Available online: https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct (accessed on 25 June 2025).
  39. DataBricks. databricks/dolly-v2-3b. Available online: https://huggingface.co/databricks/dolly-v2-3b (accessed on 28 April 2023).
  40. H2O.ai. h2o-danube3-4b-base. Available online: https://huggingface.co/h2oai/h2o-danube3-4b-base (accessed on 10 January 2025).
  41. Padiu, B.; Iacob, R.; Rebedea, T.; Dascalu, M. To What Extent Have LLMs Reshaped the Legal Domain So Far? A Scoping Literature Review. Information 2024, 15, 662. [Google Scholar] [CrossRef]
  42. Tagarelli, A.; Simeri, A. LamBERTa: Law Article Mining Based on Bert Architecture for the Italian Civil Code. In IRCDL; Springer: Cham, Switzerland, 2022. [Google Scholar]
  43. Siino, M.; Falco, M.; Croce, D.; Rosso, P. Exploring LLMs Applications in Law: A Literature Review on Current Legal NLP Approaches. IEEE Access 2025, 13, 18253–18276. [Google Scholar] [CrossRef]
  44. Mellinkoff, D. The Language of the Law; Wipf and Stock Publishers: Eugene, OR, USA, 2004. [Google Scholar]
  45. Mertz, E. The Language of Law School: Learning to “Think Like a Lawyer”; Oxford University Press: Cary, NC, USA, 2007. [Google Scholar]
  46. Tiersma, P.M. Legal Language; University of Chicago Press: Chicago, IL, USA, 1999. [Google Scholar]
  47. Guarasci, R.; Silvestri, S.; De Pietro, G.; Fujita, H.; Esposito, M. BERT syntactic transfer: A computational experiment on Italian, French and English languages. Comput. Speech Lang. 2022, 71, 101261. [Google Scholar] [CrossRef]
  48. Gu, Y.; Tinn, R.; Cheng, H.; Lucas, M.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; Poon, H. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. ACM Trans. Comput. Healthc. 2021, 3, 1–23. [Google Scholar] [CrossRef]
  49. Sachidananda, V.; Kessler, J.; Lai, Y.A. Efficient Domain Adaptation of Language Models via Adaptive Tokenization. In Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing, Virtual, 10 November 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 155–165. [Google Scholar]
  50. Gururangan, S.; Marasović, A.; Swayamdipta, S.; Lo, K.; Beltagy, I.; Downey, D.; Smith, N.A. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Virtual, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 8342–8360. [Google Scholar]
  51. Dagan, G.; Synnaeve, G.; Rozière, B. Getting the most out of your tokenizer for pre-training and domain adaptation. arXiv 2024, arXiv:2402.01035. [Google Scholar] [CrossRef]
  52. Occiglot. EU Tokenizer Performance. Available online: https://occiglot.eu/posts/eu_tokenizer_perfomance/ (accessed on 22 May 2025).
  53. Byte-Pair Encoding tokenization. Available online: https://huggingface.co/learn/llm-course/en/chapter6/5 (accessed on 26 May 2025).
  54. Summary of the Tokenizers. Available online: https://huggingface.co/docs/transformers/tokenizer_summary#byte-pairencoding (accessed on 26 May 2025).
  55. Bogdanović, M.; Frtunić Gligorijević, M.; Kocić, J.; Stoimenov, L. Improving Text Recognition Accuracy for Serbian Legal Documents Using BERT. Appl. Sci. 2025, 15, 615. [Google Scholar] [CrossRef]
  56. Clark, K.; Luong, M.T.; Le, Q.V.; Manning, C.D. Electra: Pre-training text encoders as discriminators rather than generators. arXiv 2020, arXiv:2003.10555. [Google Scholar]
Figure 1. The overall distribution of the number of examples by the number of words in the examples.
Figure 2. The distribution of the number of examples by the number of words in the examples up to 100 words.
Figure 3. The average number of tokens used for tokenization of examples in relation to the number of words within examples.
Figure 4. The average number of tokens used for tokenization of examples in relation to the number of words within examples up to 500 words.
Figure 5. Average fertility in relation to the number of words in the examples.
Figure 6. Average fertility in relation to the number of words for the examples up to 100 words.
Figure 7. Number of tokens in relation to the word frequency.
Table 1. Fertility results.

Tokenizer        Average   Median
SrBERTa v1       1.77      1.53
SrBERTa v2       1.61      1.40
Jerteh-81        1.72      1.44
XLM-RoBERTa      1.77      1.67
BERTić           3.73      3.65
NER4Legal_SRB    3.73      3.65
