Article

Extreme Multi-Label Text Classification for Less-Represented Languages and Low-Resource Environments: Advances and Lessons Learned

1 Department of Knowledge Technologies, Jožef Stefan Institute, Jamova Cesta 39, 1000 Ljubljana, Slovenia
2 Jožef Stefan International Postgraduate School, Jamova Cesta 39, 1000 Ljubljana, Slovenia
3 Faculty of Computer and Information Science, University of Ljubljana, Večna pot 113, 1000 Ljubljana, Slovenia
4 School of Electronic Engineering and Computer Science, Queen Mary University of London, London E1 4NS, UK
* Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2025, 7(4), 142; https://doi.org/10.3390/make7040142
Submission received: 9 October 2025 / Revised: 1 November 2025 / Accepted: 7 November 2025 / Published: 11 November 2025
(This article belongs to the Topic Applications of NLP, AI, and ML in Software Engineering)

Abstract

Amid ongoing efforts to develop extremely large, multimodal models, there is increasing interest in efficient Small Language Models (SLMs) that can operate without reliance on large data-centre infrastructure. However, recent SLMs (e.g., LLaMA or Phi) with up to three billion parameters are predominantly trained in high-resource languages, such as English, which limits their applicability to industries that require robust NLP solutions for less-represented languages and low-resource settings, particularly those requiring low latency and adaptability to evolving label spaces. This paper examines a retrieval-based approach to multi-label text classification (MLC) for a media monitoring dataset, with a particular focus on less-represented languages, such as Slovene. This dataset presents an extreme MLC challenge, with instances labelled using up to twelve thousand categories. The proposed method, which combines retrieval with computationally efficient prediction, effectively addresses challenges related to multilinguality, resource constraints, and frequent label changes. We adopt a model-agnostic approach that does not rely on a specific model architecture or language selection. Our results demonstrate that techniques from the extreme multi-label text classification (XMC) domain outperform traditional Transformer-based encoder models, particularly in handling dynamic label spaces without requiring continuous fine-tuning. Additionally, we highlight the effectiveness of this approach in scenarios involving rare labels, where baseline models struggle with generalisation.

Graphical Abstract

1. Introduction

In today’s fast-paced world, media monitoring and analysis require near real-time capabilities to deliver timely and relevant information. This process involves categorising articles by content, attaching metadata tags, and synthesising information from multiple news sources [1,2,3]. Advanced Natural Language Processing (NLP) techniques are well-suited for these tasks, enabling the development of personalised news recommendation systems, automated content summarisation, and effective organisation and prioritisation of news items.
However, the operational reality of media monitoring—processing very large daily article volumes under strict latency and cost budgets—renders general-purpose LLMs ill-suited for sustained production use. High per-token inference costs, accelerator dependence, and third-party API constraints (including privacy, availability, and rate limits) conflict with industry requirements for low-latency, on-premises, and cost-predictable pipelines. These constraints are amplified for less-represented languages, where LLM coverage and training data are limited, and for extreme multi-label settings with frequent label changes, where continuous fine-tuning is infeasible. This motivates our focus on efficient, retrieval-based MLC that operates on consumer-grade hardware while maintaining robustness across multilingual and low-resource contexts.
Transformer-based models have become the dominant NLP approach, evolving from encoder architectures, such as BERT [4], to generative text-to-text models, such as T5 [5], GPT [6], and LLaMA [7], excelling in tasks such as text generation and classification [8,9]. Current LLM trends follow two paths: large-scale foundation models with high computational costs [10,11,12] and smaller, efficient generative models (SLMs) designed for resource-limited settings [13,14]. Although SLMs are a promising option for constrained environments [15], their effectiveness in real-world, data-intensive, multi-label classification is limited [16], particularly when handling multiple languages and a large label space. More specifically, the available pre-trained SLMs still lack pre-training in target South Slavic languages and require techniques such as pruning, knowledge distillation [17,18], or quantisation [19] to run on consumer-grade hardware, often facing licensing and latency challenges in deployment. Most often, generative LLMs for text classification or information retrieval (IR) are used in the context of synthetic data generation [16,20].
This paper introduces an approach for developing a classification system tailored to news articles, capable of accommodating longer text lengths, supporting multiple languages, and adapting to near-real-time label changes. To address these challenges, we constructed and down-sampled a large-scale dataset, NewsMon, obtained from a targeted Slovenian media monitoring industry archive. The original dataset, with over one million data points and more than ten thousand labels categorising client concepts under the term “client topics”, was adopted as the basis for developing our classification framework. We formulated the problem as a multi-label classification task, where client-defined “topics” are treated as labels.
Given the inherent complexity of the dataset and its large, long-tailed label space, our methodology draws upon research in Extreme Multi-Label Classification (XMC). We formulated our task within the XMC framework, as the classification problem exhibits poor scalability with an increasing number of labels, making traditional one-versus-all methods largely infeasible in this setting [21]. In particular, we built upon the Retrieval-Augmented Encoders for XMC (RAE-XMC) framework [22] to develop our solution. This framework employs a single-encoder architecture, where a single encoder is used to embed both the training examples and the label descriptions.
In this work, we address the following research questions on the task of client custom topic classification:
RQ1
Can we overcome the traditional one-versus-all (OVA) Transformer Encoder (TE) classification baselines in a multilingual setting?
RQ2
How effective is the method without label description embeddings?
RQ3
Can pre-trained multilingual IR models be used effectively without fine-tuning?
RQ4
Can we improve the method by incorporating additional contrastive learning with hard negative mining?
In summary, this study contributes the following advances and lessons learned:
  • The study demonstrates that IR models can outperform end-to-end fine-tuned models on specific datasets when working with an extremely large number of labels, providing a computationally efficient approach to scaling to an arbitrary number of labels.
  • The experiments demonstrate that end-to-end fine-tuned models still lag significantly behind IR models on less-represented, non-English tasks, making IR models the more suitable choice for these tasks in the XMC setting.
  • When evaluating the modified RAE-XMC method on a novel and distinct dataset in a multilingual setting, we demonstrate that employing contrastive learning with precomputed hard negatives is beneficial.
  • Additionally, we demonstrate the effectiveness of the approach without relying on label space embeddings while also handling longer texts than those typically considered in traditional XMC tasks.
  • Finally, by comparing the performance of the modified RAE-XMC in less-represented, non-English languages, we demonstrate that the method can be easily adapted to multilingual settings, without significant loss in performance, as shown in the case of Slovene.
The key novelty of this work is the application of an information-retrieval-based approach to multilingual, multi-label text classification, specifically targeting low-resource languages and large or evolving label spaces. Unlike conventional end-to-end fine-tuned models, the proposed IR-based method performs competitively even without fine-tuning, making it highly adaptable to new or dynamic label sets with minimal overhead. In addition, it demonstrates strong scalability to large, dynamic label spaces, which is essential for real-time media monitoring applications. Advancing the field of multilingual large-scale text classification, this approach surpasses the XLM-R baseline on the NewsMonsl dataset across all metrics, showcasing superior performance for underrepresented languages while maintaining the robustness and low latency required for real-world systems.
The remainder of this paper is structured as follows. Section 2 reviews the related work, providing context for our study. Section 3 details the dataset, including its creation, characteristics, and comparison with a well-studied dataset. In Section 4, we introduce our methodology, covering embedding models and baselines, and describe our experiments in Section 5. Section 6 presents the experimental results, followed by a summary of the advances and lessons learned in Section 7 and some directions for future research in Section 8.

2. Related Work

This section reviews related work on text classification for longer texts and imbalanced datasets, concluding with XMC and IR approaches that address relevant challenges for our task.
Transformer-based models now dominate multi-label text classification [8]. Most approaches use these models to obtain dense document representations, a key component of MLC. A typical document classification model includes a document encoder for representation learning and a classifier for label prediction [23]. Despite some models achieving high performance without the use of pre-trained language models, as noted by [24,25], methods using language pre-training, such as those discussed by [26,27,28], consistently outperform those that do not.
Machine learning models, including Transformer-based architectures, cannot directly process raw text; instead, they require a numeric representation. This transformation is provided by a tokeniser, which segments text into smaller subword units, i.e., tokens, rather than whole words. Tokenisation handles rare or out-of-vocabulary words while constraining vocabulary size, enhancing computational efficiency and model generalisation [29,30].
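As a brief illustration (our example, using the Hugging Face tokeniser for XLM-RoBERTa, one of the models employed later; the sentence is hypothetical), subword tokenisation splits rare words into known pieces and determines how many tokens the model actually processes:

```python
from transformers import AutoTokenizer

# Illustrative example: the multilingual XLM-RoBERTa tokeniser splits a rare
# Slovene word into known subword pieces instead of treating it as unknown.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

text = "Elektromobilnost postaja pomembna tema."  # hypothetical example sentence
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print(tokens)     # subword pieces, e.g. ['▁Elektro', 'mobil', 'nost', ...]
print(len(ids))   # number of tokens the model will actually process
```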
A significant challenge with these pre-trained multilingual encoder models is the limit on the number of tokens they can process, which affects up to 50% of the instances in our dataset, as illustrated in Figure 1. This limitation originates from the O(n²) computational complexity of the Transformer model’s attention mechanism [31]. The effectiveness of various models that address long documents remains unclear due to the lack of standardised benchmarks and baselines [32]. Many efficient Transformer variants are compared only to BERT/RoBERTa, with limited evaluation against specific datasets [33,34]. In contrast, long-document classification models often compare only SOTA approaches without including BERT/RoBERTa baselines [35,36]. While [23] suggests that Longformer and Hierarchical Transformers improve long-text classification, other studies show that BERT with extended random-sentence blocks can outperform more advanced models, highlighting the dataset dependence of these results [32,37]. Notably, all long-text classification experiments used English-language pre-trained models and datasets.
Binary cross-entropy (BCE) loss [38] with sigmoid activation is a standard baseline for multi-label classification, where each label is treated as an independent binary prediction. However, BCE struggles with label imbalance and dependencies, prompting the use of alternative loss functions. Focal Loss [39] emphasises hard-to-classify instances, while Class-Balanced Loss (CB) [40] re-weights classes based on their effective instance count. Distribution-Balanced Loss [41] mitigates label co-occurrence redundancy and applies Negative Tolerant Regularisation (NTR) to down-weight easy negative labels. CB-NTR Loss [42] integrates CB and NTR for improved balance, and [27] introduces a regularisation term based on label co-occurrence and output activations.
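To make the contrast concrete, the sketch below (our illustration, not code from the cited works) shows a minimal multi-label focal loss in PyTorch; with gamma set to 0 and alpha to 1, it reduces to the standard BCE-with-sigmoid baseline:

```python
import torch
import torch.nn.functional as F

def multilabel_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Minimal multi-label focal loss sketch.

    logits:  (batch, num_labels) raw scores
    targets: (batch, num_labels) 0/1 ground-truth matrix
    With gamma=0 and alpha=1 this reduces to plain BCE with sigmoid.
    """
    # Per-label binary cross-entropy, kept per element so it can be re-weighted.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    # p_t is the probability assigned to the true class of each label.
    p_t = targets * p + (1 - targets) * (1 - p)
    # Down-weight easy examples (p_t close to 1) and emphasise hard ones.
    loss = alpha * (1 - p_t) ** gamma * bce
    return loss.mean()

# Usage with toy tensors
logits = torch.randn(4, 10)
targets = torch.randint(0, 2, (4, 10)).float()
print(multilabel_focal_loss(logits, targets))
```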
In addition to incorporating label co-occurrence and imbalances into the loss function computation, contrastive learning techniques have gained popularity in specific multi-label classification tasks [43]. For multi-label text classification, these methods often calculate a combined loss that includes both binary cross-entropy and contrastive loss [44,45]. Interestingly, we encountered similar techniques widely used in computer vision [46,47], which are also predominant in learning dense document representations for dense retrievers in IR systems [48,49,50,51]. A common technique for improving IR-based MLC/XMC systems utilises contrastive loss with in-batch negative sampling [52] and even incorporates label semantics into neural network architectures by combining document and label representations [53,54,55]. These methods differ in their use of sparse or dense label representations and whether they employ the same or different encoders for labels and documents. For example, the XMC X-Transformer model utilises sparse label representations with Label Semantic Indexing, which aids in clustering labels and enhances classification performance [54]. The DEPL model [56] recently enhanced tail label prediction over X-Transformer by using sparse representations to generate pseudo descriptions, which are processed by the same encoder as documents. These and similar dual encoder (DE) models, which extend InfoNCE [57] contrastive loss, improve memorisation, and can generalise for unseen queries, are recent SOTA approaches in XMC [58]. However, a similar approach is employed in the RAE-XMC [22] framework, which does not require training end-to-end for prediction.
Research in low-resource NLP for large-scale multi-label text classification remains scarce, with most existing work focusing on tasks involving a limited number of classes, such as sentiment analysis, hate speech detection, or identification of sexist content [59,60].

3. Data Description

3.1. Data Preparation

The overall news monitoring archive contains over 85 million articles marked with more than 96,000 arbitrary “client topics.” To make the problem tractable and ensure a diverse representation of the language used in news media, we sampled articles from a news monitoring archive spanning January 2023 to December 2023. The sampled NewsMon dataset comprises one million articles from eight countries (Serbia, Slovenia, Bosnia and Herzegovina, Croatia, North Macedonia, Montenegro, Kosovo, and Albania) in nine languages (see Figure 2).
The selection process was guided by the labels corresponding to various industry sectors, enabling broad coverage of sector-specific news (see Figure 3). The dataset is annotated with more than twelve thousand distinct labels. It encompasses all labels tracked by the monitoring system, not just those from the selected industries used for the initial selection.
Each article is assigned labels representing concepts or “topics of interest” to clients. Despite the system’s vast number of labels, each client typically monitors only a few. Each label is associated with several simplified keyword Boolean expressions (KwEs) that determine the initial assignment of labels to news articles (see Table 1).
News articles are automatically labelled according to those KwEs, with additional filtering applied based on article metadata such as country, programme, section, and media outlet. In the final stage of the acquisition processing pipeline, a human moderator reviews the labels assigned to each article, as illustrated in Figure 4. The moderator may confirm, reject, or add additional labels based on their assessment. The labels and their corresponding KwEs are updated whenever a new client joins or leaves, or when a client’s interests change. These updates can occur frequently, with about one hundred monthly label changes on average. Arbitrary labels are also added later by the news media analysis department.
Upon further examination, we observed that the labels fall into two broad classes:
  • Named Entities (NE): mostly Persons, Organisations, Products, Brands and rarely Locations.
  • General Concepts (GC): a set of terms representing a concept, for example E-Mobility, Health insurance, Climate change, and Industrial waste.
We also found numerous labels with KwEs, most often from the NE class, that would match the same text. These duplicates could potentially be removed or the classification process optimised for the NE class of labels. However, we deliberately chose not to apply such interventions to evaluate the robustness of our methods in real-world applications, thereby avoiding the introduction of additional and potentially brittle label-cleaning pipelines. Some labels, especially those of the GC class, are often context-dependent. For example, while the “Competition” label applies to many industries and may seem like a duplicate, its meaning and KwEs vary depending on the context and client. Moreover, we observed that label sets are distinct across different languages in our setting, with limited overlap in their semantic or structural alignment, making cross-lingual analysis unlikely to yield meaningful performance benefits.
For data preparation, we first identified and removed duplicate or near-duplicate samples, which accounted for 18% of the dataset. These duplicates primarily originated from news aggregators, copied press releases, and other replicated content. Although such duplication is valuable in the news monitoring industry for assessing media reach, it was necessary to eliminate it for our classification task. Additionally, we refined the text data by extracting possible titles or metadata-polluted bodies and merging article titles with their respective body text to create a complete representation of each news article.
Due to the large dataset and label set size, we decided to downsample the NewsMon data. To speed up our research cycle for testing various classification methods and techniques, we selected the Slovene subset NewsMonsl, which was chosen because it contains the largest number of distinct labels among all available language-specific subsets (see Figure 5). We focused exclusively on publicly available Slovene data, as the label sets in our dataset are distinct across languages. To ensure relevance and consistency with established benchmarks, we selected data from January 2023 to May 2023, targeting a dataset size and label distribution comparable to well-known multi-label datasets. We removed labels that appeared only once to prepare the dataset for training. Finally, we applied uniform random sampling to partition the dataset into training (80%), validation (10%), and test (10%) sets. Before sampling, the dataset was shuffled to minimise the potential impact of temporal label dynamics; however, some bias in evaluation may still be present due to a long-tail distribution.

3.2. Data Analysis

For comparison, we selected the EURLEX57K dataset [61], commonly referred to as EURLex-4.3K in XMC, as it is a widely used large-scale multi-label text classification dataset with label and instance counts similar to those of our down-sampled version. More importantly, this comparison enables us to assess the performance of our classifiers in Section 4. We adopt some of the EURLEX57K evaluation metrics to evaluate model effectiveness, examining how the same models perform when trained on our dataset, which is written in a language that is less represented in model pre-training. We first introduce label distribution metrics to facilitate this analysis before exploring the similarities and differences between the two datasets. Both datasets, NewsMonsl and EURLEX57K, exhibit a long-tail label distribution, where most labels occur infrequently.
The NewsMonsl dataset comprises 50,784 instances and 3,231 labels, with an average of 2.99 labels per instance. The label density is 0.00093, indicating that labels are distributed sparsely across the dataset. Since we removed single-occurring labels, this results in a slightly lower average of labels per instance but a higher label density, reflecting a more concentrated label distribution after filtering (see Table 2). Definitions of label distribution metrics are explained in Appendix A.
The EURLEX57K dataset, with 57,000 instances and 4271 labels, has an average of 5.07 labels per instance and a label diversity of 35,000, indicating a more challenging label space than the NewsMonsl dataset. A high standard deviation of the cardinality in the NewsMonsl dataset suggests that the number of labels per instance is spread over a broader range. Overall, the datasets exhibit similar properties regarding the long-tail label distribution, as illustrated in Figure 6 and Figure 7.
We applied Uniform Manifold Approximation and Projection (UMAP) to visualise the label space of the dimensionality-reduced training sets for each dataset separately (Figure 8 and Figure 9). While not intended for direct comparison, the visualisations reveal that neither dataset exhibits a clear underlying structure or distinct clustering patterns.

4. Methodology

The proposed methodology is presented in Figure 10. In summary, compared to the original methodology, shown in Figure A2 of Appendix C, we added IR model fine-tuning using hard negatives and omitted label descriptions, as they were not available.
The methodology is explained in detail below. We built on the idea of the RAE-XMC framework initially presented in [22], which decomposes p(y | q), the conditional probability of a label y being relevant to a test input q, into retrieval from a knowledge memory K and prediction conditioned on the retrieved key k and the test input q. Given the test input q, the method retrieves relevant keys k from the knowledge memory K and then uses the predictive score for label y to predict the sample’s label set. Classification is thus decomposed into sampling from the retriever distribution p(k | q; θ) and the predictive score p(y | k, q):
p(y \mid q) = \sum_{k \in K} \underbrace{p(y \mid k, q)}_{\text{Predictor}} \cdot \underbrace{p(k \mid q; \theta)}_{\text{Retriever}}.
The knowledge memory K consists of training instances X and label descriptions Z, each paired with label values V, where the matrix V is composed of the training ground-truth label values Y and an identity matrix I_L representing the ground truth for each label description:
K = [x_1, \ldots, x_N, z_1, \ldots, z_L] = [X, Z] \in \mathbb{R}^{(N+L) \times d},
V = [\lambda Y, (1 - \lambda) I_L] \in [0, 1]^{(N+L) \times L}.
The retriever is then defined as:
p(k \mid q; \theta) = \frac{\exp\!\left(s_\theta(q, k)/\tau\right)}{\sum_{k' \in K} \exp\!\left(s_\theta(q, k')/\tau\right)} = \mathrm{Softmax}\!\left(q K^\top / \tau\right),
and the predictor as a simple label lookup:
p(y \mid k, q) = V_{k, y} \in \{0, 1\},
where the underlying scorer is a dense embedding-based model s_θ(k, q) = ⟨f_θ(k), f_θ(q)⟩ whose encoder f_θ outputs d-dimensional embeddings, and τ is the temperature controlling the skewness of the Softmax distribution.
Our datasets lack label descriptions, and initial experiments with synthetically generated pseudo-descriptions, similar to DEPL [56], produced suboptimal results. Consequently, we opted to simplify the model and conduct our experiments without label pseudo-descriptions. Therefore, our knowledge memory is simplified to:
K = [x_1, \ldots, x_N] = X \in \mathbb{R}^{N \times d},
where the ground-truth label values are:
V = \lambda Y \in [0, 1]^{N \times L}.
We retained the λ hyperparameter and reformulated the simplified model in matrix form as follows:
\hat{p} = \mathrm{Softmax}\!\left(q K^\top / \tau\right) \cdot \lambda Y \in \mathbb{R}^{1 \times L}.
For embedding and similarity search, we employed the best-performing pre-trained IR model from our zero-shot experiments (see Section 5), which we further fine-tuned using hard negatives. The selected pre-trained retrieval model encodes the input query q into a sequence of hidden states H_q using a text encoder. The query representation is obtained by applying layer normalisation to the hidden state corresponding to the special “[CLS]” token of the Transformer Encoder:
f_\theta(q) = \mathrm{norm}(H_q[0]).
Similarly, a key k (training document) is encoded to produce hidden states H_k, and its representation is obtained as:
f_\theta(k) = \mathrm{norm}(H_k[0]),
applying the layer normalisation:
\mathrm{norm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta,
where μ is the mean of the vector x ∈ ℝ^d (the [CLS] hidden state), σ² is its variance, ε is a small stability constant, and β, γ ∈ ℝ^d are learnable scale and shift parameters.
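As a minimal sketch (our illustration, assuming a Hugging Face checkpoint such as BAAI/bge-m3; the layer-normalisation module shown is freshly initialised for illustration, not the checkpoint’s own normalisation layer), the [CLS]-based representation described above can be computed as follows:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Sketch of the [CLS]-based query/key embedding described above.
# Assumption: a Hugging Face encoder checkpoint (here BAAI/bge-m3); the
# LayerNorm below is illustrative and freshly initialised.
model_name = "BAAI/bge-m3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)
norm = torch.nn.LayerNorm(encoder.config.hidden_size)  # learnable gamma, beta

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state   # (batch, seq_len, d)
    return norm(hidden[:, 0])                          # layer-normalised [CLS] state

q_emb = embed(["Example news article about e-mobility."])
k_emb = embed(["Another training document."])
score = q_emb @ k_emb.T   # inner-product relevance s_theta(q, k)
```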
The relevance between the query and the key is then computed as the inner product of their embeddings. We performed further retrieval optimisation with hard negative (HN) mining, inspired by the approach outlined in [62], to be used in conjunction with contrastive learning.
Specifically, we embedded all training samples using the same encoder and identified the closest samples that did not share labels as hard negatives. We operate under a fixed negative budget of at most 15 in-batch negatives per query due to memory constraints during training. Therefore, for each training query q, we selected up to 15 negative keys k K that maximise the total similarity:
K^{-} = \operatorname*{argmax}_{S \subseteq K,\; |S| \le 15} \sum_{k \in S} s_\theta(q, k), \quad \text{subject to } y_q \cap y_k = \emptyset \;\; \forall k \in S,
and one positive key:
k^{*} = \operatorname*{argmax}_{k \in K} s_\theta(q, k), \quad \text{subject to } y_q \cap y_k \neq \emptyset,
where q, k ∈ K and y_q, y_k are their corresponding label sets in V.
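The following sketch (our simplification; the array and variable names are assumptions, and Faiss is used for the nearest-neighbour search) illustrates how such hard negatives can be mined: embed all training documents, retrieve each query’s nearest neighbours, and keep the closest ones that share no labels:

```python
import numpy as np
import faiss  # pip install faiss-cpu

# Hard-negative mining sketch (our simplification of the procedure above).
# Assumptions: `X` holds the d-dimensional training embeddings and `labels`
# the corresponding label sets; both names are illustrative.
def mine_hard_negatives(X, labels, num_negatives=15):
    d = X.shape[1]
    index = faiss.IndexFlatIP(d)                 # exact inner-product search
    index.add(X.astype(np.float32))
    # Retrieve a generous candidate pool for every training query.
    _, nn = index.search(X.astype(np.float32), num_negatives * 4 + 1)
    mined = []
    for i, candidates in enumerate(nn):
        positives = [j for j in candidates if j != i and labels[i] & labels[j]]
        negatives = [j for j in candidates if j != i and not labels[i] & labels[j]]
        mined.append({
            "query": i,
            "positive": positives[0] if positives else None,   # closest key sharing a label
            "negatives": negatives[:num_negatives],             # closest keys sharing none
        })
    return mined

# Usage with toy data
X = np.random.randn(100, 16).astype(np.float32)
labels = [set(np.random.choice(20, 2)) for _ in range(100)]
pairs = mine_hard_negatives(X, labels)
```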
The training process was optimised using the same contrastive InfoNCE loss as in pre-training, formally defined by the following loss function:
\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp\!\left(s_\theta(q, k^{*})/\tau\right)}{\sum_{k \in \{k^{*}\} \cup K^{-}} \exp\!\left(s_\theta(q, k)/\tau\right)},
where k^{*} and K^{-} are the positive and negative samples for the query q, and s_θ(q, k) = ⟨f_θ(q), f_θ(k)⟩ is the scoring function over the dense embeddings f_θ.
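A minimal PyTorch sketch of this loss (our illustration; the tensor shapes are assumptions) is:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q_emb, pos_emb, neg_emb, tau=0.02):
    """InfoNCE loss sketch for one batch of queries.

    q_emb:   (batch, d) query embeddings
    pos_emb: (batch, d) embeddings of the single positive key per query
    neg_emb: (batch, n_neg, d) embeddings of the hard-negative keys
    """
    pos_scores = (q_emb * pos_emb).sum(dim=-1, keepdim=True)    # (batch, 1)
    neg_scores = torch.einsum("bd,bnd->bn", q_emb, neg_emb)     # (batch, n_neg)
    logits = torch.cat([pos_scores, neg_scores], dim=1) / tau   # positive is class 0
    targets = torch.zeros(q_emb.size(0), dtype=torch.long)
    return F.cross_entropy(logits, targets)

# Usage with toy tensors
q = torch.randn(4, 8)
pos = torch.randn(4, 8)
neg = torch.randn(4, 15, 8)
print(info_nce_loss(q, pos, neg))
```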
Lastly, our approach leveraged multi-objective genetic-algorithm hyperparameter optimisation through the Optuna framework [63], which provides superior efficiency compared to traditional grid search. The hyperparameter search space encompassed the retriever temperature τ, the retrieval depth top-k, and the predictor weighting factor λ, with subset accuracy serving as the optimisation objective across 1000 trials (see Appendix D.5 for details).
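A condensed sketch of such a search (our illustration; the evaluation function is a stand-in for validation-set inference, and the default sampler is used here rather than the genetic-algorithm sampler reported above) might look as follows:

```python
import optuna

# Hyperparameter search sketch for the simplified RAE-XMC predictor
# (our illustration; the evaluation function below is a placeholder for
# running inference on the validation set and computing subset accuracy).
def evaluate_subset_accuracy(top_k, tau, lam):
    # Placeholder: plug in the actual validation-set evaluation here.
    return 0.0

def objective(trial):
    top_k = trial.suggest_int("top_k", 10, 100)       # retrieval depth
    tau = trial.suggest_float("tau", 0.01, 0.1)       # softmax temperature
    lam = trial.suggest_float("lambda", 0.1, 1.0)     # predictor weighting factor
    return evaluate_subset_accuracy(top_k, tau, lam)

# The paper reports a genetic-algorithm search; Optuna's default sampler
# is used here purely for illustration.
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=1000)
print(study.best_params)
```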
Algorithm 1 summarises our test-time procedure for the simplified RAE-XMC variant. Given an input text q, we encode it with the hard-negative fine-tuned retriever, retrieve the top-k nearest training instances from the index, and convert their similarity scores into weights via a temperature-scaled Softmax with parameter τ. These weights linearly combine the pre-scaled label matrix V = λ · Y to yield class probabilities, which are then thresholded to produce the final label set. Unless otherwise noted, top-k, τ, and the decision threshold are fixed to the values selected during the hyperparameter search.
Algorithm 1 Modified RAE-XMC Inference
1: Example hyperparameters (illustrative): top-k ← 50, threshold ← 0.5, τ ← 0.04
2: Pre-scale values: V ← λ · Y
3: function RAEXMC_Mod(q, hn_encoder, index, V, top-k, τ, threshold)
4:     embedding ← hn_encoder(q)
5:     K_matches ← index(embedding, top-k)
6:     α ← softmax(K_matches / τ)
7:     probabilities ← α · V
8:     return (probabilities > threshold)
9: end function
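For concreteness, a compact Python rendering of Algorithm 1 follows (a sketch under our assumptions: encode stands in for the hard-negative fine-tuned retriever, K_emb holds the indexed training-set embeddings, and NumPy replaces a dedicated ANN library):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def raexmc_mod(q_text, encode, K_emb, V, top_k=50, tau=0.04, threshold=0.5):
    """Sketch of the modified RAE-XMC inference (Algorithm 1).

    encode: stand-in for the hard-negative fine-tuned retriever, text -> (d,) vector
    K_emb:  (N, d) embeddings of the training instances ("knowledge memory")
    V:      (N, L) pre-scaled label matrix, V = lambda * Y
    """
    q = encode(q_text)                       # embed the input text
    scores = K_emb @ q                       # inner-product similarities
    top = np.argsort(-scores)[:top_k]        # top-k nearest training instances
    alpha = softmax(scores[top] / tau)       # temperature-scaled weights
    probabilities = alpha @ V[top]           # weighted label aggregation, shape (L,)
    return probabilities > threshold         # final multi-label prediction

# Toy usage: random embeddings and labels stand in for the real knowledge memory.
rng = np.random.default_rng(0)
K_emb = rng.normal(size=(1000, 16)).astype(np.float32)
V = 0.5 * (rng.random((1000, 50)) < 0.05)    # lambda * Y with lambda = 0.5
encode = lambda text: rng.normal(size=16).astype(np.float32)
print(raexmc_mod("example article", encode, K_emb, V).sum(), "labels predicted")
```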

5. Experimental Setting

For the initial set of experiments, which tested traditional methods, we used a term frequency–inverse document frequency (TF-IDF) representation in combination with one-versus-all (OVA) classifiers, specifically Support Vector Machines [64] (SVM) and Logistic Regression [65]. For both models, we used grid search [66] to obtain the best-performing hyperparameters (see Appendix D for details).
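A minimal scikit-learn sketch of this baseline follows (our illustration with a toy corpus; the actual runs used the GPU-accelerated cuML implementations and the grid-searched hyperparameters listed in Appendix D):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MultiLabelBinarizer

# Toy corpus standing in for the news articles and their label sets.
texts = ["article about e-mobility",
         "press release on health insurance",
         "industrial waste report"]
label_sets = [{"E-Mobility"}, {"Health insurance"}, {"Industrial waste", "Climate change"}]

vectorizer = TfidfVectorizer(max_features=10_000, max_df=0.8)
X = vectorizer.fit_transform(texts)

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(label_sets)

# One-versus-all: one binary classifier per label over the TF-IDF features.
clf = OneVsRestClassifier(LogisticRegression(C=1000, max_iter=1000))
clf.fit(X, Y)
new_doc = vectorizer.transform(["new article on climate change"])
print(mlb.inverse_transform(clf.predict(new_doc)))
```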
As a baseline, we trained the XLM-RoBERTa-base [67] model five times with different random seeds using binary cross-entropy (BCE) loss and a standard classification head, which produces a probability per label. For the baseline training, we used a standard set of hyperparameters (see Appendix D for details), without performing any hyperparameter optimisation, as our results (0.760 micro-averaged F1-score) exceeded the best reported performance (0.732) by the original EURLEX57K authors [61].
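The baseline setup can be sketched as follows (our illustration using the Transformers API; dataset loading and the training loop are omitted, and the hyperparameters follow Appendix D.3):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Multi-label baseline sketch: XLM-RoBERTa with a sigmoid-per-label head,
# trained with binary cross-entropy (handled by the multi-label problem_type).
num_labels = 3231  # NewsMon_sl label count
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=num_labels,
    problem_type="multi_label_classification",
)

batch = tokenizer(["Example article text"], truncation=True, max_length=512, return_tensors="pt")
labels = torch.zeros(1, num_labels)
labels[0, 42] = 1.0  # illustrative ground-truth label

outputs = model(**batch, labels=labels)    # BCE-with-logits loss computed internally
probs = torch.sigmoid(outputs.logits)      # per-label probabilities
predictions = probs > 0.5
```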
To compare to classical k-Nearest Neighbours (KNN) approaches, we selected the ML-KNN algorithm [68]. The ML-KNN algorithm leverages the KNN approach to classify instances in a multi-label setting by combining neighbour-based similarity with Bayesian inference. First, it identifies the k nearest neighbours of a given data point using a predefined distance metric, typically Euclidean distance. Then, it applies Bayesian inference to estimate the probability of each label being relevant to the test instance based on the labels of its k nearest neighbours. Finally, the algorithm utilises the Maximum a posteriori (MAP) principle to determine the most likely set of labels for the unseen instance, considering the distribution of labels among its nearest neighbours. For all experiments, we employed Faiss’s [69,70] graph-based Approximate Nearest Neighbour (ANN) Hierarchical Navigable Small World (HNSW) algorithm for KNN search. Samples were classified as positive when their predicted probability exceeded a threshold of 0.5.
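The approximate nearest-neighbour search used throughout can be sketched as follows (our illustration; the index parameters shown are assumptions rather than the values used in the experiments):

```python
import numpy as np
import faiss

# HNSW approximate nearest-neighbour index over document embeddings,
# as used for the KNN-style searches in our experiments (parameters illustrative).
d = 1024                                   # embedding dimensionality (dense vectors)
index = faiss.IndexHNSWFlat(d, 32)         # 32 graph neighbours per node
index.hnsw.efSearch = 64                   # search-time breadth/accuracy trade-off

X_train = np.random.randn(10_000, d).astype(np.float32)
index.add(X_train)

queries = np.random.randn(5, d).astype(np.float32)
distances, neighbour_ids = index.search(queries, 10)   # 10 nearest training docs each
```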
When the focus is strictly on pre-trained information retrieval (IR) models, computational efficiency is closely tied to both the architecture and the task-specific tuning present at deployment. Most Small Language Models (SLMs) available today, such as those benchmarked by [71], are generalist language models with optional support for retrieval, question-answering, or in-context learning. However, very few are distributed with explicit, out-of-the-box pre-training for dense or hybrid multilingual IR.
To identify the most effective IR model and assess the feasibility of our approach, we conducted zero-shot experiments. In these experiments, we embedded the training set and classified labels based on the labels of the nearest neighbour. To mitigate the truncation issue discussed in Section 2, in these experiments, we selected three competitive multilingual retrieval models capable of processing extended contexts of up to 8000 subword tokens, listed below:
  • Alibaba-NLP/gte-multilingual-base (GTE-mb) [49];
  • jinaai/jina-embeddings-v3 (Jina-v3) [50];
  • BAAI/bge-m3 (BGE-M3) [51].
The selected model specifications are outlined in Table 3.
These models were either initialised from an XLM-RoBERTa checkpoint or trained from scratch, making them directly comparable to our main experimental baseline. They share a similar transformer encoder architecture and parameter count, and undergo additional multi-stage fine-tuning for IR tasks. Notably, none of these models were explicitly trained on any target South Slavic language.
According to the MTEB benchmark [72], these multilingual models are the top performers among sub-billion parameter language models specifically designed and pre-trained for large-scale, multilingual IR. They exhibit a highly efficient computational profile, requiring only approximately 2.6 GB of GPU memory (see Section 5) for inference in our experiments, which enables deployment on edge devices or targeted consumer-grade hardware [73]. This makes pre-trained IR models particularly desirable for real-world deployments with strict latency, scale, or environmental constraints.
In contrast, most generic small language models (SLMs) with fewer than 1 billion parameters, unless specifically fine-tuned, tend to fall short in both text classification tasks [74,75] and retrieval performance for practical applications [72].
Finally, we selected the BGE-M3 model for further experimentation, as it demonstrated the best performance across both datasets, as shown in Table 4 and Table 5. The zero-shot experiments capture the lower bound of the model’s performance, representing the worst-case expected results under dynamic label space conditions; therefore, a time-based holdout was deemed unnecessary.
Note that for the experimental evaluation reported in Table 4 and Table 5, we conducted a series of experiments using the evaluation metrics explained in Appendix E. We assessed all models using a consistent data split across the entire dataset (setting ’All’ in the ’Frequency’ column) and separately on frequent labels (those occurring 500 times or more) and rare labels (those occurring 10 times or fewer), denoted ’Frequent’ and ’Rare’ in the ’Frequency’ column, respectively. In summary, we observe overall better performance in the zero-shot experiments on the NewsMonsl dataset, which we attribute to its lower label diversity compared to EURLEX57K.
Next, we conducted hard negative (HN) mining using the BGE-M3 model on both the EURLEX57K and NewsMonsl training sets, constructing HN datasets consisting of one positive sample and up to fifteen hard negative samples. We fine-tuned the same model on the HN dataset using the default hyperparameters of the BGE-M3 model. Finally, we performed a hyperparameter search on the validation set for λ, τ, and top-k (see Appendix D).

Inference Efficiency (Latency & Memory)

We quantified the runtime efficiency of our method on a single consumer-grade GPU accelerator with 16 GB of GPU memory, focusing on both device and host memory consumption and end-to-end inference latency. The model’s steady-state device footprint (parameters and runtime buffers) was 1147.4 MB of GPU memory. Under load, the smallest tested batch (batch size = 2) reached an average peak of 2645.0 MB on the GPU, whereas the largest batch that fit on the device (batch size = 384) drove the maximum allocated GPU memory for the process to 7578.7 MB. On the host, the process memory peaked at an average of 2940.6 MB, of which the indexed “knowledge memory” consumed 748.5 MB (see Table 6). Latency and memory consumption were evaluated across 10 selected batch sizes spanning 2 to 384, with 10 trials per batch size. Measured per-sample latencies (see Figure 11) are up to an order of magnitude lower than the time-to-first-token (TTFT) typically reported for decoder-type small language models running on data-centre hardware, as discussed in the Introduction [71]. Across the entire sweep, both latency and memory measurements exhibited negligibly small standard deviations, indicating stable and repeatable inference behaviour.
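A sketch of how such measurements can be collected with PyTorch (our illustration; embed_batch is a stand-in for the retriever forward pass, and the batch is assumed to be a tensor whose first dimension is the batch size):

```python
import time
import torch

# Latency / peak-memory measurement sketch for one batch size (illustrative).
# `embed_batch` is a stand-in for the retriever forward pass on a GPU batch.
def measure(embed_batch, batch, trials=10):
    torch.cuda.reset_peak_memory_stats()
    latencies = []
    for _ in range(trials):
        torch.cuda.synchronize()
        start = time.perf_counter()
        with torch.no_grad():
            embed_batch(batch)
        torch.cuda.synchronize()
        latencies.append(time.perf_counter() - start)
    peak_mb = torch.cuda.max_memory_allocated() / 1024**2
    per_sample_ms = 1000 * sum(latencies) / (trials * batch.shape[0])
    return per_sample_ms, peak_mb
```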

6. Results

6.1. Classification Performance Analysis

In this section, we present the results of our experiments. For the primary analysis, we report micro- and macro-averaged F1-score, precision, and recall as label-based metrics, while subset accuracy is the example-based metric. We evaluate the SVM and Logistic Regression (LogReg) classifiers using the one-versus-all approach with TF-IDF representations (OVA-TFIDF) alongside a fine-tuned XLM-RoBERTa baseline (XLMR). In addition, we evaluate the original BGE-M3 model and the hard-negative fine-tuned BGE-M3 model (FT-BGE) across zero-shot, ML-KNN, and simplified RAE-XMC methods. All models are evaluated on the NewsMonsl and EURLEX57K datasets across all labels, followed by a detailed analysis of performance on frequent and rare labels. Finally, we conclude the analysis by attributing performance gains to hard-negative fine-tuning and to the modified RAE-XMC approach.
The first set of experimental results in Table 7 shows that the FT-BGE retrieval model outperformed all other models on the NewsMonsl dataset across all metrics. In terms of micro-averaged F1-score (micro-F1), macro-averaged precision, and subset accuracy, the simplified RAE-XMC method performed the best. In contrast, the zero-shot approach yielded the highest macro F1-score, primarily due to its superior recall. Notably, even the non-fine-tuned BGE-M3 retrieval model outperformed the baseline with the RAE-XMC method across all metrics except for micro precision. When examining the zero-shot performance improvement on the NewsMonsl dataset after fine-tuning the IR model, we observe an average gain of 16.7% in micro-averaged metrics, 10.1% in macro-averaged metrics, and a 20.6% increase in subset accuracy.
An examination of the EURLEX57K dataset experiment results in Table 8 reveals that the micro-averaged F1-score and precision are significantly lower for the RAE-XMC method than the baseline. Conversely, the fine-tuned BGE-M3 model, combined with the modified RAE-XMC method, achieved the highest macro-averaged F1-score and subset accuracy.
However, the zero-shot method demonstrates higher recall than RAE-XMC, consistent with the findings from the previous set of experiments on NewsMonsl. These findings suggest that further enhancements are needed to improve the recall of our approach compared to the zero-shot method. Moreover, improvements in micro-averaged F1-score and precision are also required to surpass the baseline performance for the EURLEX57K dataset. We hypothesise that this outcome can be attributed to a combination of the following factors:
  • The relatively worse performance of multilingual models on English-language tasks, as evidenced by differences in zero-shot performance (see Table 4 and Table 5) and MTEB benchmark rankings (Table 3);
  • The higher degree of label diversity present in the EURLEX57K dataset, which makes correct predictions harder;
  • The suboptimal performance of our method on the frequent labels within this dataset (see Table 9 in comparison to Table 10).
Notably, the SVM classifier outperforms the baseline in terms of subset accuracy and achieves the highest macro-averaged precision among all evaluated models.
Similarly, we observe a zero-shot performance improvement on the EURLEX57K dataset after fine-tuning the IR model, with an average gain of 2.8% in micro-averaged metrics, 4.5% in macro-averaged metrics, and a 10.6% increase in subset accuracy.
For frequent (head) labels in NewsMonsl that appear 500 times or more (see Table 9), the FT-BGE model combined with the modified RAE-XMC approach outperformed all other models in terms of F1-score, precision, and subset accuracy. The exceptions were recall, where the zero-shot method performed best, consistent with the results observed in our initial experiments across all labels. If we observe only the non-modified BGE-M3 model, we can see that its performance does not surpass the baseline in this third set of experiments.
In contrast, the fourth set of experiments on EURLEX57K indicates that our approach fell significantly short of surpassing the baseline. This result suggests that the XLMR model exhibits substantially stronger classification performance for the English language when presented with enough training samples and a smaller label space (see Table 10). This suggests that an optimal approach could involve a gated ensemble of classifiers based on label frequency. However, the dynamic nature of the label space, coupled with the inconsistent results observed on the frequent labels of the NewsMonsl dataset, presents significant challenges to this approach. In experiments involving frequent labels, we observe that the macro-averaged F1 score can be relatively high even when the subset accuracy remains low. This discrepancy arises from the fundamental difference between the two metrics: macro F1 evaluates performance at the class level, while subset accuracy operates at the instance level. As a result, a model may perform well when assessed per label but struggle to predict all labels correctly for a given instance. Moreover, subset accuracy is an “all-or-nothing” metric, highlighting that the model often predicts some, but not all, of the correct labels. Compared to the first two experiments, where subset accuracy exceeded macro F1, the influence of rare labels in this setting leads to the opposite effect on these metrics.
Finally, when analysing the performance on rare (tail) labels, it is evident that ML-KNN and the baseline approach performed poorly.
In both sets of experiments on the EURLEX57K and NewsMonsl datasets, the baseline and ML-KNN models were unable to classify rare labels effectively, as shown in Table 11 and Table 12. This implies that a minimal amount of training data is insufficient for these approaches to be effective. Interestingly, in experiments on rare labels within the EURLEX57K and NewsMonsl datasets, both zero-shot methods outperformed the simplified RAE-XMC approach in terms of accuracy.

6.2. Factor Attribution Analysis

Most gains come from hard-negative fine-tuning; RAE-XMC adds precision but slightly hurts tail recall (see Table 13 for deltas by factor). On the NewsMonsl dataset, most of the gains stem from hard-negative (HN) fine-tuning of the retriever, which yields +9.89 pp for micro-F1 and +8.80 pp for subset accuracy relative to the zero-shot 1-Nearest Neighbour (1-NN) retriever without fine-tuning. The RAE-XMC retrieval/predictor contributes a smaller, precision-oriented boost (+6.54 pp for micro-F1 and +1.91 pp for accuracy over 1-NN with the same encoder), but slightly reduces tail performance, consistent with softmax-weighted neighbour aggregation favouring head labels (i.e., neighbour weighting skews toward frequent labels; Table 7, Table 8, Table 9, Table 10, Table 11 and Table 12). We expect hard-negative (HN) fine-tuning to increase classification recall through improved retrieval precision, since it makes the retriever more discriminative (retrieved neighbours are more likely to share true labels with the query). On the other hand, we expect the RAE-XMC retrieval/predictor mechanism, a softmax-weighted averaging over multiple neighbours, to increase classification precision. For NewsMonsl, the interaction between RAE-XMC and HN is mildly negative for micro-F1 (all: −1.80 pp, frequent: −0.95 pp, rare: −4.31 pp) and slightly positive for subset accuracy on the all and frequent splits (all: +2.83 pp, frequent: +1.08 pp) but negative on the rare split (−3.66 pp), indicating overlapping effects on recall and complementary effects on precision, consistent with the observation that adding RAE-XMC after HN does not increase recall.
On EURLEX57K, both components contribute small, partially overlapping improvements (from + 1.85 pp to + 3.20 pp for micro-F1 and from + 0.35 pp to + 2.46 pp for accuracy), and their interaction term is near-zero for accuracy, indicating largely additive effects.
Overall, on NewsMonsl, hard-negative fine-tuning is the primary driver of improvement. At the same time, the RAE-XMC retrieval/predictor yields smaller but consistent gains on the all and frequent splits across both datasets. For the rare split, the modified RAE-XMC generally reduces performance, so it should be applied with caution on tails. We hypothesise that the larger gains on NewsMonsl versus EURLEX57K could also arise from the keyword-driven labelling pipeline in NewsMonsl (Section 3, Table 1), which introduces lexical “anchors” that retrieval can exploit.

7. Conclusions: Summary of Advances and Lessons Learned

We demonstrate that information retrieval (IR) models outperform end-to-end fine-tuned approaches in settings with a large number of labels when dealing with less-represented languages. Furthermore, our IR-based method demonstrates a degree of effectiveness even without fine-tuning, making it well-suited for continuously evolving systems with dynamic label spaces, particularly when scalability to an arbitrary label space is a critical requirement.
The results indicate that while our approach surpasses the XLMR baseline on the NewsMonsl dataset, the overall findings for both datasets remain inconclusive. Specifically, our approach outperforms the XLMR baseline across all reported metrics on the NewsMonsl dataset, where HN fine-tuning is the largest contributor. However, on EURLEX57K, the micro-averaged F1-score and precision remain lower than those of the baseline. In particular, on the EURLEX57K dataset, where labels are more consistent and data are predominantly in English, the method leaves room for improvement, as our method surpassed the baselines only on macro-averaged scores and subset accuracy.
These findings suggest that pre-trained, end-to-end fine-tuned models may be more effective for tasks in the predominant language used during model pre-training, particularly when the number of model parameters is limited due to resource constraints [76]. In contrast, multilingual IR models tend to underperform in comparison. Both observations underscore the importance of fairness in the design of multilingual models. Less-represented languages are often under-resourced and underrepresented in large-scale pre-training corpora, resulting in lower quality representations and limited task performance [77]. Nevertheless, retrieval-based models excel with rare labels, even when no fine-tuning is applied. As a result, the findings for research question RQ1 remain inconclusive across both datasets, underscoring the need for further investigation and refinement of our approach in high-resource, monolingual settings.
On the other hand, zero-shot and unchanged IR model experiments demonstrate that pre-trained multilingual retrieval models can be used effectively (RQ3), as also suggested by [22], provided that the predictor component is further improved or a true dual-encoder architecture is employed [58].
This also addresses the research question RQ2, indicating that while the method’s effectiveness without label descriptions is limited, it presents new research opportunities. In particular, incorporating a trainable λ vector or matrix model that assigns importance to the labels of each training sample could enhance performance.
Furthermore, we demonstrate that incorporating contrastive learning with hard negative sampling further improves the IR model within our classification setting, addressing research question RQ4. Given that the simplified RAE-XMC method outperformed the XLMR baseline without requiring fine-tuning on the NewsMonsl dataset, our results demonstrate the potential for developing a method suitable for rapid label updates. Moreover, the approach relies on a model with fewer than one billion parameters, making it well-suited for low-resource environments.
Notably, the proposed method is language-agnostic and thus applicable to less-represented or low-resource languages, such as Slovene and Serbian, as the underlying retrieval models were not specifically fine-tuned on these languages. Although we did not conduct cross-lingual experiments, we believe the approach can be effectively extended to cross-lingual settings, provided the retriever exhibits strong cross-lingual alignment and is fine-tuned with hard negatives drawn from datasets across all target languages; such transfer is likely to be especially effective for languages within the same family (see Appendix B for preliminary research).
Additionally, our information retrieval approach enhances scalability further through the use of knowledge memory embeddings per language and model. This compact embedding footprint enables the system to efficiently support numerous languages in parallel or even with a single model. Importantly, these knowledge memory embeddings can be offloaded to external retrieval systems when needed, allowing for flexible deployment on various hardware configurations and seamless integration with scalable, distributed retrieval infrastructures.

8. Future Work

Future work will focus on several key areas of improvement to enhance the classification model and consequently improve the reliability of result interpretation. First, we plan to extend the pre-training of our models to include the Serbian, Bosnian, Croatian, and Macedonian languages, allowing us to assess the effectiveness of our methods in a truly multilingual context (see Appendix B for preliminary research). Additionally, we will explore the impact of fine-tuning the IR model on generalisation and its applicability to unseen labels. Finally, we intend to leverage the provided simplified Keyword-Boolean expressions to identify label-relevant text passages and experiment with smaller text samples, which could help improve precision. This approach may also facilitate the generation of synthetic label descriptions, which could be used in contrastive learning to enhance precision further.

Author Contributions

Conceptualization, N.I.; methodology, N.I., B.Š. and B.K.; software, N.I.; validation, B.Š., B.K. and M.P.; formal analysis, B.Š., B.K. and M.P.; investigation, N.I., B.Š. and B.K.; resources, N.I.; data curation, N.I.; writing—original draft preparation, N.I. and N.L.; writing—review and editing, S.P., N.L. and M.P.; visualization, B.Š.; supervision, S.P., N.L. and M.P.; project administration, S.P. and N.L.; funding acquisition, S.P. and N.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Slovenian Research and Innovation Agency (ARIS) through the following grants: GC-0002 (Large Language Models for Digital Humanities), GC-0001, P2-0103 (Knowledge Technologies), PR-12394 (Young Researcher Grant), J4-4555, and L2-50070 (Embeddings-based Techniques for Media Monitoring Applications). In addition, this work received partial funding from the European Union under the HORIZON-WIDERA-2023-TALENTS-01-01 program, grant No. 101186647 — AI4DH.

Data Availability Statement

The EURLEX57K (https://auebgr-my.sharepoint.com/:u:/r/personal/nlp_aueb_gr/Documents/nlp/software_and_datasets/EURLEX57K/datasets.zip, accessed on 11 August 2025) data presented in the study are openly available by the Department of Informatics—Athens University of Economics and Business, Natural Language Processing Group (http://nlp.cs.aueb.gr/, accessed on 11 August 2025). The NewsMon datasets presented in this article are not readily available because of copyright restrictions. Requests to access the datasets should be directed to the Department of Knowledge Technologies, Jožef Stefan Institute, 1000 Ljubljana, Slovenia.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Label Distribution Metrics

To compare the datasets, we used common label distribution metrics [78,79]. We define Label Cardinality as the average number of labels per sample:
\mathrm{Card}(N) = \frac{1}{n} \sum_{i=1}^{n} |Y_i|
Next, we define Label Density as the average number of labels per instance divided by the total number of labels:
\mathrm{Dens}(N) = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i|}{|L|}
Finally, we define Label Diversity as the number of distinct label combinations in the sample set:
\mathrm{Diver}(N) = \left|\left\{ Y_x \subseteq L : (x, Y_x) \in N \right\}\right|,
where N is the multi-label dataset, n is the number of instances in N (i.e., n = | N | ), Y i is the set of labels for the i-th instance, and L is the set of all possible labels in the dataset.
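These metrics can be computed directly from the per-instance label sets, as in the following sketch (our illustration):

```python
import numpy as np

def label_distribution_metrics(label_sets, all_labels):
    """Compute Label Cardinality, Density, and Diversity for a multi-label dataset.

    label_sets: list of label sets, one per instance (e.g. [{"A", "B"}, {"A"}])
    all_labels: set of all possible labels L
    """
    sizes = np.array([len(y) for y in label_sets], dtype=float)
    cardinality = sizes.mean()                               # average labels per sample
    density = (sizes / len(all_labels)).mean()               # cardinality scaled by |L|
    diversity = len({frozenset(y) for y in label_sets})      # distinct label combinations
    return cardinality, density, diversity

# Toy usage
label_sets = [{"A", "B"}, {"A"}, {"B", "C"}, {"A", "B"}]
print(label_distribution_metrics(label_sets, {"A", "B", "C"}))
# -> (1.75, 0.583..., 3)
```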

Appendix B. Serbian and Macedonian Language Preliminary Research

Appendix B.1. Data Analysis

Table A1. Dataset statistics for the complete set (NewsMon), selected initial set of labels (NewsMoninitial), the down-sampled datasets in Serbian and Macedonian (with and without duplicates).

Dataset        | Samples   | Labels | Cardinality   | Density | Diversity
NewsMon        | 1,068,261 | 12,305 | 3.392 ± 4.017 | 0.00028 | 267,564
NewsMoninitial | 1,068,261 | 1960   | 2.213 ± 2.316 | 0.00114 | 183,361
NewsMonsr+dups | 88,074    | 2149   | 2.835 ± 3.181 | 0.00132 | 20,272
NewsMonsr      | 83,947    | 2128   | 2.826 ± 3.190 | 0.00133 | 19,809
NewsMonmk+dups | 20,451    | 505    | 4.821 ± 3.863 | 0.00955 | 3022
NewsMonmk      | 12,133    | 494    | 4.341 ± 3.773 | 0.00879 | 2556
We selected Macedonian, Serbian and Slovene as representative languages because they exhibit the highest degree of morphological diversity within our dataset. These languages differ notably, making them valuable for assessing model generalisation across varied morphologies. In contrast, Bosnian and Croatian were excluded from this analysis, as they are linguistically and morphologically closely related to Serbian and would therefore contribute limited additional variation or insight into cross-lingual robustness. To align the sample sizes to the same magnitude, we selected two months of data for the NewsMonsr dataset and twelve months of data for the NewsMonmk dataset.
Figure A1. (a) The distribution of label occurrences in the NewsMonsr dataset. (b) The distribution of label occurrences in the NewsMonmk dataset.
Table A2. NewsMonsr (all labels): micro/macro F1, P, R, and subset accuracy for SVM and logistic regression (OvA TF-IDF), BGE-M3, and fine-tuned BGE-M3 under zero-shot and RAE-XMC.

Model    | Method    | μF1   | μP    | μR    | F1    | P     | R     | Acc
SVM      | OVA-TFIDF | 69.15 | 76.45 | 63.13 | 33.43 | 39.26 | 30.97 | 44.80
LogReg   | OVA-TFIDF | 68.70 | 77.54 | 61.66 | 32.61 | 39.64 | 29.51 | 44.67
BGE-M3   | zshot     | 63.71 | 61.93 | 65.59 | 33.72 | 33.48 | 36.40 | 45.04
BGE-M3   | RAE-XMC   | 68.86 | 76.85 | 62.38 | 34.40 | 39.41 | 32.70 | 47.92
FT-BGEsl | zshot     | 65.65 | 65.07 | 66.23 | 35.19 | 35.73 | 37.22 | 46.33
FT-BGEsl | RAE-XMC   | 68.62 | 82.64 | 58.66 | 33.50 | 39.69 | 31.05 | 48.98
Note: Best results are shown in bold, and second-best results are underlined.

Appendix B.2. Preliminary Experimental Results

We report preliminary results for NewsMonsr and NewsMonmk in Table A2 and Table A3, respectively. The XLM-R baseline is not included in this release. As summarised in Table A4, RAE-XMC contributes the largest single share of the gains, especially for subset accuracy (+2.88 pp on NewsMonsr, +4.00 pp on NewsMonmk). At the same time, hard-negative fine-tuning adds complementary improvements in micro-F1 (+1.94 pp on NewsMonsr, +3.56 pp on NewsMonmk). The small negative interaction terms suggest partial redundancy between the two components, yet the combined effect remains clearly positive overall (+4.90 pp and +5.96 pp in μF1 for NewsMonsr and NewsMonmk, respectively). Importantly, the HN–fine-tuned encoder (FT-BGEsl) was transferred from Slovenian rather than tuned on Serbian or Macedonian; therefore, these effects reflect cross-lingual transfer and can diverge from the main, language-specific results, where the HN fine-tuning was the highest contributor. Consequently, factor attribution and classification performance figures should be interpreted as provisional, reflecting cross-lingual transfer rather than language-specific optimisation.
Table A3. NewsMonmk (all labels): micro/macro F1, P, R, and subset accuracy for SVM and logistic regression (OvA TF-IDF), BGE-M3, and fine-tuned BGE-M3 under zero-shot and RAE-XMC.

Model    | Method    | μF1   | μP    | μR    | F1    | P     | R     | Acc
SVM      | OVA-TFIDF | 81.72 | 87.67 | 76.53 | 46.07 | 49.96 | 44.14 | 49.43
LogReg   | OVA-TFIDF | 80.98 | 88.38 | 74.72 | 44.23 | 49.46 | 41.64 | 47.71
BGE-M3   | zshot     | 76.72 | 74.62 | 78.95 | 44.67 | 44.28 | 46.80 | 47.14
BGE-M3   | RAE-XMC   | 80.45 | 85.81 | 75.72 | 43.39 | 47.33 | 41.92 | 51.14
FT-BGEsl | zshot     | 80.29 | 79.48 | 81.11 | 47.95 | 47.74 | 49.34 | 49.02
FT-BGEsl | RAE-XMC   | 82.68 | 90.56 | 76.06 | 45.17 | 50.13 | 42.97 | 51.63
Note: Best results are shown in bold, and second-best results are underlined.
Table A4. Factor attribution (absolute %-point deltas) for Serbian (NewsMonsr) and Macedonian (NewsMonmk) datasets. Effects correspond to hard-negative fine-tuning (HN), RAE-XMC retrieval, and their interaction.

Dataset   | ΔμF1 HN | ΔμF1 RAE | ΔμF1 Int. | ΔμF1 Total | ΔAcc HN | ΔAcc RAE | ΔAcc Int. | ΔAcc Total
NewsMonsr | +1.94   | +5.15    | −2.18     | +4.90      | +1.29   | +2.88    | −0.23     | +3.95
NewsMonmk | +3.56   | +3.72    | −1.33     | +5.96      | +1.88   | +4.00    | −1.39     | +4.49
Note: “HN” = hard-negative fine-tuning effect (FT-BGE zero-shot – BGE-M3 zero-shot); “RAE” = RAE-XMC retrieval effect (BGE-M3 RAE – BGE-M3 zero-shot); “Int.” = interaction term explaining residual; “Total” = overall delta from BGE-M3 zero-shot → FT-BGE RAE-XMC. Values derived from Table A2 and Table A3.

Appendix C. The Original RAE-XMC Framework

Figure A2. Original RAE-XMC framework. Reproduced from Wang, Y.-S., Chang, W.-C., Jiang, J.-Y., Zhang, J., Yu, H.-F., & Vishwanathan, S. V. N. (2025), Retrieval-augmented Encoders for Extreme Multi-label Text Classification [22], arXiv:2502.10615. Licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). Changes: resized; otherwise no changes.

Appendix D. Hyperparameters

Appendix D.1. SVM Hyperparameters

C-Support Vector Classification (SVC) was used for the SVM classifier, implemented with the RAPIDS cuML library [80] and Scikit-learn's [66] multi-target classification strategy, which fits one classifier per target. For the SVC we used:
  • Radial basis function (RBF) kernel.
  • Regularisation parameter (C) set to 10,000 for NewsMonsl and 10 for EURLEX57K.
  • Gamma coefficient for the RBF kernel set to 0.001 for NewsMonsl and 1.0 for EURLEX57K.
  • Maximum term document frequency set to 0.8 for NewsMonsl and 1.0 for EURLEX57K.
The hyperparameters were obtained through a grid search using Scikit-learn. Due to the high memory consumption of the SVM classification, which exceeded the available hardware when all classifiers were trained simultaneously, we partitioned the label space into batches of 240 to maintain memory usage below 16 GB during both training and evaluation. For the TF-IDF representation, we used Scikit-learn’s TfidfVectorizer with a maximum of 10,000 features.
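The sketch below outlines this setup; it is illustrative rather than the exact production script, load_dataset() is a hypothetical helper standing in for our data loading, and the hyperparameter values shown correspond to the NewsMonsl configuration listed above.
```python
# OvA TF-IDF SVM sketch: one cuML SVC per label via scikit-learn's multi-target strategy,
# with the label space processed in batches of 240 columns to stay below ~16 GB.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from cuml.svm import SVC  # RAPIDS cuML

texts, Y = load_dataset()  # hypothetical helper: list[str] and a binary (n_samples, n_labels) matrix

vec = TfidfVectorizer(max_features=10_000, max_df=0.8)      # NewsMonsl setting
X = vec.fit_transform(texts).toarray().astype(np.float32)

BATCH = 240
predictions = []
for start in range(0, Y.shape[1], BATCH):
    cols = slice(start, start + BATCH)
    clf = MultiOutputClassifier(SVC(kernel="rbf", C=10_000, gamma=0.001))  # NewsMonsl values
    clf.fit(X, Y[:, cols])
    predictions.append(clf.predict(X))  # in practice, predict on the held-out split
Y_pred = np.hstack(predictions)
```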

Appendix D.2. Logistic Regression Hyperparameters

For the logistic regression model, we used the same libraries and strategy as for the SVM, with the following best hyperparameters:
  • L2 penalty.
  • Tolerance for the stopping criteria: 0.0001.
  • Inverse of regularisation strength (C) set to 1000 for NewsMonsl and 10 for EURLEX57K.
  • Maximum term document frequency set to 0.8 for both NewsMonsl and EURLEX57K.
For both the SVM and logistic regression models, we used a single consumer-grade GPU (NVIDIA GeForce 4070 Ti Super).
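A minimal sketch of the corresponding cuML estimator is shown below, assuming the same OvA TF-IDF pipeline as in the SVM example above; the values shown are the NewsMonsl setting.
```python
# Logistic regression baseline: swap the SVC in the previous sketch for cuML's
# LogisticRegression with the hyperparameters listed in Appendix D.2.
from cuml.linear_model import LogisticRegression

logreg = LogisticRegression(penalty="l2", tol=1e-4, C=1000)  # NewsMonsl; C=10 for EURLEX57K
```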

Appendix D.3. XLM-RoBERTa-Base Fine-Tuning Hyperparameters

For the baseline training, we did not perform hyperparameter optimisation; all the models were fine-tuned using the default set of hyperparameters from HuggingFace's Transformers library [81], optimised for a large selection of common NLP tasks:
  • AdamW optimiser with a learning rate of 3 × 10⁻⁵.
  • Weight decay set to 0.01 for regularisation.
  • Training for a maximum of 30 epochs.
  • Batch size of 16.
  • Maximum length of 512 sub-word tokens.
  • Best model selection based on the validation set micro F1-score.
We used a single consumer-grade GPU (NVIDIA GeForce 4070 Ti Super).
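A minimal sketch of this baseline using HuggingFace's Trainer is given below; the dataset objects and the compute_metrics function are placeholders, argument names may vary slightly across Transformers versions, and the label-space size shown is that of NewsMonsl (Table 2).
```python
# XLM-RoBERTa multi-label baseline sketch with the Appendix D.3 hyperparameters.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")  # inputs truncated to 512 tokens
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=3231,                            # NewsMonsl label-space size (Table 2)
    problem_type="multi_label_classification",  # sigmoid outputs + BCE loss
)

args = TrainingArguments(
    output_dir="xlmr-mlc",
    learning_rate=3e-5,                # AdamW is the default optimiser
    weight_decay=0.01,
    num_train_epochs=30,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="micro_f1",  # provided by a custom compute_metrics (placeholder)
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds,   # train_ds/val_ds: placeholders
                  eval_dataset=val_ds, compute_metrics=compute_metrics)
trainer.train()
```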

Appendix D.4. BGE-M3 Fine-Tuning Hyperparameters

BGE-M3 models were fine-tuned using the default set of hyperparameters from the FlagEmbedding library:
  • AdamW optimiser with a learning rate of 1 × 10⁻⁵.
  • Train group size of 4.
  • Batch size of 2.
  • Maximum length of 4096 sub-word tokens for query and passage.
  • Temperature 0.02.
  • 20 epochs.
  • Precision fp16.
We used a single data centre GPU (NVIDIA A100).
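For reference, FlagEmbedding's fine-tuning scripts consume JSONL records with query, pos, and neg fields; the sketch below shows the record format we assume for the hard-negative fine-tuning data, with toy contents standing in for the actual articles and mined negatives.
```python
# Writing contrastive training records in the JSONL format expected by FlagEmbedding.
import json

# Toy illustrative records: in practice the query is an article, "pos" contains passages
# sharing its gold labels, and "neg" contains mined hard negatives with disjoint labels.
records = [
    {
        "query": "Article text about the automotive sector ...",
        "pos": ["Passage sharing at least one gold label ..."],
        "neg": ["Semantically similar passage with disjoint labels ..."],
    },
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```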

Appendix D.5. Retrieval Model Hyperparameter Search

We ran a hyperparameter search for 1000 trials, searching for the best top-k ∈ [10, 100], temperature τ ∈ [0.01, 0.1], and λ ∈ [0.1, 1.0] parameters, maximising the subset accuracy.
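A minimal sketch of this search with Optuna [63] is shown below; rae_xmc_predict() and the toy data are placeholders for the actual retrieval predictor and validation split, and only the search ranges and the objective follow the description above.
```python
# Optuna search over the RAE-XMC inference parameters, maximising subset accuracy.
import numpy as np
import optuna

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 20))  # dummy validation label matrix (placeholder)

def rae_xmc_predict(top_k, tau, lam):
    # Placeholder for the actual RAE-XMC predictor; returns random label sets here.
    return rng.integers(0, 2, size=y_true.shape)

def subset_accuracy(y, y_hat):
    return float((y == y_hat).all(axis=1).mean())

def objective(trial):
    top_k = trial.suggest_int("top_k", 10, 100)
    tau = trial.suggest_float("tau", 0.01, 0.1)
    lam = trial.suggest_float("lam", 0.1, 1.0)
    return subset_accuracy(y_true, rae_xmc_predict(top_k, tau, lam))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)  # the reported search used 1000 trials
print(study.best_params)
```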
Table A5. RAE-XMC hyperparameter search results.

Model    Dataset     Top-k   τ       λ
BGE-M3   EURLEX57K   10      0.031   0.998
BGE-M3   NewsMon     16      0.091   0.999
FT-BGE   EURLEX57K   69      0.031   1.000
FT-BGE   NewsMon     13      0.098   0.816

Appendix E. Evaluation Metrics

In this section, we use $N$ for the number of test samples, $L$ for the number of labels, $y_i$ for an individual example label set, and $\hat{y}_i$ for the predicted example label set.

Appendix E.1. Example-Based Evaluation Metrics

For the example-based evaluation, we used subset accuracy:
$$\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left(y_i = \hat{y}_i\right),$$
where $\mathbb{I}(\mathrm{true}) = 1$ and $\mathbb{I}(\mathrm{false}) = 0$.
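A direct implementation on binary indicator matrices, for reference (the toy arrays are illustrative):
```python
# Subset accuracy: a prediction counts only if the full label set matches exactly.
import numpy as np

def subset_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # y_true, y_pred: (N, L) binary indicator matrices
    return float((y_true == y_pred).all(axis=1).mean())

y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 1], [0, 1, 1]])
print(subset_accuracy(y_true, y_pred))  # 0.5
```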

Appendix E.2. Label-Based Evaluation Metrics

For the label-based macro-averaged evaluation, metrics are undefined when true positives (TP), false negatives (FN), and false positives (FP) are all zero for a label; in such cases, we assign a value of zero to that label in the macro average. We used:
$$\mathrm{Precision}_{\mathrm{macro}} = \frac{1}{L} \sum_{i=1}^{L} \frac{TP_i}{TP_i + FP_i}$$
$$\mathrm{Recall}_{\mathrm{macro}} = \frac{1}{L} \sum_{i=1}^{L} \frac{TP_i}{TP_i + FN_i}$$
For the label-based micro-averaged evaluation, we used:
$$\mathrm{Precision}_{\mathrm{micro}} = \frac{\sum_{i=1}^{L} TP_i}{\sum_{i=1}^{L} \left(TP_i + FP_i\right)}$$
$$\mathrm{Recall}_{\mathrm{micro}} = \frac{\sum_{i=1}^{L} TP_i}{\sum_{i=1}^{L} \left(TP_i + FN_i\right)}$$
In both cases, the $F_1$-score can be computed as follows:
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
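For reference, the sketch below computes these quantities on binary indicator matrices, assigning zero whenever a denominator is zero, which subsumes the convention described above:
```python
# Label-based macro/micro precision, recall, and F1 for multi-label indicator matrices.
import numpy as np

def label_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    tp = ((y_true == 1) & (y_pred == 1)).sum(axis=0)
    fp = ((y_true == 0) & (y_pred == 1)).sum(axis=0)
    fn = ((y_true == 1) & (y_pred == 0)).sum(axis=0)

    def safe_div(num, den):
        # assign 0 to labels whose denominator is zero
        return np.where(den > 0, num / np.maximum(den, 1), 0.0)

    p_macro = safe_div(tp, tp + fp).mean()
    r_macro = safe_div(tp, tp + fn).mean()
    p_micro = tp.sum() / max(tp.sum() + fp.sum(), 1)
    r_micro = tp.sum() / max(tp.sum() + fn.sum(), 1)

    def f1(p, r):
        return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

    return {"P_macro": p_macro, "R_macro": r_macro, "F1_macro": f1(p_macro, r_macro),
            "P_micro": p_micro, "R_micro": r_micro, "F1_micro": f1(p_micro, r_micro)}

y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 1], [0, 1, 1]])
print(label_metrics(y_true, y_pred))
```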

References

  1. Rupnik, J.; Muhic, A.; Leban, G.; Skraba, P.; Fortuna, B.; Grobelnik, M. News across languages-cross-lingual document similarity and event tracking. J. Artif. Intell. Res. 2016, 55, 283–316. [Google Scholar] [CrossRef]
  2. Zhang, Y.; Guo, F.; Shen, J.; Han, J. Unsupervised Key Event Detection from Massive Text Corpora. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’22, Washington, DC, USA, 14–18 August 2022; pp. 2535–2544. [Google Scholar] [CrossRef]
  3. Yoon, S.; Lee, D.; Zhang, Y.; Han, J. Unsupervised Story Discovery from Continuous News Streams via Scalable Thematic Embedding. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’23, Taipei, Taiwan, 23–27 July 2023; pp. 802–811. [Google Scholar] [CrossRef]
  4. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
  5. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv 2020, arXiv:1910.10683. [Google Scholar]
  6. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, Virtual, 6–12 December 2020. [Google Scholar]
  7. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
  8. Tarekegn, A.N.; Ullah, M.; Cheikh, F.A. Deep Learning for Multi-Label Learning: A Comprehensive Survey. arXiv 2024, arXiv:2401.16549. [Google Scholar] [CrossRef]
  9. Li, Q.; Peng, H.; Li, J.; Xia, C.; Yang, R.; Sun, L.; Yu, P.S.; He, L. A Survey on Text Classification: From Traditional to Deep Learning. ACM Trans. Intell. Syst. Technol. 2022, 13, 1–41. [Google Scholar] [CrossRef]
  10. Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
  11. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  12. Team, G.; Anil, R.; Borgeaud, S.; Alayrac, J.B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; Millican, K.; et al. Gemini: A Family of Highly Capable Multimodal Models. arXiv 2023, arXiv:2312.11805. [Google Scholar] [CrossRef]
  13. Abdin, M.; Aneja, J.; Behl, H.; Bubeck, S.; Eldan, R.; Gunasekar, S.; Harrison, M.; Hewett, R.J.; Javaheripi, M.; Kauffmann, P.; et al. Phi-4 Technical Report. arXiv 2024, arXiv:2412.08905. [Google Scholar] [CrossRef]
  14. Team, G.; Kamath, A.; Ferret, J.; Pathak, S.; Vieillard, N.; Merhej, R.; Perrin, S.; Matejovicova, T.; Ramé, A.; Rivière, M.; et al. Gemma 3 technical report. arXiv 2025, arXiv:2503.19786. [Google Scholar] [CrossRef]
  15. Subramanian, S.; Elango, V.; Gungor, M. Small Language Models (SLMs) Can Still Pack a Punch: A survey. arXiv 2025, arXiv:2501.05465. [Google Scholar]
  16. Vajjala, S.; Shimangaud, S. Text Classification in the LLM Era—Where do we stand? arXiv 2025, arXiv:2502.11830. [Google Scholar]
  17. Muralidharan, S.; Sreenivas, S.T.; Joshi, R.; Chochowski, M.; Patwary, M.; Shoeybi, M.; Catanzaro, B.; Kautz, J.; Molchanov, P. Compact Language Models via Pruning and Knowledge Distillation. arXiv 2024, arXiv:2407.14679. [Google Scholar] [CrossRef]
  18. Gu, Y.; Dong, L.; Wei, F.; Huang, M. MiniLLM: Knowledge Distillation of Large Language Models. arXiv 2023, arXiv:2306.08543. [Google Scholar]
  19. Malinovskii, V.; Mazur, D.; Ilin, I.; Kuznedelev, D.; Burlachenko, K.; Yi, K.; Alistarh, D.; Richtarik, P. PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression. arXiv 2024, arXiv:2405.14852. [Google Scholar]
  20. Kuzman, T.; Ljubešić, N. LLM Teacher-Student Framework for Text Classification with No Manually Annotated Data: A Case Study in IPTC News Topic Classification. IEEE Access 2025, 13, 35621–35633. [Google Scholar] [CrossRef]
  21. Dasgupta, A.; Lamba, P.; Kushwaha, A.; Ravish, K.; Katyan, S.; Das, S.; Kumar, P. Review of Extreme Multilabel Classification. arXiv 2023, arXiv:2302.05971. [Google Scholar] [CrossRef]
  22. Wang, Y.S.; Chang, W.C.; Jiang, J.Y.; Zhang, J.; Yu, H.F.; Vishwanathan, S.V.N. Retrieval-augmented Encoders for Extreme Multi-label Text Classification. arXiv 2025, arXiv:2502.10615. [Google Scholar]
  23. Dai, X.; Chalkidis, I.; Darkner, S.; Elliott, D. Revisiting Transformer-based Models for Long Document Classification. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 7212–7230. [Google Scholar] [CrossRef]
  24. Duan, L.; You, Q.; Wu, X.; Sun, J. Multilabel Text Classification Algorithm Based on Fusion of Two-Stream Transformer. Electronics 2022, 11, 2138. [Google Scholar] [CrossRef]
  25. Liu, M.; Liu, L.; Cao, J.; Du, Q. Co-attention network with label embedding for text classification. Neurocomputing 2022, 471, 61–69. [Google Scholar] [CrossRef]
  26. Yarullin, R.; Serdyukov, P. BERT for Sequence-to-Sequence Multi-label Text Classification. In Analysis of Images, Social Networks and Texts; van der Aalst, W.M.P., Batagelj, V., Ignatov, D.I., Khachay, M., Koltsova, O., Kutuzov, A., Kuznetsov, S.O., Lomazova, I.A., Loukachevitch, N., Napoli, A., et al., Eds.; Springer: Cham, Switzerland, 2021; pp. 187–198. [Google Scholar] [CrossRef]
  27. Fallah, H.; Bruno, E.; Bellot, P.; Murisasco, E. Exploiting Label Dependencies for Multi-Label Document Classification Using Transformers. In Proceedings of the ACM Symposium on Document Engineering 2023, DocEng ’23, Limerick, Ireland, 22–25 August 2023. [Google Scholar] [CrossRef]
  28. Li, B.; Chen, Y.; Zeng, L. Kenet:Knowledge-Enhanced DOC-Label Attention Network for Multi-Label Text Classification. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 11961–11965. [Google Scholar] [CrossRef]
  29. Sennrich, R.; Haddow, B.; Birch, A. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; pp. 1715–1725. [Google Scholar] [CrossRef]
  30. Kudo, T.; Richardson, J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, 31 October–4 November 2018; pp. 66–71. [Google Scholar] [CrossRef]
  31. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  32. Park, H.; Vyas, Y.; Shah, K. Efficient Classification of Long Documents Using Transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Dublin, Ireland, 22–27 May 2022; pp. 702–709. [Google Scholar] [CrossRef]
  33. Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The Long-Document Transformer. arXiv 2020, arXiv:2004.05150. [Google Scholar] [CrossRef]
  34. Zaheer, M.; Guruganesh, G.; Dubey, A.; Ainslie, J.; Alberti, C.; Ontanon, S.; Pham, P.; Ravula, A.; Wang, Q.; Yang, L.; et al. Big Bird: Transformers for Longer Sequences. arXiv 2020, arXiv:2007.14062. [Google Scholar]
  35. Ding, M.; Zhou, C.; Yang, H.; Tang, J. CogLTX: Applying BERT to Long Texts. In Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, Virtual, 6–12 December 2020. [Google Scholar]
  36. Pappagari, R.; Zelasko, P.; Villalba, J.; Carmiel, Y.; Dehak, N. Hierarchical Transformers for Long Document Classification. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 14–18 December 2019; pp. 838–844. [Google Scholar] [CrossRef]
  37. Jaiswal, A.; Milios, E. Breaking the Token Barrier: Chunking and Convolution for Efficient Long Text Classification with BERT. arXiv 2023, arXiv:2310.20558. [Google Scholar] [CrossRef]
  38. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  39. Lin, T.; Goyal, P.; Girshick, R.B.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, 22–29 October 2017; pp. 2999–3007. [Google Scholar] [CrossRef]
  40. Cui, Y.; Jia, M.; Lin, T.; Song, Y.; Belongie, S.J. Class-Balanced Loss Based on Effective Number of Samples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; pp. 9268–9277. [Google Scholar] [CrossRef]
  41. Wu, T.; Huang, Q.; Liu, Z.; Wang, Y.; Lin, D. Distribution-Balanced Loss for Multi-label Classification in Long-Tailed Datasets. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 162–178. [Google Scholar]
  42. Huang, Y.; Giledereli, B.; Köksal, A.; Özgür, A.; Ozkirimli, E. Balancing Methods for Multi-label Text Classification with Long-Tailed Class Distribution. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 8153–8161. [Google Scholar] [CrossRef]
  43. Piskorski, J.; Stefanovitch, N.; Da San Martino, G.; Nakov, P. SemEval-2023 task 3: Detecting the category, the framing, and the persuasion techniques in online news in a multi-lingual setup. In Proceedings of the 17th International Workshop on Semantic Evaluation, SemEval’23, Toronto, Canada, 13–14 July 2023. [Google Scholar]
  44. Liao, Q.; Lai, M.; Nakov, P. MarsEclipse at SemEval-2023 Task 3: Multi-Lingual and Multi-Label Framing Detection with Contrastive Learning. arXiv 2023, arXiv:2304.14339. [Google Scholar]
  45. Reiter-Haas, M.; Ertl, A.; Innerhofer, K.; Lex, E. mCPT at SemEval-2023 Task 3: Multilingual Label-Aware Contrastive Pre-Training of Transformers for Few- and Zero-shot Framing Detection. arXiv 2023, arXiv:2303.09901. [Google Scholar]
  46. Tunstall, L.; Reimers, N.; Jo, U.E.S.; Bates, L.; Korat, D.; Wasserblat, M.; Pereg, O. Efficient Few-Shot Learning Without Prompts. arXiv 2022, arXiv:2209.11055. [Google Scholar] [CrossRef]
  47. Zheng, L.; Xiong, J.; Zhu, Y.; He, J. Contrastive Learning with Complex Heterogeneity. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’22, Washington, DC, USA, 14–18 August 2022; pp. 2594–2604. [Google Scholar] [CrossRef]
  48. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3982–3992. [Google Scholar] [CrossRef]
  49. Zhang, X.; Zhang, Y.; Long, D.; Xie, W.; Dai, Z.; Tang, J.; Lin, H.; Yang, B.; Xie, P.; Huang, F.; et al. mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, Miami, FL, USA, 12–16 November 2024; pp. 1393–1412. [Google Scholar]
  50. Sturua, S.; Mohr, I.; Akram, M.K.; Günther, M.; Wang, B.; Krimmel, M.; Wang, F.; Mastrapas, G.; Koukounas, A.; Koukounas, A.; et al. jina-embeddings-v3: Multilingual Embeddings with Task LoRA. arXiv 2024, arXiv:2409.10173. [Google Scholar]
  51. Chen, J.; Xiao, S.; Zhang, P.; Luo, K.; Lian, D.; Liu, Z. M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 14–16 August 2024; pp. 2318–2335. [Google Scholar] [CrossRef]
  52. Dahiya, K.; Gupta, N.; Saini, D.; Soni, A.; Wang, Y.; Dave, K.; Jiao, J.; K, G.; Dey, P.; Singh, A.; et al. NGAME: Negative Mining-aware Mini-batching for Extreme Classification. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, WSDM ’23, Singapore, 27 February–3 March 2023; pp. 258–266. [Google Scholar] [CrossRef]
  53. You, R.; Zhang, Z.; Wang, Z.; Dai, S.; Mamitsuka, H.; Zhu, S. AttentionXML: Label Tree-based Attention-Aware Deep Model for High-Performance Extreme Multi-Label Text Classification. In Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, Vancouver, BC, Canada, 8–14 December 2019; pp. 5812–5822. [Google Scholar]
  54. Chang, W.; Yu, H.; Zhong, K.; Yang, Y.; Dhillon, I.S. Taming Pretrained Transformers for Extreme Multi-label Text Classification. In Proceedings of the KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual, 23–27 August 2020; pp. 3163–3171. [Google Scholar]
  55. Jiang, T.; Wang, D.; Sun, L.; Yang, H.; Zhao, Z.; Zhuang, F. LightXML: Transformer with Dynamic Negative Sampling for High-Performance Extreme Multi-label Text Classification. arXiv 2021, arXiv:2101.03305. [Google Scholar] [CrossRef]
  56. Zhang, R.; Wang, Y.S.; Yang, Y.; Yu, D.; Vu, T.; Lei, L. Long-tailed Extreme Multi-label Text Classification by the Retrieval of Generated Pseudo Label Descriptions. In Proceedings of the Findings of the Association for Computational Linguistics: EACL 2023, Dubrovnik, Croatia, 2–6 May 2023; pp. 1092–1106. [Google Scholar] [CrossRef]
  57. van den Oord, A.; Li, Y.; Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
  58. Gupta, N.; Khatri, D.; Rawat, A.S.; Bhojanapalli, S.; Jain, P.; Dhillon, I. Dual-encoders for Extreme Multi-label Classification. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  59. Magueresse, A.; Carles, V.; Heetderks, E. Low-resource Languages: A Review of Past Work and Future Challenges. arXiv 2020, arXiv:2006.07264. [Google Scholar] [CrossRef]
  60. Pakray, P.; Gelbukh, A.; Bandyopadhyay, S. Natural language processing applications for low-resource languages. Nat. Lang. Process. 2025, 31, 183–197. [Google Scholar] [CrossRef]
  61. Chalkidis, I.; Fergadiotis, E.; Malakasiotis, P.; Androutsopoulos, I. Large-Scale Multi-Label Text Classification on EU Legislation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 6314–6322. [Google Scholar] [CrossRef]
  62. Xiong, L.; Xiong, C.; Li, Y.; Tang, K.F.; Liu, J.; Bennett, P.N.; Ahmed, J.; Overwijk, A. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
  63. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, 4–8 August 2019; Teredesai, A., Kumar, V., Li, Y., Rosales, R., Terzi, E., Karypis, G., Eds.; ACM: New York, NY, USA, 2019; pp. 2623–2631. [Google Scholar] [CrossRef]
  64. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  65. Cox, D.R. The Regression Analysis of Binary Sequences (with Discussion). J. R. Stat. Soc. B 1958, 20, 215–242. [Google Scholar] [CrossRef]
  66. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  67. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8440–8451. [Google Scholar] [CrossRef]
  68. Zhang, M.L.; Zhou, Z.H. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognit. 2007, 40, 2038–2048. [Google Scholar] [CrossRef]
  69. Johnson, J.; Douze, M.; Jégou, H. Billion-scale similarity search with GPUs. IEEE Trans. Big Data 2019, 7, 535–547. [Google Scholar] [CrossRef]
  70. Douze, M.; Guzhva, A.; Deng, C.; Johnson, J.; Szilvasy, G.; Mazaré, P.E.; Lomeli, M.; Hosseini, L.; Jégou, H. The Faiss library. arXiv 2024, arXiv:2401.08281. [Google Scholar] [CrossRef]
  71. Lu, Z.; Li, X.; Cai, D.; Yi, R.; Liu, F.; Zhang, X.; Lane, N.D.; Xu, M. Small Language Models: Survey, Measurements, and Insights. arXiv 2024, arXiv:2409.15790. [Google Scholar] [CrossRef]
  72. Enevoldsen, K.; Chung, I.; Kerboua, I.; Kardos, M.; Mathur, A.; Stap, D.; Gala, J.; Siblini, W.; Krzemiński, D.; Winata, G.I.; et al. MMTEB: Massive Multilingual Text Embedding Benchmark. arXiv 2025, arXiv:2502.13595. [Google Scholar] [CrossRef]
  73. Jang, S.; Morabito, R. Edge-First Language Model Inference: Models, Metrics, and Tradeoffs. arXiv 2025, arXiv:2505.16508. [Google Scholar] [CrossRef]
  74. Bucher, M.J.J.; Martini, M. Fine-Tuned ’Small’ LLMs (Still) Significantly Outperform Zero-Shot Generative AI Models in Text Classification. arXiv 2024, arXiv:2406.08660. [Google Scholar]
  75. Galke, L.; Scherp, A.; Diera, A.; Karl, F.; Lin, B.X.; Khera, B.; Meuser, T.; Singhal, T. Are We Really Making Much Progress in Text Classification? A Comparative Review. arXiv 2022, arXiv:2204.03954. [Google Scholar]
  76. Chang, T.A.; Arnett, C.; Tu, Z.; Bergen, B.K. When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages. arXiv 2023, arXiv:2311.09205. [Google Scholar] [CrossRef]
  77. Gupta, V.; Chowdhury, S.P.; Zouhar, V.; Rooein, D.; Sachan, M. Multilingual Performance Biases of Large Language Models in Education. arXiv 2025, arXiv:2504.17720. [Google Scholar] [CrossRef]
  78. Tarekegn, A.N.; Giacobini, M.; Michalak, K. A review of methods for imbalanced multi-label classification. Pattern Recognit. 2021, 118, 107965. [Google Scholar] [CrossRef]
  79. Bernardini, F.C.; da Silva, R.B.; Rodovalho, R.M.; Meza, E.B.M. Cardinality and Density Measures and Their Influence to Multi-Label Learning Methods. Learn. Nonlinear Model. 2014, 12, 53–71. [Google Scholar] [CrossRef]
  80. Raschka, S.; Patterson, J.; Nolet, C. Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence. arXiv 2020, arXiv:2002.04803. [Google Scholar] [CrossRef]
  81. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 38–45. [Google Scholar] [CrossRef]
Figure 1. Subword token distribution across all samples in the NewsMon dataset.
Figure 2. The distribution of languages in all samples (Serbian, Slovene, Bosnian, Macedonian, Croatian, English, Albanian, Russian and Hungarian).
Figure 3. The distribution of selected industry sector labels across all samples.
Figure 4. Decision flow of the news monitoring acquisition processing pipeline. Each stage applies progressive filtering, with final human moderation to remove irrelevant data before output.
Figure 5. The distribution of labels per language (Serbian, Slovene, Bosnian, Macedonian, Croatian, English, Albanian, Russian and Hungarian).
Figure 6. The distribution of label occurrences in the NewsMonsl dataset.
Figure 7. The distribution of label occurrences in the EURLEX57K dataset.
Figure 8. UMAP visualization of the NewsMonsl dataset.
Figure 9. UMAP visualization of the EURLEX57K dataset.
Figure 10. The proposed methodology based on RAE-XMC (adapted from [22]).
Figure 11. Latency per sample for selected batch sizes.
Table 1. Examples of simplified Keyword-Boolean Expressions.

Keyword expression                                    Explanation
tesla OR teslo OR tesli OR tesle –"nikol* tesl*"      Match the Tesla car brand only if the "Nikola Tesla" person-phrase is not present in the text.
mercedes* –formul* –"max verstap*" –F1 …              Match the Mercedes car brand if Formula 1 terms and phrases are absent.
nlb* –"lig* nlb" –"nlb lig*"                          Match all forms of NLB terms, but only if the NLB-sponsored league phrases are not present (NLB is an acronym for the company Nova Ljubljanska Banka).
gradnj* –cest* –avtocest* –obvoznic*                  Match the term construction without the terms road, highway, or ring road.
"сенад* сoфтић*" OR "senad* softić*"                  Match the person-phrase with various inflections and in various scripts.
Note: The operator OR denotes a logical disjunction (any listed term). The minus sign (–) indicates negation (exclusion). Quotation marks (“ ”) define an exact phrase, and the asterisk (*) functions as a wildcard for variable character/sub-string endings.
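To make the semantics of Table 1 concrete, the sketch below implements a deliberately simplified matcher: wildcards are mapped to word-suffix regular expressions, and a document matches only if at least one positive term or phrase occurs and no negated one does. This is a didactic approximation, not the production matching engine, and the example text is invented.
```python
# Simplified illustration of the keyword-Boolean semantics in Table 1.
import re

def term_to_regex(term: str) -> re.Pattern:
    # '"nikol* tesl*"' -> phrase with wildcard suffixes; 'gradnj*' -> single wildcard word
    words = [re.escape(w).replace(r"\*", r"\w*") for w in term.strip('"').split()]
    return re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)

def matches(text: str, positives: list[str], negatives: list[str]) -> bool:
    pos_hit = any(term_to_regex(t).search(text) for t in positives)
    neg_hit = any(term_to_regex(t).search(text) for t in negatives)
    return pos_hit and not neg_hit

text = "Nova tovarna Tesle bo zgrajena v Evropi."
print(matches(text, ["tesla", "teslo", "tesli", "tesle"], ['"nikol* tesl*"']))  # True
```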
Table 2. Dataset statistics for the complete set (NewsMon), selected initial set of labels (NewsMoninitial), the down-sampled dataset in Slovene (NewsMonsl+dups) with all labels, and the down-sampled dataset with single-occurrence labels and duplicates removed (NewsMonsl).

Dataset           Samples      Labels   Cardinality     Density   Diversity
NewsMon           1,068,261    12,305   3.392 ± 4.017   0.00028   267,564
NewsMoninitial    1,068,261    1960     2.213 ± 2.316   0.00114   183,361
NewsMonsl+dups    62,049       4052     3.034 ± 3.492   0.00075   17,307
NewsMonsl         50,784       3231     2.995 ± 3.463   0.00093   15,809
EURLEX57K         57,000       4271     5.069 ± 1.701   0.00119   34,982
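The statistics in Table 2 follow the usual multi-label definitions; as a sanity check, cardinality divided by the number of labels reproduces the reported density (e.g., 3.392 / 12,305 ≈ 0.00028 for NewsMon). A small sketch under these assumed definitions:
```python
# Cardinality = mean (and std) of labels per sample, density = cardinality / |labels|,
# diversity = number of distinct label sets.
import numpy as np

def dataset_stats(label_sets: list, num_labels: int) -> dict:
    counts = np.array([len(s) for s in label_sets])
    cardinality = counts.mean()
    return {
        "samples": len(label_sets),
        "labels": num_labels,
        "cardinality": (round(float(cardinality), 3), round(float(counts.std()), 3)),
        "density": cardinality / num_labels,
        "diversity": len(set(label_sets)),
    }

toy = [frozenset({1, 2}), frozenset({2}), frozenset({1, 2})]  # toy label sets
print(dataset_stats(toy, num_labels=5))
```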
Table 3. Best multilingual IR model specifications including supported token count, embedding dimension, parameter size, memory usage, and MTEB ranking metrics for multilingual (ML) and English (EN) tasks.

Model     Tokens   Dim    Params [M]   Mem [MB]   ML   EN
GTE-mb    8192     768    305          582        6    31
Jina-v3   8194     1024   572          1092       8    7
BGE-M3    8194     1024   568          2167       5    89
Table 4. Zero-shot performance on NewsMonsl: micro/macro F1, precision, recall, and subset accuracy across all, frequent, and rare labels for GTE-mb, Jina-v3, and BGE-M3.

Model     Frequency   μF1      μP       μR       F1       P        R        Acc
GTE-mb    All         57.44    55.21    59.86    27.02    27.38    29.47    40.02
Jina-v3   All         58.55    56.27    61.03    27.58    27.87    30.11    40.94
BGE-M3    All         59.04    55.84    62.62    28.28    28.12    31.18    42.58
GTE-mb    Frequent    73.99    76.30    71.81    70.28    74.18    67.86    62.79
Jina-v3   Frequent    74.93    77.23    72.78    71.20    74.61    69.00    64.51
BGE-M3    Frequent    76.41    78.41    74.51    72.63    75.67    70.63    65.77
GTE-mb    Rare        50.59    72.85    38.75    13.89    14.75    13.54    30.75
Jina-v3   Rare        50.68    72.24    39.03    13.90    14.71    13.73    30.54
BGE-M3    Rare        52.59    71.70    41.53    14.72    15.52    14.45    33.12
Note: Best results are shown in bold.
Table 5. Zero-shot performance on EURLEX57K: micro/macro F1, precision, recall, and subset accuracy across all, frequent, and rare labels for GTE-mb, Jina-v3, and BGE-M3.

Model     Frequency   μF1      μP       μR       F1       P        R        Acc
GTE-mb    All         45.32    45.47    45.17    14.73    15.92    15.40    12.39
Jina-v3   All         60.67    60.75    60.58    22.12    23.44    23.01    18.89
BGE-M3    All         66.52    66.67    66.37    25.65    26.74    26.76    23.12
GTE-mb    Frequent    57.00    57.71    56.30    52.95    54.20    52.19    24.14
Jina-v3   Frequent    72.45    72.76    72.14    68.95    69.30    68.86    35.65
BGE-M3    Frequent    77.45    78.00    76.91    74.51    75.34    73.97    40.68
GTE-mb    Rare        18.29    38.41    12.00    3.71     4.30     3.49     9.52
Jina-v3   Rare        27.25    41.32    20.32    6.08     6.68     6.02     15.63
BGE-M3    Rare        34.14    47.32    26.70    8.10     8.81     8.00     20.60
Note: Best results are shown in bold.
Table 6. Observed memory usage during inference on a single consumer-grade GPU. Values are means across 10 trials per batch; standard deviations were negligible.

Metric                                          Value
Model steady-state GPU memory                   1147.4 MB
Peak GPU memory @ batch size = 2                2645.0 MB
Max allocated GPU memory @ batch size = 384     7578.7 MB
Peak host process memory                        2940.6 MB
Knowledge memory (host)                         748.5 MB
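These values were obtained by instrumenting the inference loop; a sketch of the kind of instrumentation that reproduces such figures is shown below (the exact measurement code in our pipeline may differ).
```python
# Reporting GPU and host memory during inference with PyTorch and psutil.
import torch
import psutil

torch.cuda.reset_peak_memory_stats()
# ... load the encoder here, then run inference for a given batch size ...
steady_mb = torch.cuda.memory_allocated() / 2**20     # model steady-state GPU memory
peak_mb = torch.cuda.max_memory_allocated() / 2**20   # peak GPU memory for this batch size
host_mb = psutil.Process().memory_info().rss / 2**20  # host process memory (sampled)
print(f"steady {steady_mb:.1f} MB, peak {peak_mb:.1f} MB, host {host_mb:.1f} MB")
```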
Table 7. NewsMonsl (all labels): micro/macro F1, P, R, and subset accuracy for SVM and logistic regression (OvA TF-IDF), XLM-RoBERTa baseline, BGE-M3, and fine-tuned BGE-M3 under zero-shot, ML-KNN, and RAE-XMC.

Model    Method      μF1      μP       μR       F1       P        R        Acc
SVM      OVA-TFIDF   68.13    80.26    59.19    28.41    34.00    26.26    44.88
LogReg   OVA-TFIDF   67.47    83.34    56.68    26.76    33.39    24.21    45.22
XLMR     baseline    53.50    85.21    39.17    4.70     6.81     3.99     38.02
BGE-M3   zshot       59.04    55.84    62.62    28.28    28.12    31.18    42.58
BGE-M3   ML-KNN      42.94    77.94    29.63    5.92     9.77     4.78     29.63
BGE-M3   RAE-XMC     65.58    76.11    57.61    28.18    32.61    26.88    44.49
FT-BGE   zshot       68.93    67.20    70.75    31.21    31.71    33.47    51.38
FT-BGE   ML-KNN      62.18    86.48    48.54    9.30     13.47    7.89     46.67
FT-BGE   RAE-XMC     73.67    86.07    64.39    29.24    34.64    27.27    56.12
Note: Best results are shown in bold, and second-best results are underlined.
Table 8. EURLEX57K (all labels): micro/macro F1, P, R, and subset accuracy for SVM and logistic regression (OvA TF-IDF), XLM-RoBERTa baseline, BGE-M3, and fine-tuned BGE-M3 under zero-shot, ML-KNN, and RAE-XMC.

Model    Method      μF1      μP       μR       F1       P        R        Acc
SVM      OVA-TFIDF   73.21    82.92    65.54    25.65    30.98    23.58    25.58
LogReg   OVA-TFIDF   71.61    82.07    63.51    23.84    29.38    21.63    22.34
XLMR     baseline    75.98    93.18    64.14    13.95    18.83    12.19    24.25
BGE-M3   zshot       66.52    66.67    66.37    25.65    26.74    26.76    23.12
BGE-M3   ML-KNN      57.68    81.49    44.63    9.63     14.50    8.03     11.77
BGE-M3   RAE-XMC     69.72    74.63    65.41    25.73    28.69    25.11    23.47
FT-BGE   zshot       68.37    68.40    68.33    26.81    27.95    28.00    25.58
FT-BGE   ML-KNN      59.96    82.54    47.08    10.55    15.51    8.92     12.84
FT-BGE   RAE-XMC     69.99    72.35    67.77    27.12    29.42    27.25    25.91
Note: Best results are shown in bold, and second-best results are underlined.
Table 9. NewsMonsl (frequent labels): micro/macro F1, P, R, and subset accuracy for SVM and logistic regression (OvA TF-IDF), XLM-RoBERTa baseline, BGE-M3, and fine-tuned BGE-M3 under zero-shot, ML-KNN, and RAE-XMC.

Model    Method      μF1      μP       μR       F1       P        R        Acc
SVM      OVA-TFIDF   81.40    88.43    75.40    77.41    86.29    70.87    67.35
LogReg   OVA-TFIDF   81.55    90.54    74.18    77.23    88.48    69.34    67.79
XLMR     baseline    82.65    90.24    76.29    77.24    86.99    70.98    70.26
BGE-M3   zshot       76.41    78.41    74.51    72.63    75.67    70.63    65.77
BGE-M3   ML-KNN      66.15    85.25    54.05    56.36    81.71    45.97    50.18
BGE-M3   RAE-XMC     78.96    87.44    71.97    74.20    85.08    67.13    66.24
FT-BGE   zshot       86.71    87.39    86.05    83.42    84.89    82.60    78.09
FT-BGE   ML-KNN      86.22    91.56    81.47    81.62    89.18    76.55    76.36
FT-BGE   RAE-XMC     88.31    93.59    83.59    84.51    91.35    79.40    79.64
Note: Best results are shown in bold, and second-best results are underlined.
Table 10. EURLEX57K (frequent labels): micro/macro F1, P, R, and subset accuracy for SVM and logistic regression (OvA TF-IDF), XLM-RoBERTa baseline, BGE-M3, and fine-tuned BGE-M3 under zero-shot, ML-KNN, and RAE-XMC.

Model    Method      μF1      μP       μR       F1       P        R        Acc
SVM      OVA-TFIDF   82.26    87.24    77.82    79.74    85.40    75.62    46.56
LogReg   OVA-TFIDF   80.63    85.81    76.05    77.88    83.99    73.33    43.10
XLMR     baseline    90.18    94.15    86.54    87.56    93.39    83.95    62.79
BGE-M3   zshot       77.45    78.00    76.91    74.51    75.34    73.97    40.68
BGE-M3   ML-KNN      72.46    84.55    63.39    65.64    80.44    58.57    29.04
BGE-M3   RAE-XMC     80.02    83.06    77.20    76.87    80.46    74.11    43.30
FT-BGE   zshot       79.02    79.53    78.52    76.35    76.99    75.93    43.59
FT-BGE   ML-KNN      74.29    85.31    65.79    68.49    81.49    61.66    31.65
FT-BGE   RAE-XMC     80.05    81.71    78.46    77.40    79.32    75.85    44.86
Note: Best results are shown in bold, and second-best results are underlined.
Table 11. NewsMonsl (rare labels): micro/macro F1, P, R, and subset accuracy for SVM and logistic regression (OvA TF-IDF), XLM-RoBERTa baseline, BGE-M3, and fine-tuned BGE-M3 under zero-shot, ML-KNN, and RAE-XMC.

Model    Method      μF1      μP       μR       F1       P        R        Acc
SVM      OVA-TFIDF   49.12    90.15    33.76    11.81    12.65    11.42    27.51
LogReg   OVA-TFIDF   45.06    92.51    29.79    10.55    11.32    10.18    24.24
XLMR     baseline    0.00     0.00     0.00     0.00     0.00     0.00     0.00
BGE-M3   zshot       52.59    71.70    41.53    14.72    15.52    14.45    33.12
BGE-M3   ML-KNN      0.00     0.00     0.00     0.06     0.06     0.06     0.00
BGE-M3   RAE-XMC     50.20    90.58    34.72    12.53    13.45    12.09    27.96
FT-BGE   zshot       55.00    73.66    43.89    15.06    15.64    15.15    34.84
FT-BGE   ML-KNN      0.00     0.00     0.00     0.06     0.06     0.06     0.00
FT-BGE   RAE-XMC     48.30    92.89    32.64    11.93    12.88    11.46    26.02
Note: Best results are shown in bold, and second-best results are underlined.
Table 12. EURLEX57K (rare labels): micro/macro F1, P, R, and subset accuracy for SVM and logistic regression (OvA TF-IDF), XLM-RoBERTa baseline, BGE-M3, and fine-tuned BGE-M3 under zero-shot, ML-KNN, and RAE-XMC.

Model    Method      μF1      μP       μR       F1       P        R        Acc
SVM      OVA-TFIDF   26.57    72.19    16.28    4.95     5.43     4.78     14.07
LogReg   OVA-TFIDF   30.19    78.28    18.70    5.57     6.15     5.33     18.76
XLMR     baseline    0.52     0.00     0.26     0.05     0.08     0.04     0.28
BGE-M3   zshot       34.14    47.32    26.70    8.10     8.81     8.00     20.60
BGE-M3   ML-KNN      0.00     0.00     0.00     0.00     0.00     0.00     0.00
BGE-M3   RAE-XMC     33.38    57.87    23.46    7.27     7.94     7.07     18.04
FT-BGE   zshot       36.21    50.19    28.32    8.40     9.19     8.24     22.59
FT-BGE   ML-KNN      0.00     0.00     0.00     0.00     0.00     0.00     0.00
FT-BGE   RAE-XMC     36.48    57.58    26.70    7.97     8.81     7.78     21.88
Note: Best results are shown in bold, and second-best results are underlined.
Table 13. Factor attribution (absolute %-point deltas) for the effects of hard-negative fine-tuning (HN), RAE-XMC retrieval, and their interaction across datasets and label subsets.

Dataset/Split            ΔμF1 HN   ΔμF1 RAE   ΔμF1 Int.   ΔμF1 Total   ΔAcc HN   ΔAcc RAE   ΔAcc Int.   ΔAcc Total
NewsMonsl (All)          +9.89     +6.54      −1.80       +14.63       +8.80     +1.91      +2.83       +13.54
NewsMonsl (Frequent)     +10.30    +2.55      −0.95       +11.90       +12.32    +0.47      +1.08       +13.87
NewsMonsl (Rare)         +2.41     −2.39      −4.31       −4.29        +1.72     −5.16      −3.66       −7.10
EURLEX57K (All)          +1.85     +3.20      −1.58       +3.47        +2.46     +0.35      −0.02       +2.79
EURLEX57K (Frequent)     +1.57     +2.58      −1.55       +2.60        +2.91     +2.63      −1.35       +4.18
EURLEX57K (Rare)         +2.07     −0.75      +1.03       +2.34        +1.99     −2.56      +1.85       +1.28
Note: “HN” = hard-negative fine-tuning effect (FT-BGE zero-shot – BGE-M3 zero-shot); “RAE” = RAE-XMC retrieval/predictor effect (BGE-M3 RAE – BGE-M3 zero-shot); “Int.” = interaction term explaining residual; “Total” = overall delta from BGE-M3 zero-shot → FT-BGE RAE-XMC. Values derived from Table 7, Table 8, Table 9, Table 10, Table 11 and Table 12.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
