Extreme Multi-Label Text Classification for Less-Represented Languages and Low-Resource Environments: Advances and Lessons Learned
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Paper Summary: This paper tackles extreme multi-label text classification (XMC) for less-represented languages and low-resource deployments, motivated by real-world media monitoring needs (low latency, dynamic labels, security). The authors curate a large multilingual industry dataset (NewsMon) and work with a down-sampled Slovene subset (NewsMonsl) to accelerate experimentation. NewsMonsl contains 50,784 samples and 3,231 labels with a pronounced long-tail distribution; preparation removes near-duplicates (~18%) and uses Jan–May 2023 data with an 80/10/10 split. They compare against EURLEX57K to assess transfer to a high-resource English legal corpus.
Experiments benchmark classic OVA TF-IDF, XLM-RoBERTa, and retrieval models (BGE-M3; fine-tuned FT-BGE) under zero-shot, ML-KNN, and RAE-XMC settings across all / frequent / rare labels. On NewsMonsl (all labels), FT-BGE + RAE-XMC obtains the top metrics (e.g., µF1 73.67, subset accuracy 56.12), and non-fine-tuned retrieval also beats the XLM-R baseline on most metrics. On EURLEX57K (all labels), micro-F1 and precision are below the XLM-R baseline, though macro-F1 and subset accuracy improve with the retrieval setup; retrieval shines particularly for rare labels.
Strength: The paper focuses on Slovene media monitoring and low-resource constraints with practical deployment considerations (latency, hardware). The paper is well-written and well-organized. There are no English mistakes in this paper.
Weakness: The related work cites LightXML and dual-encoder XMC, but these are not included as baselines; adding a modern dual-encoder or XR-Transformer-style line would sharpen claims.
Conclusion: As a researcher from a machine learning background, these are all the suggestions and comments I can provide. I think my review may be limited. In general, I recommend accepting this paper after minor revision. Please add some baseline models for comparison, as I pointed out in the weakness.
Author Response
Thank you for the careful review, the constructive suggestion to add modern XMC baselines, and the recommendation to accept with minor revision. We address your main point below and list the concrete changes we have made to the manuscript.
Comment 1: Weakness: The related work cites LightXML and dual-encoder XMC, but these are not included as baselines; adding a modern dual-encoder or an XR-Transformer-style line would sharpen the claims.
Response 1:
- We fully agree and plan to include a state-of-the-art XMC line (e.g., LightXML, XR‑Transformer/dual‑encoders) in our next iteration. We discuss these families and their promise for tail labels in our Related Work (X‑/LightXML, DEPL, and recent dual encoders) and in the Conclusions, where we explicitly point to adopting a true dual encoder next.
- Time constraint. We had five days during the rebuttal window to validate our LightXML implementation, initialised from an XLM-R transformer encoder, and to fine-tune the model end-to-end; we judged this too risky to report without proper validation.
- Why did we not add it in this round? On EURLEX57K, our retrieval setup did not surpass the XLM-R baseline on micro-F1/precision (e.g., µF1 69.99 vs. 75.98), although it improved macro-F1 and subset accuracy (e.g., subset accuracy 25.91 vs. 24.25). We therefore opted to retain, for now, the simplest Transformer encoder classifier as the baseline rather than introduce a partially validated LightXML line. See Section 6, Table 7 (p. 15).
- Design objective. One of our primary objectives is to prioritise methods that do not require frequent end-to-end fine-tuning, given the continual label changes in production monitoring. This motivates our focus on retrieval‑augmented classification, which remains stable under dynamic label spaces (Introduction, p. 2; Conclusions, p. 18).
Additional clarification: Revisions made to the manuscript:
- We have condensed and refocused the Abstract: the background and deployment context are now summarised in one sentence, with the remainder devoted to the problem setting, method, datasets, and headline findings (e.g., the superiority of retrieval-based XMC on NewsMonsl and its robustness for rare labels compared to baseline). The trimmed Abstract appears on p. 1:
Amid ongoing efforts to develop extremely large, multimodal models, there is increasing interest in efficient Small Language Models (SLMs) that can operate without reliance on large data-centre infrastructure. However, recent SLMs (e.g., LLaMA or Phi) with up to three billion parameters are predominantly trained in high-resource languages, such as English, which limits their applicability to industries that require robust NLP solutions for less-represented languages and low-resource settings, particularly those requiring low latency and adaptability to evolving label spaces. This paper examines a retrieval-based approach to multi-label text classification (MLC) for a media monitoring dataset, with a particular focus on less-represented languages, such as Slovene. This dataset presents an extreme MLC challenge, with instances labelled using up to twelve thousand categories. The proposed method, which combines retrieval with computationally efficient prediction, effectively addresses challenges related to multilinguality, resource constraints, and frequent label changes. We adopt a model-agnostic approach that does not rely on a specific model architecture or language selection. Our results demonstrate that techniques from the extreme multi-label text classification (XMC) domain outperform traditional Transformer-based encoder models, particularly in handling dynamic label spaces without requiring continuous fine-tuning. Additionally, we highlight the effectiveness of this approach in scenarios involving rare labels, where baseline models struggle with generalisation.
- Clearer motivation in the Introduction section (p. 2).
We rewrote the second paragraph to make the deployment constraints and our problem framing explicit. The paragraph now reads:
However, the operational reality of media monitoring—processing very large daily article volumes under strict latency and cost budgets—renders general-purpose LLMs ill-suited for sustained production use. High per-token inference costs, accelerator dependence, and third-party API constraints (including privacy, availability, and rate limits) conflict with industry requirements for low-latency, on-premises, and cost-predictable pipelines. These constraints are amplified for less-represented languages, where LLM coverage and training data are limited, and for extreme multi-label settings with frequent label changes, where continuous fine-tuning is infeasible. This motivates our focus on efficient, retrieval-based MLC that operates on consumer-grade hardware while maintaining robustness across multilingual and low-resourced contexts. (Introduction, p. 2).
- Streamlined related work citations (p. 2).
We rewrote the fourth paragraph of the Introduction to trim excessive citations, maintaining a concise and recent set that directly supports our claims about encoder vs. generative trends and efficiency trade-offs (p. 2).
- More detailed algorithmic description.
Section 4 (p. 12) now provides a more comprehensive presentation of our simplified RAE-XMC algorithm variant.
- Performance benchmarks added and organised.
We added Section 5.1: “Inference Efficiency (Latency & Memory)” with a per-sample latency measurement over batch sizes and a device/host memory audit on a single 16 GB consumer-grade GPU (Figure 11 and Table 6).
- Strengthened experimental analyses.
We added a new sub-section, Factor Attribution Analysis, to our Results section (p. 19), where we decompose which factors of our method contribute the most and how they interact.
- Additional South Slavic languages (data analysis).
We have added dataset analyses for Serbian and Macedonian in Appendix B.1 (p. 20), including statistics (Table B1) and label-occurrence distributions (Figure B1), to document the cross-language characteristics of the NewsMon collection.
- Preliminary cross-lingual results.
We included preliminary results for Serbian and Macedonian in the appendix to indicate feasibility; these currently lack the strongest baselines for a fair head-to-head comparison (Appendix B, p. 20). We note this limitation explicitly and frame it as follow-up work.
- More task-targeted conclusion.
We replaced the paragraph concerning language bias in the Conclusion section (p. 19):
Notably, the method is language-agnostic and thus applicable to underrepresented or low-resource languages such as Slovene and Serbian. However, we acknowledge that pre-trained language models inherently carry the perspectives of dominant linguistic and cultural groups present in pre-training corpora. Nevertheless, in our setting, where the primary use case involves similarity search within a given language's knowledge memory, cultural or social biases are primarily contained within that language. As a result, given the multilingual encoder model selection and retrieval-based approach, it is less likely for bias to amplify or transfer across languages. Unlike tasks such as sentiment analysis or hate speech detection, which are sensitive to sociocultural nuance, our method tends to preserve or even attenuate existing biases within the original language boundaries.
and the paragraph concerning cross-lingual capabilities in the Conclusion section (p. 19):
Furthermore, although we did not conduct experiments in a cross-lingual setting, we believe the proposed method could be effectively extended to such scenarios, assuming the retrieval model has strong cross-lingual capabilities and is fine-tuned with hard negatives drawn from datasets across all target languages, especially when languages from the same family are considered. However, this assumption may not hold for highly divergent low-resource languages like Sanskrit or Hindi. These languages can have very different grammatical structures, word orders, or morphological systems compared to high-resource languages (like English, French, or German), which are usually the primary focus of NLP models and datasets.
With the following two:
Notably, the proposed method is language-agnostic and thus applicable to less-represented or low-resource languages, such as Slovene and Serbian, as the underlying retrieval models were not specifically fine-tuned on these languages. Although we did not conduct cross-lingual experiments, we believe the approach can be effectively extended to cross-lingual settings, provided the retriever exhibits strong cross-lingual alignment and is fine-tuned with hard negatives drawn from datasets across all target languages; such transfer is likely to be especially effective for languages within the same family (see Appendix B for preliminary results).
Additionally, our information retrieval approach enhances scalability further through the use of knowledge memory embeddings per language and model. This compact embedding footprint enables the system to efficiently support numerous languages in parallel or even with a single model. Importantly, these knowledge memory embeddings can be offloaded to external retrieval systems when needed, allowing for flexible deployment on various hardware configurations and seamless integration with scalable, distributed retrieval infrastructures.
Reviewer 2 Report
Comments and Suggestions for Authors
The authors address a relevant problem in multilingual NLP, namely performing extreme multi-label text classification in low-resource and less-represented language contexts, particularly for applications such as media monitoring. The paper generally reads well, but it does have some limitations:
- despite the focus on less-represented languages (plural), the experiments are limited to Slovene, with only indirect claims about the method's generality to other Slavic languages. No cross-lingual or multilingual evaluations are presented.
- the fine-tuning settings for XLM-R are minimal and not optimized, which limits the comparison, particularly since RAE-XMC models undergo parameter optimization and hard-negative fine-tuning.
- the authors emphasize computational efficiency and low-latency suitability but do not report actual inference times, memory benchmarks, or throughput metrics.
- the conclusion section acknowledges that results are “inconclusive,” with gains observed only in Slovene but not in English (EURLEX57K). It remains unclear whether the observed improvements are due to the retrieval mechanism, fine-tuning with hard negatives, or dataset-specific artifacts.
Author Response
We sincerely thank the reviewer for the careful reading and constructive suggestions. Below, we address each point in turn and indicate concrete changes in the revised manuscript, with page- and section-level pointers.
Comment 1: despite the focus on less-represented languages (plural), the experiments are limited to Slovene, with only indirect claims about the method's generality to other Slavic languages. No cross-lingual or multilingual evaluations are presented.
Response 1: We agree that empirical evidence beyond the Slovene language strengthens the paper. In the revision, we added Appendix B with data analyses and preliminary results for Serbian and Macedonian, for which we were able to repeat the data filtering, resampling, and experiments within the limited rebuttal time:
- Appendix B.1 (pp. 21–22): dataset statistics and long‑tail analyses for Serbian and Macedonian (Table B1; Figure B1).
- Appendix B.2 (pp. 22–23): preliminary retrieval‑based results on Serbian (Table B2) and Macedonian (Table B3). As noted in the appendix, the XLM-R baseline is not yet included for these languages. The FT-BGE_sl retriever was transferred from Slovene and not fine-tuned on the target languages; therefore, these results should be interpreted as cross-lingual transfer results rather than language-specific optimisations.
These additions partially address the reviewer’s concern by demonstrating that our retrieval-first pipeline yields increased performance without target-language fine-tuning. We also clarify in the Data section that language-specific label spaces in our industrial setting limit direct cross-lingual application, which motivated per-language evaluations and careful interpretation of transfer (p. 7, Section 3.1).
Comment 2: the fine-tuning settings for XLM-R are minimal and not optimized, which limits the comparison, particularly since RAE-XMC models undergo parameter optimization and hard-negative fine-tuning.
Response 2: We aimed to reflect realistic low-resource constraints and maintain a simple, reproducible encoder baseline. We therefore did not perform hyperparameter optimisation for XLM‑R and trained it with a standard configuration.
Despite the minimal XLM-R tuning, our baseline on EURLEX57K achieves a µF1 of 75.98, which exceeds the 73.2 µF1 reported by the original authors and serves as a competitive reference for our retrieval methods (Tables 7–8; Section 6).
We agree that a stronger encoder baseline (with tuned hyperparameters or a modern XMC encoder) would further sharpen the conclusions; we flag this as planned work. However, for this submission, we keep the baseline deliberately simple to align with the deployment constraints emphasised in the paper.
Comment 3: the authors emphasize computational efficiency and low-latency suitability but do not report actual inference times, memory benchmarks, or throughput metrics.
Response 3: We added Section 5.1 (p. 15): “Inference Efficiency (Latency & Memory)” with a per-sample latency measurement over batch sizes and a device/host memory audit on a single 16 GB consumer-grade GeForce RTX 4070 Ti Super GPU (added Figure 11 and Table 6):
- Latencies (per‑sample): from ~124 ms at batch 2 to ~46–48 ms for large batches, measured over 10 trials per batch (Figure 11).
- Memory: steady‑state GPU 1,147.4 MB; peak 2,645.0 MB at batch 2; max allocated CUDA 7,578.7 MB at batch 384; host process 2,940.6 MB; knowledge memory (host‑resident index) 748.5 MB (Table 6).
We also report that latency and memory exhibit negligibly small standard deviations, indicating stable and repeatable behaviour (p. 15). These numbers, together with the per‑sample latencies, enable direct conversion to throughput if desired.
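For completeness of this response, a minimal sketch of how such per-sample latency and peak-memory figures can be collected with PyTorch is shown below; the `encode` callable, batch contents, and trial count are illustrative assumptions, not the exact benchmarking script behind Figure 11 and Table 6.

```python
import time
import torch

def benchmark_batch(encode, batch, device="cuda", trials=10):
    """Return mean per-sample latency (ms) and peak GPU memory (MB) for one batch.

    `encode` is any callable that embeds a list of texts on the GPU; it stands in
    for the actual retrieval encoder and is an assumption of this sketch."""
    torch.cuda.reset_peak_memory_stats(device)
    per_sample_ms = []
    for _ in range(trials):
        torch.cuda.synchronize(device)   # exclude previously queued work from timing
        start = time.perf_counter()
        with torch.no_grad():
            encode(batch)
        torch.cuda.synchronize(device)   # wait until the forward pass has finished
        per_sample_ms.append((time.perf_counter() - start) * 1000 / len(batch))
    peak_mb = torch.cuda.max_memory_allocated(device) / 2**20
    return sum(per_sample_ms) / trials, peak_mb
```

Sweeping such a helper over increasing batch sizes yields the kind of latency-versus-batch-size curve reported in Figure 11, with peak memory read from the same run for Table 6.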
Comment 4: the conclusion section acknowledges that results are “inconclusive,” with gains observed only in Slovene but not in English (EURLEX57K). It remains unclear whether the observed improvements are due to the retrieval mechanism, fine-tuning with hard negatives, or dataset-specific artifacts.
Response 4: We agree that attribution must be explicit, and the revised manuscript now separates the effects of:
- hard-negative (HN) fine-tuning of the retriever and
- the RAE-XMC retrieval/predictor.
Concretely, the new Section 6.2 (pp. 18–19) adds a factor-attribution table and narrative showing that on NewsMonsl the primary driver is HN fine-tuning (+9.89 pp micro-F1; +8.80 pp subset accuracy vs. the same encoder in zero-shot), while RAE-XMC adds a smaller, precision-oriented boost (+6.54 pp micro-F1; +1.91 pp accuracy) but can reduce tail recall, consistent with softmax-weighted neighbour aggregation favouring head labels (Table 13; Section 6.2).
Additional manuscript improvements:
- We have condensed and refocused the Abstract: the background and deployment context are now summarised in one sentence, with the remainder devoted to the problem setting, method, datasets, and headline findings (e.g., the superiority of retrieval-based XMC on NewsMonsl and its robustness for rare labels compared to baseline). The trimmed Abstract appears on p. 1:
Amid ongoing efforts to develop extremely large, multimodal models, there is increasing interest in efficient Small Language Models (SLMs) that can operate without reliance on large data-centre infrastructure. However, recent SLMs (e.g., LLaMA or Phi) with up to three billion parameters are predominantly trained in high-resource languages, such as English, which limits their applicability to industries that require robust NLP solutions for less-represented languages and low-resource settings, particularly those requiring low latency and adaptability to evolving label spaces. This paper examines a retrieval-based approach to multi-label text classification (MLC) for a media monitoring dataset, with a particular focus on less-represented languages, such as Slovene. This dataset presents an extreme MLC challenge, with instances labelled using up to twelve thousand categories. The proposed method, which combines retrieval with computationally efficient prediction, effectively addresses challenges related to multilinguality, resource constraints, and frequent label changes. We adopt a model-agnostic approach that does not rely on a specific model architecture or language selection. Our results demonstrate that techniques from the extreme multi-label text classification (XMC) domain outperform traditional Transformer-based encoder models, particularly in handling dynamic label spaces without requiring continuous fine-tuning. Additionally, we highlight the effectiveness of this approach in scenarios involving rare labels, where baseline models struggle with generalisation.
- Motivation & deployment constraints.
We revised the second paragraph of the Introduction (p. 2) to explain why we focus on efficient, retrieval-based MLC under latency, privacy, and cost constraints, and why this is especially pertinent for less-represented languages and frequently changing label spaces.
We also streamlined the surrounding related work citations to include only the most recent and directly relevant references.
- Method clarity.
We added a concise pseudocode description for the modified RAE-XMC inference (Algorithm 1, p. 12) and moved full optimisation details to Appendix D to improve readability.
- More task-targeted conclusion.
We replaced the paragraph concerning language bias in the Conclusion section (p. 19):
Notably, the method is language-agnostic and thus applicable to underrepresented or low-resource languages such as Slovene and Serbian. However, we acknowledge that pre-trained language models inherently carry the perspectives of dominant linguistic and cultural groups present in pre-training corpora. Nevertheless, in our setting, where the primary use case involves similarity search within a given language's knowledge memory, cultural or social biases are primarily contained within that language. As a result, given the multilingual encoder model selection and retrieval-based approach, it is less likely for bias to amplify or transfer across languages. Unlike tasks such as sentiment analysis or hate speech detection, which are sensitive to sociocultural nuance, our method tends to preserve or even attenuate existing biases within the original language boundaries.
And the paragraph concerning cross-lingual capabilities in the Conclusion section (p. 19):
Furthermore, although we did not conduct experiments in a cross-lingual setting, we believe the proposed method could be effectively extended to such scenarios, assuming the retrieval model has strong cross-lingual capabilities and is fine-tuned with hard negatives drawn from datasets across all target languages, especially when languages from the same family are considered. However, this assumption may not hold for highly divergent low-resource languages like Sanskrit or Hindi. These languages can have very different grammatical structures, word orders, or morphological systems compared to high-resource languages (like English, French, or German), which are usually the primary focus of NLP models and datasets.
With the following two:
Notably, the proposed method is language-agnostic and thus applicable to less-represented or low-resource languages, such as Slovene and Serbian, as the underlying retrieval models were not specifically fine-tuned on these languages. Although we did not conduct cross-lingual experiments, we believe the approach can be effectively extended to cross-lingual settings, provided the retriever exhibits strong cross-lingual alignment and is fine-tuned with hard negatives drawn from datasets across all target languages; such transfer is likely to be especially effective for languages within the same family (see Appendix B for preliminary results).
Additionally, our information retrieval approach enhances scalability further through the use of knowledge memory embeddings per language and model. This compact embedding footprint enables the system to efficiently support numerous languages in parallel or even with a single model. Importantly, these knowledge memory embeddings can be offloaded to external retrieval systems when needed, allowing for flexible deployment on various hardware configurations and seamless integration with scalable, distributed retrieval infrastructures.
Reviewer 3 Report
Comments and Suggestions for Authors
This paper proposes a retrieval-based approach to multi-label text classification for a media monitoring dataset, focusing on less-represented languages. To improve the quality of this manuscript, my comments and suggestions are listed as follows.
- The content of the Abstract is too long, especially the research background and motivation parts.
- Excessive citation of references exists in the paragraph between line 44 and line 57 on page 2. The authors use references [5–35]; there are 31 references in total for this short paragraph.
- Why do the authors study multi-label text classification for less-represented languages and low-resource environments? The research background and motivation should be enhanced in the Introduction section.
- For Section 4, authors are suggested to describe their proposed algorithm with detailed steps.
- For the tables, the captions are very long. Authors should make them concise. They can describe the table content in detail as text in the manuscript.
- The experimental analyses are very weak. The authors only offer a large amount of experimental data. More experimental analyses are expected.
- Multilingual experimentation is insufficient. Currently, the authors mainly focus on Slovene. The authors are recommended to add preliminary experiments with more languages to validate the cross-lingual generalization capability of the method.
- The content of Section 7 is too long. Generally speaking, there is only one paragraph without references.
Author Response
We thank the reviewer for their careful reading and constructive suggestions. Below, we address each point and indicate the concrete revisions made in the updated manuscript.
Comment 1: The content of the Abstract is too long, especially the research background and motivation parts.
Response 1: We have condensed and refocused the Abstract: the background and deployment context are now summarised in one sentence, with the remainder devoted to the problem setting, method, datasets, and headline findings (e.g., the superiority of retrieval-based XMC on NewsMonsl and its robustness for rare labels compared to baseline). The trimmed Abstract appears on p. 1:
Amid ongoing efforts to develop extremely large, multimodal models, there is increasing interest in efficient Small Language Models (SLMs) that can operate without reliance on large data-centre infrastructure. However, recent SLMs (e.g., LLaMA or Phi) with up to three billion parameters are predominantly trained in high-resource languages, such as English, which limits their applicability to industries that require robust NLP solutions for less-represented languages and low-resource settings, particularly those requiring low latency and adaptability to evolving label spaces. This paper examines a retrieval-based approach to multi-label text classification (MLC) for a media monitoring dataset, with a particular focus on less-represented languages, such as Slovene. This dataset presents an extreme MLC challenge, with instances labelled using up to twelve thousand categories. The proposed method, which combines retrieval with computationally efficient prediction, effectively addresses challenges related to multilinguality, resource constraints, and frequent label changes. We adopt a model-agnostic approach that does not rely on a specific model architecture or language selection. Our results demonstrate that techniques from the extreme multi-label text classification (XMC) domain outperform traditional Transformer-based encoder models, particularly in handling dynamic label spaces without requiring continuous fine-tuning. Additionally, we highlight the effectiveness of this approach in scenarios involving rare labels, where baseline models struggle with generalisation.
Comment 2: Excessive citation of references exists in the paragraph between line 44 and line 57 on page 2. The authors use references [5–35]; there are 31 references in total for this short paragraph.
Response 2: We rewrote the relevant part of the Introduction. We substantially reduced the number of citations, keeping only recent and representative works that support our claims about encoder vs. generative trends and the constraints of smaller LMs. See the third paragraph of the Introduction on p. 2.
Comment 3: Why do the authors study multi-label text classification for less-represented languages and low-resource environments? The research background and motivation should be enhanced in the Introduction section.
Response 3: We strengthened the motivation in the Introduction by adding a deployment-focused paragraph that explicitly links media-monitoring constraints (latency, cost, on-premises privacy) to the need for retrieval-based MLC, which avoids frequent end-to-end fine-tuning and runs on consumer-grade hardware. See the second paragraph of the Introduction on p. 2:
However, the operational reality of media monitoring—processing very large daily article volumes under strict latency and cost budgets—renders general-purpose LLMs ill-suited for sustained production use. High per-token inference costs, accelerator dependence, and third-party API constraints (including privacy, availability, and rate limits) conflict with industry requirements for low-latency, on-premises, and cost-predictable pipelines. These constraints are amplified for less-represented languages, where LLM coverage and training data are limited, and for extreme multi-label settings with frequent label changes, where continuous fine-tuning is infeasible. This motivates our focus on efficient, retrieval-based MLC that operates on consumer-grade hardware while maintaining robustness across multilingual and low-resourced contexts.
Comment 4: For Section 4, authors are suggested to describe their proposed algorithm with detailed steps.
Response 4: We have added the step-by-step pseudocode of the inference procedure (Algorithm 1: Modified RAE-XMC Inference), clarifying the inputs, temperature-scaled retrieval weight, the value matrix V, and the threshold. See Section 4, p. 12.
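To make this response self-contained, a minimal, illustrative sketch of the retrieval-augmented inference flow is given below. It is not the manuscript's Algorithm 1 verbatim: the function and variable names, the restriction to a single instance-level knowledge memory, and the example values of k, the temperature, and the threshold are assumptions for exposition only.

```python
import numpy as np

def retrieval_xmc_predict(query_emb, memory_embs, V, k=50, tau=0.05, threshold=0.5):
    """Predict a binary label vector for one query.

    query_emb:   (d,) L2-normalised embedding of the input document.
    memory_embs: (n, d) L2-normalised embeddings of the knowledge memory.
    V:           (n, L) binary matrix mapping memory instances to their labels.
    """
    # 1) Retrieve the k nearest memory instances by cosine similarity.
    sims = memory_embs @ query_emb                    # (n,)
    top = np.argpartition(-sims, k)[:k]
    # 2) Temperature-scaled softmax over the retrieved similarities
    #    (the retrieval weight discussed in this response).
    w = np.exp(sims[top] / tau)
    w /= w.sum()                                      # (k,)
    # 3) Aggregate the neighbours' label vectors through the value matrix V.
    label_scores = w @ V[top]                         # (L,)
    # 4) Threshold the aggregated scores to obtain the multi-label prediction.
    return (label_scores >= threshold).astype(int)
```

Because inference reduces to a nearest-neighbour search plus a label aggregation, the label space can change by updating the knowledge memory and V without re-training the encoder, which is the property the paper relies on for dynamic label spaces.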
Comment 5: For the tables, the captions are very long. Authors should make them concise. They can describe the table content in detail as text in the manuscript.
Response 5: We condensed all table captions to a single, concise sentence that identifies the dataset, slice, metric family, and methods; detailed explanations (e.g., frequency buckets, metric definitions) were relocated to the main text.
Comment 6: The experimental analyses are very weak. The authors only offer a large amount of experimental data. More experimental analyses are expected.
Response 6: We added a new sub-section, Factor Attribution Analysis, in our Results section (p. 19), where we decompose which factors of our method contribute the most and how they interact.
Additionally, to support our claims regarding suitability for low-latency deployments, we have included an inference efficiency subsection with latency and memory measurements on a single consumer-grade GPU (see Figure 11 and Table 6 in Section 5.1, p. 15).
Comment 7: Multilingual experimentation is insufficient. Currently, the authors mainly focus on Slovene. The authors are recommended to add preliminary experiments with more languages to validate the cross-lingual generalization capability of the method.
Response 7: We have added Appendix B, which includes dataset analyses and preliminary results for two additional South Slavic languages (Serbian and Macedonian). The appendix reports statistics (Table B1), label-occurrence distributions (Figure B1), and retrieval-based baselines (Tables B2–B3), as well as factor attribution (Table B3). To make the scope transparent, we clearly note that these are cross-lingual transfers of the Slovene-tuned retriever (FT-BGEsl) and do not yet include transformer baselines; they therefore illustrate feasibility rather than definitive comparisons. See Appendix B, pp. 22–23.
Comment 8: The content of Section 7 is too long. Generally speaking, there is only one paragraph without references.
Response 8: We have shortened Section 7, replacing two paragraphs and reducing the number of references:
Notably, the method is language-agnostic and thus applicable to underrepresented or low-resource languages such as Slovene and Serbian. However, we acknowledge that pre-trained language models inherently carry the perspectives of dominant linguistic and cultural groups present in pre-training corpora~\citep{nie-etal-2024-multilingual}. Nevertheless, in our setting, where the primary use case involves similarity search within a given language's knowledge memory, cultural or social biases are primarily contained within that language. As a result, given the multilingual encoder model selection and retrieval-based approach, it is less likely for bias to amplify or transfer across languages. Unlike tasks such as sentiment analysis or hate speech detection, which are sensitive to sociocultural nuance, our method tends to preserve or even attenuate existing biases within the original language boundaries.
And
Furthermore, although we did not conduct experiments in a cross-lingual setting, we believe the proposed method could be effectively extended to such scenarios, assuming the retrieval model has strong cross-lingual capabilities and is fine-tuned with hard negatives drawn from datasets across all target languages, especially when languages from the same family are considered. However, this assumption may not hold for highly divergent low-resource languages like Sanskrit or Hindi. These languages can have very different grammatical structures, word orders, or morphological systems compared to high-resource languages (like English, French, or German), which are usually the primary focus of NLP models and datasets.
With a single one:
Notably, the proposed method is language-agnostic and thus applicable to less-represented or low-resource languages, such as Slovene and Serbian, as the underlying retrieval models were not specifically fine-tuned on these languages. Although we did not conduct cross-lingual experiments, we believe the approach can be effectively extended to cross-lingual settings, provided the retriever exhibits strong cross-lingual alignment and is fine-tuned with hard negatives drawn from datasets across all target languages; such transfer is likely to be especially effective for languages within the same family (see Appendix B for preliminary results).
We hope these revisions improve clarity, balance, and reproducibility.
