1. Introduction
The rapid proliferation of digital communication platforms has fundamentally transformed the way societies interact, debate, and share information. While social media, online news portals, and public forums have democratized access to public discourse, they have simultaneously become vectors for the dissemination of hate speech, that is, language that attacks, demeans, or incites violence against individuals or groups on the basis of protected characteristics such as race, ethnicity, religion, gender, sexual orientation, or national origin [1,2]. The scale and urgency of this phenomenon are staggering: a 2023 global survey conducted by UNESCO and Ipsos across 16 countries found that 67% of Internet users have personally encountered hate speech online, with the prevalence rising to 74% among users under the age of 35 [3]. Similarly, a European Union survey has reported that approximately 80% of respondents in the EU have encountered hate speech in online environments [4]. These figures underscore that online hate speech is not a marginal or isolated problem but a pervasive feature of the contemporary digital landscape that affects billions of users worldwide.
The societal consequences of unchecked online hate speech extend far beyond individual psychological harm. At the individual level, exposure to hateful content has been associated with increased anxiety, depression, social withdrawal, and, in extreme cases, suicidal ideation [4,5]. At the community level, persistent hate speech fosters social polarization, normalizes bigotry, and erodes the quality of public discourse [4]. More alarmingly, a growing body of empirical research has linked the spread of online hate speech to real-world violence. Müller and Schwarz [6] showed that anti-refugee hate speech on Facebook predicted violent crimes against refugees in German municipalities and that the effect disappeared during major platform outages, which supports a causal interpretation. The role of social media in facilitating mass atrocities has been tragically illustrated in Myanmar, where Facebook was identified by a United Nations fact-finding mission as an instrument for those seeking to spread hate against the Rohingya Muslim minority, contributing to a campaign of ethnic cleansing that displaced over 700,000 people [7]. In light of such evidence, the need for scalable and reliable automated hate speech detection systems has become a pressing concern for governments, platform operators, and civil society alike.
The regulatory environment has evolved accordingly. The European Union’s Digital Services Act (DSA), which became fully applicable in February 2024, imposes explicit obligations on large online platforms to address illegal content, including hate speech, through a combination of automated detection, human review, and transparent reporting mechanisms [8]. In January 2025, the European Commission integrated the revised Code of Conduct on countering illegal hate speech online into the DSA framework, further strengthening the requirements for proactive content moderation. These regulatory developments have intensified the demand for automated content moderation tools that are accurate across linguistic and cultural contexts, scalable to the volume of user-generated content (which exceeds millions of posts per minute on major platforms), and robust against adversarial evasion strategies [9].
From a natural language processing (NLP) perspective, hate speech detection is typically framed as a supervised text classification problem. The field has evolved rapidly from early approaches based on handcrafted features and classical machine learning algorithms, such as bag-of-words representations paired with support vector machines or logistic regression [10,11], to deep learning architectures leveraging word embeddings [12,13] and, more recently, pre-trained transformer-based language models [14,15,16]. Fine-tuned variants of BERT [17] and its multilingual counterpart mBERT, as well as cross-lingual models such as XLM-RoBERTa [18], have achieved state-of-the-art performance on several English-language hate speech benchmarks and have demonstrated the ability to transfer detection capabilities across languages via shared multilingual representation spaces [19,20,21].
Despite these advances, several critical challenges remain. First, the overwhelming majority of hate speech detection research has concentrated on English-language data, leaving most of the world’s approximately 7000 languages without adequate detection tools or annotated resources [1,2,9]. Recent surveys have catalogued over 60 publicly available hate speech training datasets, yet the vast majority are English-centric, and only a handful cover languages from Central and Eastern Europe, the Baltics, or other underrepresented linguistic communities [9,22]. For low-resource languages, the scarcity of annotated corpora, the absence of language-specific NLP pre-processing tools, and the limited coverage of pre-trained models compound the difficulty of building effective detection systems [23,24]. Lithuanian is a prototypical example of such a low-resource scenario: despite the existence of active online communities and documented instances of online hate speech in the Lithuanian digital sphere, systematic studies on automated hate speech detection in Lithuanian have only recently begun to appear in the literature [25].
Second, modern multilingual sentence embedding models have emerged as a promising paradigm for cross-lingual and multilingual text classification tasks. Models such as Multilingual E5 [26], Jina Embeddings [27], Snowflake Arctic [28], and BGE-M3 [29] encode texts from dozens or hundreds of languages into a shared vector space, enabling downstream classifiers to operate on fixed-dimensional representations without requiring language-specific fine-tuning. However, systematic comparisons of these modern off-the-shelf embedding models for hate speech detection, particularly in multilingual settings that include low-resource languages, remain scarce. Most existing studies either focus on a single encoder, employ task-specific fine-tuning that obscures the contribution of the base representation, or evaluate only on well-resourced languages. This gap motivates a rigorous, controlled comparison of modern embedding techniques across diverse linguistic settings under a unified experimental protocol.
Third, practical deployment of hate speech detection systems in real-world content moderation pipelines imposes stringent constraints on computational efficiency, memory footprint, and inference latency. Dimensionality reduction techniques, such as Principal Component Analysis (PCA), offer a straightforward mechanism for compressing high-dimensional embeddings while potentially preserving discriminative information. Understanding the trade-off between compression and detection performance is critical for designing systems that can operate at scale without prohibitive computational costs, yet this aspect has received limited systematic attention in the hate speech detection literature.
In this paper, we address these gaps by investigating the effectiveness of several recent multilingual text embedding models for hate speech detection in three languages: Lithuanian, Russian, and English. We focus on sentence-level vector representations produced by six modern multilingual encoders—Gemma, Qwen3, BGE-M3, Snowflake Arctic, Jina Embeddings (v3), and Multilingual E5 (large-instruct)—and evaluate them in combination with a two-class CatBoost supervised classifier and a one-class HBOS anomaly detection approach across three binary hate speech datasets, with and without PCA-based dimensionality reduction.
Our main scientific contributions are as follows:
We devise a unified experimental framework for benchmarking multilingual sentence embeddings on multiple hate speech datasets.
We prepare and release LtHate, a 14,687-comment Lithuanian hate speech corpus with subject category and severity annotations.
We report a systematic comparison of six recent multilingual embedding models across Lithuanian, Russian and English hate speech corpora, with and without PCA-based compression.
We provide practical recommendations for model and embedding selection under computational constraints in multilingual moderation systems.
4. Methodology
A machine learning pipeline was implemented in Python 3.13.5, covering text pre-processing, vectorization, and model training with a 10-fold stratified cross-validation (CV) strategy on several hate speech datasets. It is available as open-source code at https://github.com/evavaic/KTU-Misijos-HIPSTer (accessed on 16 February 2026). Optionally, dimensionality reduction with PCA is applied independently for each cross-validation training split.
4.1. Text Pre-Processing
All datasets investigated for the hate speech detection task are processed using an identical, shared pipeline implemented in Python. Each text comment is first passed through a fix_punctuation function that performs the following steps:
Normalizes encoding using the ftfy package [50];
Removes hyperlinks;
Collapses repeated punctuation marks while preserving limited emphasis;
Standardizes spacing around punctuation and numbers;
Replaces emojis with shortcode text using the emoji package.
Examples of hateful comments from the LT, RU and EN datasets are provided in Table 2 to illustrate the text pre-processing. The resulting cleaned texts are then processed with a text vectorization technique to obtain feature vectors suitable for machine learning model training and testing.
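The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration and not the fix_punctuation function from the repository: the function name clean_comment, the regular expressions, and the rule of shortening runs of three or more identical marks to two are assumptions made here, and the handling of spacing around numbers is left out. The ftfy and emoji packages are used when installed; otherwise the sketch falls back to Unicode NFC normalization and leaves emojis untouched.

```python
import re
import unicodedata

try:  # ftfy is the package named in the text; optional in this sketch
    import ftfy
except ImportError:
    ftfy = None
try:  # emoji is the package named in the text; optional in this sketch
    import emoji
except ImportError:
    emoji = None

URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
REPEATED_MARK_RE = re.compile(r"([!?.,;:])\1{2,}")             # 3+ identical marks
SPACE_BEFORE_MARK_RE = re.compile(r"\s+([!?.,;:])")             # "word ," -> "word,"
NO_SPACE_AFTER_MARK_RE = re.compile(r"([!?,;:])(?=[^\W\d_])")   # "a,b" -> "a, b"


def clean_comment(text: str) -> str:
    """Hypothetical cleaning routine mirroring the listed steps."""
    # 1. Normalize encoding (ftfy if available, plain NFC otherwise).
    text = ftfy.fix_text(text) if ftfy else unicodedata.normalize("NFC", text)
    # 2. Remove hyperlinks.
    text = URL_RE.sub(" ", text)
    # 3. Collapse repeated marks but keep limited emphasis ("!!!!" -> "!!").
    text = REPEATED_MARK_RE.sub(r"\1\1", text)
    # 4. Standardize spacing around punctuation (assumed rules).
    text = SPACE_BEFORE_MARK_RE.sub(r"\1", text)
    text = NO_SPACE_AFTER_MARK_RE.sub(r"\1 ", text)
    # 5. Replace emojis with shortcode text when the emoji package is present.
    if emoji:
        text = emoji.demojize(text)
    # Finally squeeze any whitespace runs left behind by the steps above.
    return re.sub(r"\s+", " ", text).strip()


print(clean_comment("Wow!!!! see http://example.com/page now ,really???"))
```

The period is deliberately excluded from the "space after" rule so that decimals and abbreviations are not split; a production pipeline would need language-specific checks here.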
4.2. Sentence Embeddings
We compare six multilingual sentence embedding techniques for text vectorization (see Table 3). Embedding models are loaded via the SentenceTransformers package, with Jina embeddings requiring the trust_remote_code=True setting. Texts are encoded in batches with a configurable batch size, and the embeddings are concatenated into a matrix with 768 dimensions for gemma and 1024 for the remaining vectorization techniques. The compact embedding model minishlab/potion-multilingual-128M (https://huggingface.co/minishlab/potion-multilingual-128M, accessed on 16 January 2026) [51,52] with 256 dimensions was also considered in initial experiments but was later discarded because it had the lowest accuracy.
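A minimal sketch of the batched encoding step is given below. The helper names encode_in_batches, load_encoder, and toy_encoder are hypothetical, and the rule of enabling trust_remote_code only for model names starting with "jinaai/" is an assumption based on the description above. The encoder is passed in as a callable so that the sketch runs without downloading a model; the toy encoder merely stands in for a real one.

```python
from typing import Callable, Sequence

import numpy as np


def encode_in_batches(texts: Sequence[str],
                      encode_fn: Callable[[list], np.ndarray],
                      batch_size: int = 32) -> np.ndarray:
    """Encode a non-empty list of texts batch by batch and stack the results."""
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        chunks.append(np.asarray(encode_fn(batch), dtype=np.float32))
    # One row per text, one column per embedding dimension.
    return np.vstack(chunks)


def load_encoder(model_name: str) -> Callable[[list], np.ndarray]:
    """Return the encode method of a SentenceTransformers model (not run here)."""
    # Imported lazily so the sketch works without the package installed.
    from sentence_transformers import SentenceTransformer
    needs_remote_code = model_name.startswith("jinaai/")  # assumed rule
    model = SentenceTransformer(model_name, trust_remote_code=needs_remote_code)
    return model.encode


def toy_encoder(batch: list, dim: int = 8) -> np.ndarray:
    """Hypothetical stand-in encoder: repeats (text length mod 7) dim times."""
    return np.array([[len(t) % 7] * dim for t in batch], dtype=np.float32)


X = encode_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], toy_encoder, batch_size=2)
print(X.shape)
```

In practice SentenceTransformer.encode accepts a batch_size argument itself, so the explicit loop is only needed when batches must be processed or stored separately.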
Model google/embeddinggemma-300m (https://huggingface.co/google/embeddinggemma-300m, accessed on 16 January 2026) [53,54] is a 0.3 B parameter multilingual text-embedding model. The composition of the dataset that the model is trained on is not disclosed; however, the model does well on the MMTEB [55] benchmark, which includes the Lithuanian language. The model uses RoPE positional encodings [56] and supports inputs of up to 2048 tokens. It has an output dimensionality of 768 with matryoshka [57] points at 512, 256 and 128 dimensions.
Model Qwen/Qwen3-Embedding-0.6B (https://huggingface.co/Qwen/Qwen3-Embedding-0.6B, accessed on 16 January 2026) [45] is a 0.6 B parameter multilingual text-embedding model. The model uses a decoder-only architecture built on the Qwen3 dense backbone with 28 Transformer layers, pretrained on large, diverse multilingual corpora (web pages, code, and other text) to give broad language coverage. The Qwen3 model family covers 119 languages and dialects worldwide, including Lithuanian, among other Baltic and low-resource languages. The model outputs a 1024-dimensional embedding vector, which can be reduced down to 32 dimensions using the matryoshka [57] technique.
Model BAAI/bge-m3 (https://huggingface.co/BAAI/bge-m3, accessed on 16 January 2026) [29,58] is a 0.56 B parameter text-embedding model specifically trained for retrieval tasks. It is trained and refined on datasets [59,60] that include the most widespread European languages. The model supports inputs of up to 8194 tokens. It has an output dimensionality of 1024.
Model Snowflake/snowflake-arctic-embed-l-v2.0 (https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0, accessed on 16 January 2026) [28,61] is a 0.6 B parameter multilingual text-embedding model based on a transformer encoder architecture and trained on the MIRACL dataset [62], which contains 18 languages. The model uses RoPE positional encodings [56] and supports inputs of up to 8194 tokens. It has an output dimensionality of 1024, with a single matryoshka [57] point at 256 dimensions.
Model jinaai/jina-embeddings-v3 (https://huggingface.co/jinaai/jina-embeddings-v3, accessed on 16 January 2026) [27,63] is a 0.6 B parameter multilingual text-embedding model internally combining a transformer-based encoder with 5 task-specific LoRA [64] adapters. The model is trained on 100 languages and fine-tuned on 30 languages. Interestingly, one of those 30 languages is Latvian, which is highly similar to Lithuanian. The model uses RoPE positional encodings [56] and supports inputs of up to 8194 tokens. It has an output dimensionality of 1024 with matryoshka [57] points at 32, 64, 128, 256, 512 and 768 dimensions.
Model intfloat/multilingual-e5-large-instruct (https://huggingface.co/intfloat/multilingual-e5-large-instruct, accessed on 16 January 2026) [26,65] is a 0.6 B parameter multilingual text-embedding model based on a transformer encoder architecture with weights initialized from XLM-RoBERTa large [18,66], which is trained on 100 languages, including Lithuanian. The model was additionally trained on datasets from the public media and fine-tuned on curated high-quality datasets, including one used in [44]. The model supports inputs of up to 514 tokens. It has an output dimensionality of 1024.
4.3. Dimensionality Reduction
To analyze the trade-off between performance and compactness, we train models on both original embeddings and their compressed variants. Principal Component Analysis (PCA) [67] is a linear dimensionality reduction technique that finds a new set of orthogonal axes (principal components) that capture as much of the variance in the original feature space as possible in descending order of importance, thus providing a compressed projection of the original data.
By retaining only the first 32 principal components after PCA, we obtain a compact feature vector representation that concentrates most of the variance of the original embeddings. This reduces storage and computational costs for downstream machine learning models while also acting as a form of linear noise filtering: directions with very low variance, which often correspond to noise or redundant information, are discarded. However, PCA is a linear transformation that maximizes the variance of the projected vectors rather than class separability, so its effect on semantic similarity and discriminative power needs to be measured experimentally.
In the experiments here, PCA is fitted only to the training data of each cross-validation split to avoid information leakage, and the learned transformation is then applied to compress both the training and the test embeddings within that CV split.
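The leakage-free use of PCA described above can be illustrated with a small NumPy sketch. The text does not name the PCA implementation used, so the SVD-based helpers fit_pca and apply_pca are assumptions, and random matrices stand in for the training and test embeddings of one CV split.

```python
import numpy as np


def fit_pca(train: np.ndarray, n_components: int):
    """Learn the mean and the leading principal axes from training data only."""
    mean = train.mean(axis=0)
    # Rows of vt are principal axes, sorted by decreasing explained variance.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:n_components]


def apply_pca(x: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project data with a transformation that was fitted elsewhere."""
    return (x - mean) @ components.T


rng = np.random.default_rng(0)
train = rng.normal(size=(200, 64))   # stand-in for training-fold embeddings
test = rng.normal(size=(50, 64))     # stand-in for test-fold embeddings

mean, comps = fit_pca(train, n_components=32)
train_c = apply_pca(train, mean, comps)   # compressed training features
test_c = apply_pca(test, mean, comps)     # test fold reuses the training fit
print(train_c.shape, test_c.shape)
```

The important detail is that the test fold never contributes to the mean or to the principal axes; it is only projected with parameters learned on the training fold.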
4.4. Detection Models
In experiments we consider two types of downstream models for the detection task:
One-class (1c) anomaly detection. The histogram-based outlier score (HBOS) [68] model from the PyOD (Python Outlier Detection) package, used as a one-class approach trained on the target class (hate speech) examples only. The HBOS method models each feature independently using a histogram and, owing to its linear complexity, is suitable for very large datasets, being significantly faster than many other multivariate outlier detection methods. The outlier score is estimated from the density of the histogram bin into which each feature value falls, with lower density values indicating anomalous instances. Default parameters (n_bins = 10, alpha = 0.1) were used; the contamination parameter was left unchanged, since it affects neither training nor the raw output scores. Instead, the output from the model was rescaled by dividing by 10,000 (for original feature vectors) or by 100 (for PCA-transformed feature vectors) to derive a score resembling a class probability prediction.
Two-class (2c) supervised classification. A gradient boosting CatBoost classifier [69] with 500 maximum iterations, a learning rate of 0.05, depth 8, LogLoss loss function, and scale_pos_weight set to the ratio of negatives to positives in the training data (to address class imbalance). Within each cross-validation iteration, 80% of the training set is used to fit the model and 20% is held out for internal validation and early stopping, with a patience of 30 iterations, where the best CatBoost model with respect to the AUC ROC metric is retained.
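The two-class configuration can be sketched as a parameter dictionary plus two small helpers. The helper names are hypothetical, the 80/20 split is assumed here to be a simple random split, and the loss is written as "Logloss", which is how CatBoost spells it. The classifier itself is only referenced in a comment so that the sketch runs without CatBoost installed.

```python
import random


def scale_pos_weight(labels):
    """Ratio of negatives to positives, used to up-weight the positive class."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0:
        raise ValueError("at least one positive example is required")
    return neg / pos


def split_fit_validation(n_samples, valid_fraction=0.2, seed=0):
    """Random split of one training fold into fit and validation indices."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_valid = int(round(n_samples * valid_fraction))
    return idx[n_valid:], idx[:n_valid]


def catboost_params(train_labels):
    """Hyperparameters as described in the text (keys follow CatBoost naming)."""
    return {
        "iterations": 500,
        "learning_rate": 0.05,
        "depth": 8,
        "loss_function": "Logloss",
        "eval_metric": "AUC",
        "scale_pos_weight": scale_pos_weight(train_labels),
    }


# Intended use with CatBoost (not executed here):
#   model = CatBoostClassifier(**catboost_params(y_fit))
#   model.fit(X_fit, y_fit, eval_set=(X_val, y_val),
#             early_stopping_rounds=30, use_best_model=True)
params = catboost_params([0, 0, 0, 1])
fit_idx, val_idx = split_fit_validation(100)
print(params["scale_pos_weight"], len(fit_idx), len(val_idx))
```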
Although the datasets carry binary annotations of both the target class (hate speech) and the non-target class (neutral speech), which is required to evaluate detection success on test folds, the two model variants differ in the training step: the model is built from data of both classes (2c case) or from target-class data only (1c case). This selection of methods allows a direct comparison of text-embedding quality under a strong two-class (2c) supervised setting and a weaker one-class (1c) pseudo-unsupervised setting. In practice this corresponds to the scope of the annotation effort: in the one-class case, collecting only hate speech examples should be sufficient to create a detector. We intentionally train HBOS in a one-class configuration on hate speech samples only. Although in realistic moderation scenarios hate speech is the minority and is typically modeled as the anomalous class, here our goal is different: we probe whether modern embeddings form a compact, coherent region for hate speech examples. This setup concentrates on the structure of hate speech representations and decouples evaluation from the highly heterogeneous and domain-dependent distribution of non-hate content.
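To make the one-class idea concrete, the following is a simplified NumPy sketch of histogram-based outlier scoring. It is not the PyOD implementation: the class SimpleHBOS is hypothetical, values outside the training range are simply assigned zero density, and random data stands in for hate speech embeddings. The parameters n_bins and alpha mirror the defaults mentioned above.

```python
import numpy as np


class SimpleHBOS:
    """Simplified histogram-based outlier score with equal-width bins."""

    def __init__(self, n_bins: int = 10, alpha: float = 0.1):
        self.n_bins = n_bins
        self.alpha = alpha  # regularizer that keeps the logarithm finite

    def fit(self, x: np.ndarray) -> "SimpleHBOS":
        # One independent density histogram per feature.
        self.edges_, self.density_ = [], []
        for j in range(x.shape[1]):
            density, edges = np.histogram(x[:, j], bins=self.n_bins, density=True)
            self.edges_.append(edges)
            self.density_.append(density)
        return self

    def score(self, x: np.ndarray) -> np.ndarray:
        # Higher score = lower density = more anomalous.
        scores = np.zeros(x.shape[0])
        for j, (edges, density) in enumerate(zip(self.edges_, self.density_)):
            # Interior edges map each value to a bin index in [0, n_bins - 1].
            bins = np.digitize(x[:, j], edges[1:-1])
            inside = (x[:, j] >= edges[0]) & (x[:, j] <= edges[-1])
            dens = np.where(inside, density[bins], 0.0)
            scores -= np.log2(dens + self.alpha)
        return scores


rng = np.random.default_rng(0)
target_class = rng.normal(size=(500, 5))        # stand-in for hate speech embeddings
model = SimpleHBOS().fit(target_class)          # trained on the target class only
near = model.score(np.zeros((1, 5)))[0]         # point in the dense region
far = model.score(np.full((1, 5), 10.0))[0]     # point far from the training data
print(near < far)
```

Because the model is fitted on the target class only, a low outlier score means that a comment resembles the hate speech examples seen in training, which is the quantity that is subsequently rescaled into a probability-like score.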
These models were selected due to better detection accuracy results after comparing HBOS to isolation forest and CatBoost to random forest alternatives. Researchers [70] comparing various machine learning models on 165 publicly available classification problems have also found that gradient tree boosting tends to outperform bagging ensembles (random forest). Both selected models help to compare and benchmark various multilingual embeddings under identical machine learning setups, where the early stopping used in CatBoost helps to avoid overfitting in the supervised two-class case.
4.5. Evaluation Metrics
Model performance was assessed using stratified k-fold CV with 10 folds (k = 10), stratified by the target attribute: in each iteration the machine learning model is trained on all folds except one, and the held-out fold is used to test model inference. After pooling model outputs for all test folds and comparing them to ground-truth class labels, various accuracy metrics were calculated in a micro-average fashion.
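The fold construction and the pooling of test-fold outputs can be sketched in plain Python. The function names are hypothetical and the round-robin assignment is only one simple way to stratify; library implementations balance fold sizes more carefully when class counts are not multiples of k.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign each sample to one of k folds, class by class (round-robin)."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    fold_of = [0] * len(labels)
    rng = random.Random(seed)
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            fold_of[i] = pos % k
    return fold_of


def cross_val_pool(labels, fit_predict, k=10, seed=0):
    """Collect one out-of-fold score per sample across all k test folds."""
    fold_of = stratified_folds(labels, k, seed)
    pooled = [None] * len(labels)
    for fold in range(k):
        test_idx = [i for i, f in enumerate(fold_of) if f == fold]
        train_idx = [i for i, f in enumerate(fold_of) if f != fold]
        # fit_predict trains on train_idx and returns scores for test_idx.
        for i, score in zip(test_idx, fit_predict(train_idx, test_idx)):
            pooled[i] = score
    return pooled


labels = [1] * 30 + [0] * 70
folds = stratified_folds(labels)
# Dummy "model" that returns the training-set size as every test score.
pooled = cross_val_pool(labels, lambda train, test: [len(train)] * len(test))
print(pooled[0])
```

Metrics computed on the pooled list use every sample exactly once as a test example, which is what the micro-average over folds amounts to.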
To summarize detection performance, the following metrics were used:
Area under the receiver operating characteristic curve (AUC ROC), which corresponds to the probability that a randomly chosen non-target class instance will have a smaller estimated target class probability than a randomly chosen target class instance [71]. In short, AUC summarizes the probability of correctly ranking a (neutral, hate speech) pair of text examples based on the detector’s output and is directly related to the Wilcoxon–Mann–Whitney U statistic.
Area under the precision-recall curve (AUC PRC). In the case of large class imbalance, the PRC is recommended over the ROC [72] when choosing the better-performing detector.
Overall accuracy (Accuracy): the best-known evaluation metric, calculated as the proportion of predictions that match the ground-truth classes:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP is true positives (target class examples correctly predicted as target), TN is true negatives (non-target class examples correctly predicted as non-target), FP is false positives (non-target class examples incorrectly predicted as target), and FN is false negatives (target class examples incorrectly predicted as non-target). All these counts are absolute frequencies from the confusion matrix, obtained after applying the threshold to the model’s prediction (output probability of the target class).
Kappa [73,74]: accuracy corrected for chance agreement, which makes it less sensitive to class imbalance:
$$\kappa = \frac{p_0 - p_e}{1 - p_e}$$
where $p_e$ is the sum of the probabilities of the predictions agreeing with the ground truth by chance, and $p_0$ is the overall accuracy of the model. According to commonly used interpretation guidelines, a Kappa value of 0.21–0.40 indicates fair agreement and 0.41–0.60 moderate agreement. Higher values correspond to substantial (0.61–0.80) and almost perfect (0.81–1.00) agreement.
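The metrics defined above can be computed with a few lines of plain Python. The sketch below uses the confusion-matrix counts TP, TN, FP and FN as defined in the text; the function names are illustrative, and the AUC is computed by direct pair counting, which follows the ranking interpretation but is only practical for small samples.

```python
def accuracy(tp, tn, fp, fn):
    """Proportion of correct predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


def cohen_kappa(tp, tn, fp, fn):
    """Accuracy corrected for agreement expected by chance."""
    n = tp + tn + fp + fn
    p0 = (tp + tn) / n
    # Chance agreement: both say "target" plus both say "non-target".
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (p0 - pe) / (1 - pe)


def auc_roc(labels, scores):
    """Share of (target, non-target) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(accuracy(40, 40, 10, 10))
print(round(cohen_kappa(40, 40, 10, 10), 3))
print(auc_roc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

For a balanced confusion matrix with 80% accuracy the chance agreement is 0.5, so Kappa is 0.6; this shows how Kappa rescales accuracy relative to what a random labeling would reach.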
Curve-based AUC ROC and AUC PRC metrics are calculated using the model’s raw outputs before thresholding. To calculate the remaining metrics, one needs to obtain a confusion matrix by applying a threshold to the model’s raw outputs to convert a soft decision (class probability) to a hard decision (class prediction). Since the ad hoc choice of 0.5 is usually suboptimal, we selected a more effective operating point, the equal error rate (EER), where the ROC curve intersects the anti-diagonal and the class recall metrics become equal; i.e., specificity becomes approximately equal to sensitivity and, consequently, to overall accuracy.
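The choice of the equal error rate operating point can be sketched as a search over candidate thresholds for the one where sensitivity and specificity are closest. The function name eer_threshold and the convention that scores at or above the threshold are predicted as the target class are assumptions; the quadratic-time loop is kept for readability only.

```python
def eer_threshold(labels, scores):
    """Return (threshold, sensitivity, specificity) closest to equal error rate."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    best = None
    for thr in sorted(set(scores)):
        # Scores at or above the threshold are predicted as the target class.
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= thr)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < thr)
        sens, spec = tp / n_pos, tn / n_neg
        gap = abs(sens - spec)
        if best is None or gap < best[0]:
            best = (gap, thr, sens, spec)
    return best[1], best[2], best[3]


labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
thr, sens, spec = eer_threshold(labels, scores)
print(thr, sens, spec)
```

At this operating point both class recalls coincide, so overall accuracy equals the same value regardless of the class proportions.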
Additionally, besides pooling inference results on all test folds to get final metrics, we also recorded the AUC ROC result for each test fold so that statistical analysis [75] could be used to test whether there are significant differences in detection performance between the compared embeddings. Since the splitting into folds was identical for all approaches tried on the same dataset, which corresponds to a repeated-measures setup, and 10 folds provide a rather small sample, the nonparametric Friedman test with Nemenyi post hoc comparison (significance level alpha = 0.05) was applied using the Python package Autorank [76].
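For intuition, the Friedman statistic and the Nemenyi critical difference can be computed by hand from fold-wise AUC values. The sketch below is not the Autorank implementation: it omits the tie correction and the p-value, the function names are hypothetical, and the constants in NEMENYI_Q_005 are the commonly tabulated critical values for alpha = 0.05, reproduced here as an assumption.

```python
import math

# Commonly tabulated Nemenyi critical values for alpha = 0.05 (assumed here).
NEMENYI_Q_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850}


def average_ranks(row):
    """Rank the models within one fold; rank 1 = highest AUC, ties share a rank."""
    order = sorted(range(len(row)), key=lambda j: -row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def friedman_nemenyi(auc_by_fold):
    """Return mean ranks, Friedman chi-square, and Nemenyi critical difference."""
    n, k = len(auc_by_fold), len(auc_by_fold[0])
    rank_rows = [average_ranks(row) for row in auc_by_fold]
    mean_ranks = [sum(r[j] for r in rank_rows) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    cd = NEMENYI_Q_005[k] * math.sqrt(k * (k + 1) / (6 * n))
    return mean_ranks, chi2, cd


# Toy example: three models, four folds, the same ordering in every fold.
mean_ranks, chi2, cd = friedman_nemenyi([[0.9, 0.8, 0.7]] * 4)
print(mean_ranks, chi2, round(cd, 3))
```

Two models are considered significantly different when their mean ranks differ by more than the critical difference; with six embeddings and ten folds the formula gives a critical difference of roughly 2.38 rank units.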
5. Experimental Results
In this section we outline the main findings organized by dataset. Detailed ROC and PRC curves and accuracy metrics tables—for original feature vectors and for feature vectors after PCA transformation—are presented here with an overview of the results.
5.1. Lithuanian Dataset Results
Results for the Lithuanian hate speech dataset are given in Table 4 and Figures 1 and 2. One-class classification (see top part of Table 4) resulted in 53.07–63.22% accuracy for original and 53.55–59.55% accuracy for PCA-compressed embeddings. Two-class classification (see bottom part of Table 4) resulted in 73.34–78.80% accuracy for original and 70.39–76.65% accuracy for PCA-compressed embeddings. Two-class supervised classification clearly outperformed one-class anomaly detection, with slightly worse performance for PCA-compressed embeddings (drop range from 1.39% for e5 to 1.68% for qwen).
Detection effectiveness for Lithuanian hate speech, as measured by ROC/PRC curves and complementary accuracy metrics, was highest for jina embeddings, resulting in an AUC ROC of 0.868 and an AUC PRC of 0.862 for two-class classification (see Figure 1), and PCA transformation reduced those areas under the detection curves slightly (see Figure 2). Other very competitive embeddings in the two-class case were bge, snow and e5. However, both gemma and qwen embeddings demonstrated noticeably inferior results.
After comparing detection performance for the two-class non-PCA setups, we reject the null hypothesis of the Friedman test (p-value < 0.001) and conclude that there is a statistically significant difference between median AUC ROC values from 10 CV folds. Based on the post hoc Nemenyi test (see Figure 3), we assume that there are no significant differences within the following groups: gemma and qwen; qwen and snow; snow, e5, bge, and jina (the best group). All other differences are significant.
Overall, agreement between predicted class and ground truth at the upper end of the moderate range (best Kappa = 0.58 for jina embeddings) demonstrates average success in hate speech detection for the Lithuanian language dataset.
5.2. Russian Dataset Results
Results for the Russian hate speech dataset are given in Table 5 and Figures 4 and 5. One-class classification (see top part of Table 5) resulted in 67.09–80.85% accuracy for original and 64.09–75.71% accuracy for PCA-compressed embeddings. Two-class classification (see bottom part of Table 5) resulted in 89.34–92.20% accuracy for original and 85.22–91.09% accuracy for PCA-compressed embeddings. Two-class supervised classification clearly outperformed one-class anomaly detection, with minor differences (drop range from 0.7% for jina to 2.37% for qwen) between original and PCA-compressed embeddings in the two-class case.
Detection effectiveness for Russian hate speech, as measured by ROC/PRC curves and complementary accuracy metrics, was highest for e5 embeddings, resulting in an AUC ROC of 0.978 and an AUC PRC of 0.931 for two-class classification (see Figure 4), and PCA transformation did not affect it noticeably (see Figure 5). Other embeddings (jina, snow, bge) were also very competitive compared to e5 in the two-class case, with gemma embeddings performing only slightly worse. The worst performance was for qwen embeddings, both for one-class and two-class cases.
After comparing detection performance for the two-class non-PCA setups, we reject the null hypothesis of the Friedman test (p-value < 0.001) and conclude that there is a statistically significant difference between median AUC ROC values from 10 CV folds. Based on the post hoc Nemenyi test (see Figure 6), we assume that there are no significant differences within the following groups: qwen, gemma, and bge; gemma, bge, and snow; bge, snow, and jina; snow, jina, and e5 (the best group). All other differences are significant.
Substantial agreement between predicted class and ground truth (best Kappa = 0.77 for e5 embeddings) demonstrates high success levels in hate speech detection for the Russian language dataset.
5.3. English Dataset Results
Results for the English hate speech dataset are given in Table 6 and Figures 7 and 8. One-class classification (see top part of Table 6) resulted in 60.07–66.70% accuracy for original and 50.94–56.19% accuracy for PCA-compressed embeddings. Two-class classification (see bottom part of Table 6) resulted in 73.71–76.89% accuracy for original and 72.45–76.72% accuracy for PCA-compressed embeddings. Two-class supervised classification clearly outperformed one-class anomaly detection, with negligible differences (drop range from 0.17% for jina to 1.3% for qwen) between original and PCA-compressed embeddings in the two-class case.
Detection effectiveness for English hate speech, as measured by ROC/PRC curves and complementary accuracy metrics, was highest for e5 embeddings, resulting in an AUC ROC of 0.855 and an AUC PRC of 0.698 for two-class classification (see Figure 7), and PCA transformation surprisingly had almost no effect on this result (see Figure 8). Other very competitive embeddings in the two-class case were gemma, snow and jina, with notably good results for gemma embeddings.
After comparing detection performance for the two-class non-PCA setups, we reject the null hypothesis of the Friedman test (p-value < 0.001) and conclude that there is a statistically significant difference between median AUC ROC values from 10 CV folds. Based on the post hoc Nemenyi test (see Figure 9), we assume that there are no significant differences within the following groups: qwen, bge, and jina; bge, jina, and snow; jina, snow, and gemma; snow, gemma, and e5 (the best group). All other differences are significant.
Moderate agreement between predicted class and ground truth (best Kappa = 0.48 for e5 embeddings) demonstrates below-average success in hate speech detection for the English language dataset.
5.4. Overview of All Results
Across the three hate speech datasets and six multilingual embedding models, several consistent patterns can be observed. First, two-class (2c) supervised CatBoost classifiers systematically and substantially outperform one-class (1c) HBOS anomaly detectors in terms of accuracy, Kappa, and AUC metrics for all languages and embedding families. Comparing the best results of the two setups, accuracy on Lithuanian LtHate increased from 63.22% to 78.80% (jina embeddings), on Russian RuToxic from 80.85% to 92.20% (e5 embeddings), and on English EnSuperset from 66.70% (gemma) to 76.89% (e5). These results suggest that even if neutral (non-target class) examples can be time-consuming to annotate, a strong supervised approach should be preferred whenever a reasonably balanced labeled dataset can be constructed.
Second, detection effectiveness depends strongly on both the language and the embeddings used. For Lithuanian LtHate, the best result was achieved with jina embeddings, but the other three embeddings (bge, e5, snow) also seem to be very competitive (with no statistically significant differences in fold-wise AUC ROC values). For Russian RuToxic and English EnSuperset the highest accuracy was achieved with e5 embeddings, and the other competitive embeddings were jina and snow for the Russian language and gemma and snow for the English language. Overall, modern large multilingual encoder-based embeddings (jina, e5, snow, bge, gemma) consistently outperform the decoder-based qwen variant. Embedding models with excellent global multilingual benchmarks do not necessarily transfer uniformly across lower-resource target languages.
Third, a simple linear PCA transformation, whose goal is to preserve the highest variance of the embeddings for the training data, does not seem to deteriorate the inherent semantic information and still allows rather successful discrimination between classes. Across all datasets and embedding models, accuracy values for original and PCA-compressed representations differ only marginally in the two-class setting, often by up to approximately two percentage points, but Kappa and AUC metrics appear to be affected more. However, in the one-class HBOS scenario, PCA compression can noticeably degrade performance, particularly for RuToxic and EnSuperset, indicating that fine-grained density information is more important for histogram-based anomaly scoring than for gradient-boosted decision trees.
Finally, there are clear differences in achievable performance across languages and datasets. Russian RuToxic, which is relatively large and has a moderate class imbalance, yields the highest scores (best Kappa = 0.77), suggesting that current multilingual embeddings can model Russian toxic language patterns particularly well under a supervised two-class setup. Lithuanian LtHate attains lower but still competitive results (best Kappa = 0.58), reflecting both its smaller size and the increased difficulty of modeling a newly constructed low-resource language hate speech corpus. English EnSuperset yields the lowest performance (best Kappa = 0.48), which is lower than might be expected for English, but this may be explained by the heterogeneity of the source corpora.
6. Discussion and Conclusions
In this paper, we presented a comparative study of six modern multilingual sentence embedding models—gemma, qwen, bge, snow, jina, and e5—for hate speech detection in Lithuanian, Russian, and English. We introduced LtHate, a new Lithuanian hate speech corpus with detailed topical and severity annotations that we reduced to a binary classification setting for experiments, and we evaluated all embedding models in both one-class (HBOS) and two-class (CatBoost) configurations with and without PCA-based dimensionality reduction. Experimental results show that contemporary multilingual encoders combined with a popular gradient-boosted classifier can achieve moderate to substantial agreement with human annotations across all three languages, with the strongest performance observed for Russian and competitive results for the newly created Lithuanian dataset.
From a practical perspective, the experiments suggest several recommendations for multilingual hate speech detection systems. Whenever it is feasible to obtain labeled examples for both hateful and non-hateful categories, a two-class supervised setup with CatBoost (or similar gradient-boosting methods) should be preferred over purely one-class anomaly detection, as the latter consistently lags behind across datasets and metrics. Among the embedding models, jina appears to be the most suitable choice for Lithuanian, whereas e5 embeddings are most suitable for the Russian and English languages; other embeddings (snow, bge, gemma) remain competitive alternatives, often with non-significant differences in AUC ROC values, but this depends on the language. Additionally, our findings indicate that applying PCA to reduce embeddings yields only a marginal loss in CatBoost classification performance, offering a straightforward way to lower memory and computation costs in real-world deployments.
At the same time, several limitations of the current study point to directions for future work. First, we focused exclusively on the off-the-shelf sentence encoders without any task-specific fine-tuning, meaning that further gains are likely achievable via supervised or contrastive adaptation on in-domain hate speech corpora. Second, our experiments considered only binary hate vs. neutral labels, whereas LtHate and many existing datasets provide richer taxonomies (e.g., fine-grained target groups and severity levels), and modeling these distinctions may be necessary for more nuanced moderation decisions. Third, the current setup is text-only and does not incorporate multi-modal information such as images, emojis beyond textual normalization, or conversation context, which are often crucial in real-world hateful or abusive content.
Therefore, an important direction for further research would be systematic evaluation of instruction-tuned large language models in zero-shot and few-shot classification regimes, as well as hybrid architectures combining frozen multilingual embedding encoders with lightweight adapters fine-tuned on hate speech and toxicity detection tasks. Also, integrating explainability techniques and bias assessments into the evaluation protocol will be essential for understanding and mitigating potential harms when deploying multilingual hate speech detectors in high-stakes, real-world moderation scenarios.