1. Introduction
In an era of explosive growth in user-generated online content, toxic text detection has emerged as a critical challenge at the intersection of cybersecurity and social governance [1,2,3,4,5]. Toxic text, spanning hate speech, harassment, misinformation, and other harmful content, poses significant risks to mental health, social stability, and platform integrity [6,7,8]. Traditional manual moderation struggles to cope with the scale and diversity of modern digital interactions, necessitating automated solutions rooted in data-driven intelligence. By leveraging advances in natural language processing, machine learning, and big data analytics, researchers are now developing data-driven hybrid frameworks that combine analytical reasoning, computational modeling, and adaptive learning to decode implicit toxicity patterns [9,10,11].
Existing approaches to toxic text detection can be broadly categorized as rule-driven or data-driven. Rule-driven approaches depend on predefined lexicons, syntactic rules, or pattern-matching algorithms to spot toxic content, prioritizing interpretability through tools like keyword blacklists or regular expressions [12,13]. However, they struggle to adapt to evolving language subtleties, slang, or culturally specific toxicity. Data-driven approaches use deep learning to automatically identify toxicity via latent semantic patterns in text embeddings and can be further divided into several subgroups [13,14,15]. Convolutional Neural Network (CNN)-based methods employ convolutional operations to extract salient features from text data, effectively capturing local features and spatial hierarchies in the input [16,17]. Transformer-based approaches utilize self-attention mechanisms to model intricate relationships between words, allowing the model to weigh the importance of different words in relation to each other and capture long-range dependencies [18,19]. Long Short-Term Memory (LSTM)-based methods process sequential data to capture contextual information, making them adept at understanding the order and flow of text sequences [20,21]. BERT-based approaches benefit from pretraining on large-scale text corpora, enabling them to capture rich linguistic features and contextual information through masked language modeling and next-sentence prediction tasks. Ensemble-based learning methods combine the strengths of multiple models to improve performance, reducing the risk of overfitting and enhancing generalization [22,23]. In recent years, with the emergence of large language models (LLMs), LLM-based methods have been increasingly applied to toxicity detection through fine-tuning or prompt-based adaptation [24,25,26]. While LLMs offer powerful general language understanding, their application to specialized tasks like toxic text detection still requires careful adaptation: unlike traditional data-driven models fine-tuned specifically for toxicity detection, LLMs can provide broader contextual awareness but may lack the specialized focus needed for high-precision detection in specific domains.
Despite the encouraging performance achieved by data-driven methods, they still face several limitations. First, they may fail to capture the rich semantic information in text, especially the complex grammatical structures of Portuguese: the language has an intricate system of verb conjugations that vary with tense, mood, aspect, person, and number, together with noun and adjective inflections for gender and number. The prevalence of polysemy, where words carry multiple meanings depending on context, poses further challenges. Second, data-driven approaches frequently struggle to address the uncertainty inherent in model decision making. Their decisions are typically based on probabilistic outputs and confidence scores, which can be unreliable due to limited training data, class imbalance, and noise in the input text. Third, many data-driven methods rely on static fusion strategies to aggregate multi-view information, combining features from different modalities or sources with fixed weights or predefined rules. Such strategies cannot adapt to the characteristics of individual samples: in some cases textual content is most critical, while in others structural or contextual information matters more. Static fusion fails to account for these variations, which can lead to suboptimal performance; it does not provide the flexibility needed to optimally integrate multi-view information for each specific sample, limiting the overall effectiveness of the model in complex and dynamic text analysis tasks.
To address these challenges, this paper proposes a comprehensive and dynamic toxic text detection method for recognizing sustainable development insights, which integrates multi-view feature augmentation, entropy-oriented invariant learning, trustworthy comment recognition, and evidence-based information fusion. Specifically, multi-view feature augmentation defines a dual-stream encoding architecture built on BiLSTM and BERT to capture the local and global information of text. Entropy-oriented invariant learning minimizes the conditional entropy between representations extracted by the different feature encoders to align their complementary information, improving the generalization of the representations. Meanwhile, trustworthy comment recognition employs the Dirichlet distribution to estimate the uncertainty of model predictions, ensuring the reliability of the model. Finally, evidence-based information fusion dynamically aggregates information from multiple views according to the uncertainty of each view, so that the model can adaptively leverage the most informative features for each sample. Through these components, our method aims to overcome the limitations of traditional approaches and provide a more accurate and reliable solution for toxic language detection in sustainable development insights.
The key contributions of this paper are as follows:
We introduce a dual-stream framework combining BiLSTM and BERT with entropy-oriented invariant learning, effectively capturing comprehensive semantic features and enhancing generalization in toxic text detection.
We also propose a novel trustworthy comment recognition strategy using the Dirichlet function for uncertainty estimation, coupled with evidence-based dynamic information fusion, which significantly improves the reliability and accuracy of detection results.
Extensive experiments conducted on real-world datasets verify that the proposed method provides a more effective and reliable solution for toxic text detection in sustainable development insights.
The subsequent sections of this paper are structured as follows:
Section 2 provides an in-depth exposition of the proposed method.
Section 3 presents a comprehensive evaluation of experiment results.
Section 4 concludes the study by summarizing key findings and outlining future research directions.
2. The Proposed Method
Mathematically, let $\mathcal{X} = \{x_i\}_{i=1}^{N}$ denote a sustainable development insight text dataset containing $N$ textual posts. The goal of the toxic text detection task is to learn a prediction function $f: \mathcal{X} \rightarrow \mathcal{Y}$ that maps each post to a toxicity label. To this end, a trustworthy toxic text detection method is proposed for perceiving sustainable development insights, which contains multi-view feature augmentation, entropy-oriented invariant learning, trustworthy comment recognition, and evidence-based information fusion, as shown in Figure 1.
2.1. Multi-View Feature Augmentation
To address the complexity and diversity of toxic text patterns in sustainable development insights, we propose a dual-stream parallel feature extraction framework combining Bidirectional Long Short-Term Memory (BiLSTM) and BERT. This architecture leverages complementary advantages of both models to enhance feature representation.
Global semantic modeling: The BERT model, based on the Transformer’s self-attention mechanism, performs holistic semantic modeling of the global text. Its core strength lies in synchronously capturing semantic correlations between arbitrary positions in a sequence through parallelized computation, breaking the local-window limitations of traditional models and constructing cross-sentence semantic topological networks at the word-vector level. This capability enables precise parsing of deep semantic structures (e.g., logical reasoning, coreference resolution), particularly through context-sensitive word representations acquired via large-scale pretraining, which accurately distinguish the semantic differences of identical words across contexts.
Given an input text sequence $X = \{x_1, x_2, \ldots, x_L\}$, where $L$ denotes the length of the text, BERT generates contextualized embeddings through:

$$E = E_{\mathrm{tok}} + E_{\mathrm{seg}} + E_{\mathrm{pos}},$$

where $E_{\mathrm{tok}}$ denotes the token embeddings, $E_{\mathrm{seg}}$ denotes the segment embeddings (to distinguish between sentences), and $E_{\mathrm{pos}}$ denotes the positional embeddings. Then, BERT applies multiple Transformer encoder layers, each consisting of a multi-head self-attention mechanism $\mathrm{MHSA}(\cdot)$ and a feed-forward neural network $\mathrm{FFN}(\cdot)$:

$$H^{(i)} = \mathrm{FFN}\big(\mathrm{MHSA}(H^{(i-1)})\big), \quad i = 1, \ldots, s, \qquad H^{(0)} = E,$$

where $s$ denotes the number of encoding layers. The final representations of BERT are denoted as $\mathbf{z} = H^{(s)}$.
Local contextual dynamics: The BiLSTM model specializes in extracting local dynamic features of text streams through its gated recurrent architecture. By modeling bidirectional temporal sequences, it traces combinatorial patterns of evaluative elements (e.g., “screen-clarity-stunning”) in variable word orders. Even when evaluation subjects and sentiment words are separated by over 20 characters (e.g., “This phone, though I waited half a month for delivery, its AMOLED display effect truly...”), it reliably establishes precise modifier relationships. For fragmented expressions in sustainable development insights texts (e.g., frequent pronoun jumps like “this” or “it”), the model automatically repairs semantic discontinuities through hidden state transmission, effectively handling scenarios with omitted subjects like “Just received it with scratches, so disappointing.” Crucially, its incremental information processing mechanism inherently filters noise such as typos and emoji insertions.
Given an input text sequence $X = \{x_1, x_2, \ldots, x_L\}$, where $L$ denotes the length of the text, BiLSTM processes the sequence bidirectionally through two LSTM layers to generate the corresponding representations:

$$\overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}\big(x_t, \overrightarrow{h}_{t-1}\big), \qquad \overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}\big(x_t, \overleftarrow{h}_{t+1}\big),$$

where $\overrightarrow{h}_t$ represents the forward hidden state and $\overleftarrow{h}_t$ represents the backward hidden state. These two hidden states are combined to form the final bidirectional hidden state:

$$h_t = w_f \overrightarrow{h}_t + w_b \overleftarrow{h}_t,$$

where $w_f$ and $w_b$ are weight coefficients. Through this method, the model can consider both directions of context simultaneously and update the network parameters during training through forward and backward propagation. The final representations of BiLSTM are denoted as $\mathbf{h}$.
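To make the dual-stream design concrete, the following is a minimal PyTorch sketch of a combined BERT/BiLSTM encoder. The checkpoint name, pooling choices, and projection dimensions are illustrative assumptions (the paper does not specify them), and the LSTM's concatenated forward/backward states stand in for the weighted combination described above.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class DualStreamEncoder(nn.Module):
    """Dual-stream feature extraction: BERT for the global view (z),
    BiLSTM for the local view (h)."""
    def __init__(self, bert_name="neuralmind/bert-base-portuguese-cased",
                 lstm_hidden=256, out_dim=256):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)   # hypothetical checkpoint
        emb_dim = self.bert.config.hidden_size             # 768 for bert-base
        # The BiLSTM stream reuses BERT's token embeddings as its input vectors.
        self.bilstm = nn.LSTM(emb_dim, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.proj_z = nn.Linear(emb_dim, out_dim)          # global representation z
        self.proj_h = nn.Linear(2 * lstm_hidden, out_dim)  # local representation h

    def forward(self, input_ids, attention_mask):
        bert_out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        z = self.proj_z(bert_out.last_hidden_state[:, 0])  # [CLS]-pooled global view
        tok = self.bert.get_input_embeddings()(input_ids)  # token embeddings for the BiLSTM
        states, _ = self.bilstm(tok)                       # concatenated fwd/bwd hidden states
        h = self.proj_h(states.mean(dim=1))                # mean-pooled local view
        return z, h
```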
2.2. Entropy-Oriented Invariant Learning
The core motivation of view-invariant representation learning stems from addressing semantic inconsistency across multi-view features and overcoming generalization bottlenecks in models. Since BiLSTM and BERT extract features from distinct perspectives, i.e., local sequential dependencies and global contextual interactions, respectively, they inherently differ in syntactic sensitivity and semantic abstraction levels. This discrepancy may lead to shifts in the representation space of the same text across different models. Such view-specific biases can weaken the model’s ability to capture essential semantics. View-invariant representation learning mitigates this by constraining the distribution of representations across views, compelling the model to discard view-dependent interference and instead focus on high-order semantic information shared across views. Specifically, by minimizing conditional entropy between representations, the model implicitly builds semantic mapping bridges between views, aligning representations from different perspectives into unified semantic concepts in latent space. This approach not only resolves information conflicts during multi-view fusion but also provides downstream tasks with purer, more universal semantic encodings.
Specifically, the conditional entropy between representations $\mathbf{z}$ and representations $\mathbf{h}$ is defined as:

$$H(\mathbf{h} \mid \mathbf{z}) = -\mathbb{E}_{p(\mathbf{h}, \mathbf{z})}\left[\log p(\mathbf{h} \mid \mathbf{z})\right],$$

where $H(\mathbf{h} \mid \mathbf{z})$ quantifies the uncertainty of the representations $\mathbf{h}$ given the representations $\mathbf{z}$. It is computed as the negative expectation (over the joint distribution $p(\mathbf{h}, \mathbf{z})$) of the log conditional probability $\log p(\mathbf{h} \mid \mathbf{z})$. Minimizing $H(\mathbf{h} \mid \mathbf{z})$ implies reducing the uncertainty of $\mathbf{h}$ when $\mathbf{z}$ is known.

Directly computing $p(\mathbf{h} \mid \mathbf{z})$ is intractable, so a variational distribution $Q(\mathbf{h} \mid \mathbf{z})$ is introduced to approximate it. By maximizing the expectation of $\log Q(\mathbf{h} \mid \mathbf{z})$ over $p(\mathbf{h}, \mathbf{z})$, we maximize a lower bound of the original objective $-H(\mathbf{h} \mid \mathbf{z})$. This leverages the Evidence Lower Bound principle from variational inference, as follows:

$$-H(\mathbf{h} \mid \mathbf{z}) = \mathbb{E}_{p(\mathbf{h}, \mathbf{z})}\left[\log Q(\mathbf{h} \mid \mathbf{z})\right] + \mathbb{E}_{p(\mathbf{z})}\left[\mathrm{KL}\big(p(\mathbf{h} \mid \mathbf{z}) \,\|\, Q(\mathbf{h} \mid \mathbf{z})\big)\right] \geq \mathbb{E}_{p(\mathbf{h}, \mathbf{z})}\left[\log Q(\mathbf{h} \mid \mathbf{z})\right].$$

Then, we assume $Q$ to be a Gaussian distribution $Q(\mathbf{h} \mid \mathbf{z}) = \mathcal{N}\big(\mathbf{h} \mid \mu(\mathbf{z}), \sigma^{2} I\big)$, where $\mu(\mathbf{z})$ is the mean (parameterized by a cross-view mapping function) and $\sigma^{2} I$ is the covariance matrix. Substituting the Gaussian density into $\log Q(\mathbf{h} \mid \mathbf{z})$, we obtain a log-likelihood term involving the squared error $\|\mathbf{h} - \mu(\mathbf{z})\|^{2}$.

Expanding the Gaussian log-likelihood yields two terms: a squared error term $-\frac{1}{2\sigma^{2}}\|\mathbf{h} - \mu(\mathbf{z})\|^{2}$, which penalizes deviations of $\mu(\mathbf{z})$ from $\mathbf{h}$, and a constant term independent of the optimization variables. Since constants do not affect the optimization, they can be ignored; we then have:

$$\max_{\mu}\; \mathbb{E}_{p(\mathbf{h}, \mathbf{z})}\left[-\|\mathbf{h} - \mu(\mathbf{z})\|^{2}\right].$$

Since the true data distribution is unknown, we utilize Monte Carlo estimation to approximate the expectation using finite samples $\{(\mathbf{z}_i, \mathbf{h}_i)\}_{i=1}^{N}$, resulting in the sample mean squared error:

$$\mathcal{L}_{\mathbf{z} \rightarrow \mathbf{h}} = \frac{1}{N} \sum_{i=1}^{N}\left\|\mathbf{h}_i - \mu(\mathbf{z}_i)\right\|^{2}.$$

Further, in cross-view learning, to enhance bidirectional consistency, the loss function is extended into a symmetric form:

$$\mathcal{L}_{eil} = \frac{1}{N} \sum_{i=1}^{N}\left(\left\|\mathbf{h}_i - \mu(\mathbf{z}_i)\right\|^{2} + \left\|\mathbf{z}_i - \nu(\mathbf{h}_i)\right\|^{2}\right),$$

where $\nu(\cdot)$ denotes the reverse cross-view mapping from $\mathbf{h}$ to $\mathbf{z}$.
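A compact sketch of this symmetric invariance loss follows. The cross-view mappings $\mu(\cdot)$ and $\nu(\cdot)$ are realized here as linear layers; the paper does not specify their form, so the linear parameterization is an assumption.

```python
import torch.nn as nn

class EntropyOrientedInvariantLoss(nn.Module):
    """Symmetric MSE surrogate for minimizing the cross-view conditional
    entropies H(h|z) and H(z|h) under a Gaussian variational distribution."""
    def __init__(self, dim):
        super().__init__()
        self.mu = nn.Linear(dim, dim)  # cross-view mapping z -> h (assumed linear)
        self.nu = nn.Linear(dim, dim)  # cross-view mapping h -> z (assumed linear)

    def forward(self, z, h):
        loss_z2h = ((self.mu(z) - h) ** 2).sum(dim=1).mean()  # L_{z->h}
        loss_h2z = ((self.nu(h) - z) ** 2).sum(dim=1).mean()  # L_{h->z}
        return loss_z2h + loss_h2z
```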
2.3. Trustworthy Comment Recognition
Current toxic language detection models typically employ the softmax activation function for classification predictions. However, the softmax function merely converts the model’s logits into a probability distribution, lacking any estimation of decision uncertainty. This can make the model appear overly confident on complex or ambiguous inputs, even when its predictions are unreliable: even with insufficient evidence to support the classification of certain samples, the probability values output by softmax can still be very high. Such overconfidence may lead to erroneous decisions in practical applications such as content moderation. To this end, a trustworthy comment recognition strategy is designed by introducing uncertainty estimation into pattern decisions with the help of the Dirichlet distribution.
Specifically, the Dirichlet distribution is defined via $C$ parameters $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_C]$ as follows:

$$D(\mathbf{p} \mid \boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{k=1}^{C} p_k^{\alpha_k - 1}, \qquad \mathbf{p} \in \Delta^{C},$$

where $B(\boldsymbol{\alpha})$ and $\Delta^{C}$ represent the $C$-dimensional multivariate Beta function and simplex, respectively. To realize the Dirichlet parameterization, the Dirichlet network is designed with a softplus activation function in place of the softmax activation function, generating non-negative evidence representations $\mathbf{e} = [e_1, \ldots, e_C]$. Subsequently, the parameters of the Dirichlet distribution are derived from the evidence representations as follows:

$$\alpha_k = e_k + 1, \qquad k = 1, \ldots, C.$$

After obtaining the parameters of the Dirichlet distribution, the uncertainty estimation $u$ of pattern decisions and the confidence level $b_k$ for each category are modeled as follows:

$$b_k = \frac{e_k}{S} = \frac{\alpha_k - 1}{S}, \qquad u = \frac{C}{S}, \qquad S = \sum_{k=1}^{C} \alpha_k,$$

where $u$ and $b_k$ are all non-negative and their sum is 1:

$$u + \sum_{k=1}^{C} b_k = 1.$$
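In code, the mapping from logits to evidence, Dirichlet parameters, beliefs, and uncertainty takes only a few lines; this sketch follows the formulas above.

```python
import torch.nn.functional as F

def evidential_outputs(logits, num_classes):
    """Softplus evidence -> Dirichlet parameters alpha, beliefs b_k, uncertainty u."""
    evidence = F.softplus(logits)        # e_k >= 0 (softplus replaces softmax)
    alpha = evidence + 1.0               # alpha_k = e_k + 1
    S = alpha.sum(dim=1, keepdim=True)   # Dirichlet strength S = sum_k alpha_k
    belief = evidence / S                # b_k = e_k / S
    u = num_classes / S                  # u = C / S; u + sum_k b_k = 1 holds
    return alpha, belief, u
```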
Then, we infer the variational lower bound of the Dirichlet likelihood as the loss function to guide pattern mining for toxic language detection. Given observations $\{(x_i, y_i)\}_{i=1}^{N}$, where $y_i$ denotes the corresponding label of the text $x_i$, the generative process is defined as follows:

$$\mathbf{p}_i \sim D(\mathbf{p}_i \mid \boldsymbol{\alpha}_i), \qquad y_i \sim \mathrm{Mult}(y_i \mid \mathbf{p}_i).$$

According to the previous work [17], the marginal likelihood of $y_i$ is rewritten as:

$$\log p(y_i \mid x_i) = \mathcal{L}(\boldsymbol{\alpha}_i) + \mathrm{KL}\big(D(\mathbf{p}_i \mid \boldsymbol{\alpha}_i) \,\|\, p(\mathbf{p}_i \mid y_i, x_i)\big),$$

where $D(\mathbf{p}_i \mid \boldsymbol{\alpha}_i)$ is instantiated by the neural network, and $\mathcal{L}(\boldsymbol{\alpha}_i)$ is formulated as:

$$\mathcal{L}(\boldsymbol{\alpha}_i) = \mathbb{E}_{D(\mathbf{p}_i \mid \boldsymbol{\alpha}_i)}\left[\log p(y_i \mid \mathbf{p}_i)\right] - \mathrm{KL}\big(D(\mathbf{p}_i \mid \boldsymbol{\alpha}_i) \,\|\, D(\mathbf{p}_i \mid \mathbf{1})\big),$$

where $\mathrm{KL}(\cdot \,\|\, \cdot)$ represents the Kullback–Leibler divergence, a measure of how one probability distribution diverges from a second, expected probability distribution. This divergence is inherently non-negative, reflecting the information loss when approximating one distribution with another, and ensures that $\mathcal{L}(\boldsymbol{\alpha}_i)$ serves as a robust lower bound for the marginal likelihood $\log p(y_i \mid x_i)$. Thus, we optimize $\mathcal{L}(\boldsymbol{\alpha}_i)$ to achieve marginal likelihood maximization. More specifically, we encapsulate the integral of the cross-entropy loss over the Dirichlet distribution as the first term of $\mathcal{L}(\boldsymbol{\alpha}_i)$:

$$\mathcal{L}_{ace}(\boldsymbol{\alpha}_i) = \mathbb{E}_{D(\mathbf{p}_i \mid \boldsymbol{\alpha}_i)}\left[-\sum_{k=1}^{C} y_{ik} \log p_{ik}\right] = \sum_{k=1}^{C} y_{ik}\big(\psi(S_i) - \psi(\alpha_{ik})\big),$$

where $\psi(\cdot)$ is the digamma function and $y_{ik}$ denotes the $k$-th element of the one-hot label $y_i$. For the second term of $\mathcal{L}(\boldsymbol{\alpha}_i)$, a prior constraint is introduced to obtain a well-behaved Dirichlet distribution concentrated on the simplex vertex of the correct class, which implies that the evidence should be as close to 0 as possible except for the entry corresponding to the correct label:

$$\mathcal{L}_{KL}(\boldsymbol{\alpha}_i) = \mathrm{KL}\big(D(\mathbf{p}_i \mid \tilde{\boldsymbol{\alpha}}_i) \,\|\, D(\mathbf{p}_i \mid \mathbf{1})\big),$$

where $\tilde{\boldsymbol{\alpha}}_i = y_i + (1 - y_i) \odot \boldsymbol{\alpha}_i$ and $\mathbf{1}$ denotes a vector of ones. The resulting trustworthy comment recognition loss for each view is $\mathcal{L}_{tcr} = \sum_{i=1}^{N}\big(\mathcal{L}_{ace}(\boldsymbol{\alpha}_i) + \mathcal{L}_{KL}(\boldsymbol{\alpha}_i)\big)$.
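The two terms above translate directly into a loss function. The sketch below implements the expected cross-entropy via digamma functions and the KL regularizer against the uniform Dirichlet $D(\mathbf{p} \mid \mathbf{1})$; it is a minimal rendering of the equations, without the evidence-annealing schedules sometimes used in evidential learning.

```python
import torch

def dirichlet_kl(alpha):
    """KL( Dir(alpha) || Dir(1) ): penalizes evidence placed on wrong classes."""
    C = alpha.shape[1]
    S = alpha.sum(dim=1, keepdim=True)
    return (torch.lgamma(S.squeeze(1))
            - torch.lgamma(torch.tensor(float(C)))
            - torch.lgamma(alpha).sum(dim=1)
            + ((alpha - 1.0) * (torch.digamma(alpha) - torch.digamma(S))).sum(dim=1))

def evidential_loss(alpha, y_onehot):
    """Expected cross-entropy under Dir(alpha) plus the uniform-prior KL term."""
    S = alpha.sum(dim=1, keepdim=True)
    ace = (y_onehot * (torch.digamma(S) - torch.digamma(alpha))).sum(dim=1)
    # alpha_tilde removes the evidence of the true class before the KL term.
    alpha_tilde = y_onehot + (1.0 - y_onehot) * alpha
    return (ace + dirichlet_kl(alpha_tilde)).mean()
```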
2.4. Evidence-Based Information Fusion
When obtaining decision-making information from two views, we often use view-based weighting methods to integrate their complementary information, which helps improve decision-making accuracy. Nevertheless, conventional weighting methods are commonly static. Once set, the weights remain fixed throughout the decision-making process without adapting to individual sample characteristics. In other words, these fusion methods overlook that different samples may have varying significance across views, which can cause the decision-making performance to deteriorate. To this end, we model the uncertainty of the Dirichlet distribution in each sample of each view to integrate multi-view decision-making information based on Evidence Theory.
Specifically, after obtaining the uncertainty estimations $u^{1}$ and $u^{2}$ of pattern decisions and the confidence levels $b_k^{1}$ and $b_k^{2}$ for each category from the two views, the evidence-based fusion rule is defined as follows:

$$b_k = \frac{1}{1 - \kappa}\left(b_k^{1} b_k^{2} + b_k^{1} u^{2} + b_k^{2} u^{1}\right), \qquad u = \frac{1}{1 - \kappa}\, u^{1} u^{2}, \qquad \kappa = \sum_{j \neq l} b_j^{1} b_l^{2},$$

where $\kappa$ measures the conflict between the two views, and $u$ and $b_k$ denote the fused uncertainty estimation and the fused confidence level, respectively. The corresponding fused Dirichlet distribution $D(\mathbf{p} \mid \hat{\boldsymbol{\alpha}})$ is obtained through:

$$\hat{S} = \frac{C}{u}, \qquad \hat{e}_k = b_k \hat{S}, \qquad \hat{\alpha}_k = \hat{e}_k + 1.$$

Then, a multi-view evidence-based fusion loss is designed to optimize the fused Dirichlet distribution, as follows:

$$\mathcal{L}_{ebf} = \sum_{i=1}^{N}\left(\mathcal{L}_{ace}(\hat{\boldsymbol{\alpha}}_i) + \mathcal{L}_{KL}(\hat{\boldsymbol{\alpha}}_i)\right).$$
The evidence-based information fusion component, in conjunction with a Dirichlet parameterization-based classification network and evidence fusion theory, is capable of dynamically identifying views that pose risks to decision making and leveraging informative views in the final decision. This enables the model to make accurate classification decisions even when faced with diverse sustainable development insights and complex sample conditions.
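Under the stated fusion rule (the reduced Dempster combination used in evidential multi-view learning), the fused opinion can be computed as below. The vectorized conflict term uses the identity $\kappa = (\sum_j b_j^1)(\sum_l b_l^2) - \sum_k b_k^1 b_k^2$.

```python
def ds_fuse(b1, u1, b2, u2, num_classes):
    """Combine two evidential opinions (beliefs b, uncertainty u) per sample,
    then map the fused opinion back to Dirichlet parameters."""
    # Conflict kappa: mass assigned to mismatched class pairs across views.
    total = b1.sum(dim=1, keepdim=True) * b2.sum(dim=1, keepdim=True)
    agree = (b1 * b2).sum(dim=1, keepdim=True)
    conflict = total - agree
    scale = 1.0 / (1.0 - conflict)
    b = scale * (b1 * b2 + b1 * u2 + b2 * u1)  # fused beliefs
    u = scale * (u1 * u2)                      # fused uncertainty
    S = num_classes / u                        # recover Dirichlet strength S = C/u
    alpha = b * S + 1.0                        # fused alpha_k = e_k + 1
    return alpha, b, u
```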
2.5. The Overall Loss Function
We define the following loss $\mathcal{L}$ to train the proposed method and obtain the final toxic language detection results:

$$\mathcal{L} = \mathcal{L}_{tcr} + \lambda_1 \mathcal{L}_{eil} + \lambda_2 \mathcal{L}_{ebf},$$

where $\mathcal{L}_{eil}$ denotes the entropy-oriented invariant learning loss that guides the multi-view representation extraction, $\mathcal{L}_{tcr}$ denotes the trustworthy comment recognition loss that guides the toxic language detection, $\mathcal{L}_{ebf}$ denotes the evidence-based information fusion loss that guides the sample-level decision aggregation, and $\lambda_1$ and $\lambda_2$ denote trade-off parameters.
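Putting the pieces together, a hypothetical training step might look as follows, reusing the helper sketches above. The per-view evidential heads (head_z, head_h) and the batch layout are assumptions, and both views share the trustworthy recognition loss as in Section 2.3.

```python
def training_step(model, inv_loss, batch, num_classes=2, lambda1=1.0, lambda2=1.0):
    """Compute L = L_tcr + lambda1 * L_eil + lambda2 * L_ebf for one batch."""
    z, h = model(batch["input_ids"], batch["attention_mask"])
    alpha_z, b_z, u_z = evidential_outputs(model.head_z(z), num_classes)  # BERT view
    alpha_h, b_h, u_h = evidential_outputs(model.head_h(h), num_classes)  # BiLSTM view
    alpha_f, _, _ = ds_fuse(b_z, u_z, b_h, u_h, num_classes)              # fused view
    y = batch["labels_onehot"]
    l_eil = inv_loss(z, h)                                                # invariance
    l_tcr = evidential_loss(alpha_z, y) + evidential_loss(alpha_h, y)     # per-view
    l_ebf = evidential_loss(alpha_f, y)                                   # fused
    return l_tcr + lambda1 * l_eil + lambda2 * l_ebf
```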
In summary, the proposed method for toxic language detection in sustainable development insights offers several significant advantages. By integrating multi-view feature augmentation, entropy-oriented invariant learning, trustworthy comment recognition, and evidence-based information fusion, it comprehensively addresses the limitations of traditional approaches. The dual-stream feature extraction framework combining BiLSTM and BERT ensures rich and robust feature representation from both local and global perspectives. The entropy-oriented invariant learning strategy enhances the model’s generalization ability by aligning semantic representations across different views, effectively handling semantic inconsistencies. The introduction of uncertainty estimation through the Dirichlet function in the trustworthy comment recognition component provides a more reliable decision-making process, quantifying the confidence of predictions. Furthermore, the evidence-based information fusion mechanism dynamically aggregates multi-view decisions, adapting to individual sample characteristics and reducing overconfidence in predictions. Collectively, these components make the proposed method more accurate, reliable, and well-suited for detecting toxic language in the complex and diverse contexts of sustainable development insights, advancing the state of the art in this critical area of research.
3. Experimental Evaluation
3.1. Setup
Datasets: Following previous works [27,28,29,30], two common datasets, namely SustODS-PT and ToxiLuso-EC, are employed to evaluate the performance of the proposed method on the toxic language detection task. SustODS-PT comprises 10,000 sustainable development insight texts discussing sustainability topics within the Portuguese-speaking community. These comments are collected from diverse digital systems and cover a wide range of viewpoints and expressions related to sustainability initiatives. Similarly, ToxiLuso-EC contains 6497 sustainable development insight texts focusing on sustainable development content, gathered from various online forums and social networks where Portuguese is predominantly used. Both datasets are formatted for binary classification. Label 1 is assigned to text that conveys negative emotions (such as anger or frustration), negative evaluations (such as criticism or disapproval), or more intense content (such as threats or hate speech). Conversely, label 0 is assigned to text that is neutral, expressing factual information without strong emotional language; ordinary statements that do not lean toward any extreme sentiment; or somewhat positive content reflecting support, approval, or optimism regarding sustainable development initiatives. This binary setup allows a clear distinction between toxic and non-toxic language, facilitating evaluation of the proposed method’s effectiveness in identifying harmful content that could undermine constructive discourse on sustainability.
Metrics: Following previous works [22,23,31], Precision, Recall, F1, and Accuracy are utilized as metrics to evaluate the performance of the proposed method on the two datasets above. Precision measures the proportion of correctly predicted toxic instances among all instances predicted as toxic, reflecting the accuracy of positive predictions. Recall quantifies the percentage of actual toxic instances correctly identified by the model, indicating its ability to detect true positives. The F1 score balances Precision and Recall through their harmonic mean, offering a comprehensive measure of performance on imbalanced datasets. Accuracy represents the ratio of correctly predicted instances to the total number of instances, providing an overall measure of effectiveness across both toxic and non-toxic language.
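These four metrics can be computed with scikit-learn; the helper below is a small sketch (the tooling choice is an assumption, as the paper does not state how metrics are computed).

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def evaluate(y_true, y_pred):
    """Binary toxicity metrics (label 1 = toxic), matching the paper's setup."""
    return {
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
        "Accuracy": accuracy_score(y_true, y_pred),
    }
```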
Implementation Details: Our method is implemented with the open-source framework PyTorch 2.0.1 and run on an Ubuntu 20.04 server equipped with an NVIDIA Tesla V100 GPU. In the experiments, the AdamW optimizer with a learning rate of $5 \times 10^{-4}$ is selected for optimizing the overall network. The epoch number and batch size are set to 200 and 16, respectively. Following previous works [32,33,34,35], the training, validation, and test sets account for 60%, 20%, and 20% of the datasets, respectively. When the validation loss fails to improve for 5 consecutive epochs, training is stopped early. To ensure the reproducibility of the results, a fixed random seed of 42 is used throughout all experiments. The code and datasets are available at: https://github.com/liumeng1541/toldbr-bert-text-classification-pt-br (accessed on 6 June 2025).
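A minimal sketch of the corresponding training configuration (seed fixing, AdamW, and the stated hyperparameters) is shown below; the placeholder model stands in for the full dual-stream network.

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    """Fix all RNG sources so experiments are reproducible (seed 42 in the paper)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)
model = torch.nn.Linear(768, 2)  # placeholder for the full dual-stream network
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)  # paper's learning rate
EPOCHS, BATCH_SIZE, PATIENCE = 200, 16, 5  # early stop after 5 stagnant validation epochs
```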
3.2. Comparison with Baselines
Baselines: Eleven baselines in toxic text detection are used as comparison methods, which can be partitioned into five groups: convolutional neural network-based methods, i.e., GCNN-TTD [32] and RCNN-TTD [27]; Transformer-based methods, i.e., TTD [13]; LSTM-based methods, i.e., ToXCL [7] and TTD [14]; BERT-based methods, i.e., TLD [1], FFDC [11], MMTLD [9], and CA-MTL [15]; and ensemble-based learning methods, i.e., RTLD [8] and ALD [16]. Three strategies are used to ensure a fair comparison in the experiments: (1) the experimental environments for all comparison methods are identical; (2) grid search is applied to the trade-off parameters suggested by the corresponding comparison methods to obtain the best performance; and (3) the reported results are the averages of five runs for all comparison methods.
Results: The experiment results are presented in Table 1 and Figure 2. It can be observed that our proposed method demonstrates superior performance across both datasets on all four metrics. On the SustODS-PT dataset, our method achieves the highest Precision of 0.7620, Recall of 0.7584, F1 score of 0.7632, and Accuracy of 0.7548. Similarly, on the ToxiLuso-EC dataset, our method attains the best Precision of 0.7974, Recall of 0.7816, F1 score of 0.7852, and Accuracy of 0.7840. Compared to the baseline methods, our approach shows significant improvements; for instance, on the SustODS-PT dataset, it outperforms the second-best results by 0.0348 in Precision, 0.0282 in Recall, 0.0316 in F1 score, and 0.0184 in Accuracy.
Analysis: The effectiveness of the proposed method can be further explained by its individual components. The multi-view feature augmentation strategy leverages the complementary strengths of BERT and BiLSTM to enhance feature representation: BERT’s global semantic modeling captures deep semantic structures through self-attention, while BiLSTM’s local contextual dynamics extract sequential patterns and handle fragmented expressions. The entropy-oriented invariant learning addresses semantic inconsistencies across views by minimizing conditional entropy, ensuring that representations from different perspectives are aligned in latent space. The trustworthy comment recognition strategy introduces uncertainty estimation through the Dirichlet distribution, providing a more reliable decision-making process by quantifying decision uncertainty. Finally, the evidence-based information fusion dynamically integrates multi-view decision-making information, adapting to individual sample characteristics and reducing the risk of overconfidence in predictions. These components collectively contribute to the superior performance of our method in toxic text detection for sustainable development insights.
3.3. Parameter Analysis
This subsection explores the impact of the trade-off parameters $\lambda_1$ and $\lambda_2$ on detection performance through sensitivity analysis experiments. Specifically, the values of $\lambda_1$ and $\lambda_2$ are each varied over a logarithmic grid of candidate values. By fixing one parameter at a specific value and changing the other, the resulting trends in classification performance are recorded in Figure 3. The results show that when $\lambda_1$ and $\lambda_2$ are set to 1, the model achieves the best overall performance in terms of Precision, Recall, F1, and Accuracy. When the values of $\lambda_1$ and $\lambda_2$ are too small (e.g., 0.001 or 0.0001), the model tends to underfit, resulting in lower detection accuracy. Conversely, when the values are too large, the model may overfit the training data, leading to poor generalization on the test set. Therefore, in the experiments, both $\lambda_1$ and $\lambda_2$ are set to 1.
The experimental results presented in Figure 4 show the influence of different learning rates $r$ on the performance of the proposed method on the SustODS-PT dataset. The learning rate is a crucial hyperparameter that affects how quickly and effectively the model converges during training. When the learning rate is set to 0.0001, the model demonstrates relatively balanced performance across all metrics. As the learning rate increases to 0.001, there is a noticeable improvement in Precision and Recall, indicating that the model can better distinguish between toxic and non-toxic language at this rate. Further increasing the learning rate to 0.01 leads to a slight decrease in performance, suggesting that the model begins to overshoot optimal parameters. When the learning rate is set to 0.1, performance drops significantly across all metrics, likely because the model becomes unstable during training and fails to converge properly. The learning rate should therefore be tuned to balance convergence speed against final model quality; in practice, a learning rate of 0.001 achieves optimal results on the two datasets.
3.4. Ablation Study
The ablation study in Table 2 evaluates the contributions of the different components of the proposed method on the SustODS-PT dataset, systematically investigating the impact of individual and combined loss functions on the model’s performance.
In the single-view setting, using only $\mathcal{L}_{eil}$ results in lower performance across all metrics, indicating that entropy-oriented invariant learning alone is insufficient for capturing the complexities of toxic language detection. When only $\mathcal{L}_{tcr}$ is used, there is a significant improvement in Precision and Recall compared to $\mathcal{L}_{eil}$, highlighting the effectiveness of the trustworthy comment recognition component in isolation. However, the performance is still suboptimal, suggesting that combining multi-view perspectives is essential for better representation learning.
The dual-view experiments demonstrate that combining $\mathcal{L}_{eil}$ and $\mathcal{L}_{tcr}$ leads to a notable performance boost compared to the single-view approaches. This indicates that integrating both entropy-oriented invariant learning and trustworthy comment recognition leverages the complementary strengths of global semantic modeling and local contextual dynamics, enhancing the model’s ability to distinguish toxic language patterns. Further improvements are observed when the evidence-based information fusion loss $\mathcal{L}_{ebf}$ is added to either $\mathcal{L}_{eil}$ or $\mathcal{L}_{tcr}$, yielding higher Precision and Recall. This underscores the value of dynamically aggregating multi-view decision-making information based on uncertainty estimation, which helps in making more reliable classification decisions.
The complete integration of all three losses ($\mathcal{L}_{eil}$, $\mathcal{L}_{tcr}$, and $\mathcal{L}_{ebf}$) achieves the highest performance across all metrics, with Precision, Recall, F1 score, and Accuracy reaching 0.7620, 0.7584, 0.7632, and 0.7548, respectively. This comprehensive approach not only captures rich semantic information from multiple perspectives but also effectively fuses decisions through evidence theory, leading to the most robust and accurate toxic language detection. The results confirm that each component plays a crucial role in the overall effectiveness of the proposed method.
3.5. Statistical Analysis
To assess the statistical significance of the performance differences between our method and the baselines, we conducted the Nemenyi test [36] on both datasets; the results are visualized in Figure 5. The Nemenyi test is a post hoc test used after the Friedman test to determine whether performance differences between algorithms are statistically significant. In the figure, each line represents the average rank of a method across the metrics, along with its confidence interval; if the confidence intervals of two methods do not overlap, their performance difference is considered statistically significant. On the SustODS-PT dataset, our method achieved the highest average rank, and its confidence interval does not overlap with those of the baseline methods except for ALD. Similarly, on the ToxiLuso-EC dataset, our method also obtained the highest average rank, with a confidence interval that does not overlap with those of the baselines except for RTLD and ALD. This indicates that our method significantly outperforms the majority of the baseline methods, further confirming its superiority in toxic language detection on both datasets.
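For reference, this kind of analysis can be reproduced with scipy and the scikit-posthocs package; the score matrix below is a hypothetical stand-in for the per-setting results of the 12 compared methods (the paper does not state its tooling).

```python
import numpy as np
import scipy.stats as ss
import scikit_posthocs as sp

# Rows = evaluation settings (dataset x metric), columns = the 12 compared methods.
# Random values here are placeholders for the actual results in Table 1 / Figure 2.
rng = np.random.default_rng(42)
scores = rng.random((8, 12))

stat, p = ss.friedmanchisquare(*scores.T)   # omnibus Friedman test across methods
print(f"Friedman chi2 = {stat:.3f}, p = {p:.4f}")
if p < 0.05:
    # Pairwise Nemenyi post hoc test on the same blocked measurements.
    print(sp.posthoc_nemenyi_friedman(scores).round(3))
```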