2. Related Work
Thai sentiment analysis has developed since 2009, progressing from lexicon-based methods to deep learning architectures. Earlier studies sought to improve sentiment analysis methods for the Thai context, whereas more recent studies have employed single and hybrid transformer-based models, as well as other deep learning models, to better capture the Thai language. Previous studies can be grouped into two categories.
The first group emphasized lexicon-based and machine learning methods. For example, Phienthrakul et al. explored SVM with multiple kernel functions for sentiment classification in Thai text [
2]. The combination of different kernel functions and a single kernel function was evaluated, and it was reported that the combination could capture diverse linguistic features and improve classification performance compared to single-kernel approaches on product reviews. Lertsuksakda et al. proposed a novel approach to constructing Thai sentiment terms using the hourglass of emotions model [
3]. They moved beyond simple positive–negative polarity to capture more nuanced emotional states, constructing sentiment lexicons that reflect complex emotional dimensions. Chirawichitchai investigated emotion classification in Thai text using different term weighting and machine learning techniques [
4]. Various term weighting schemes were evaluated to improve classification accuracy. The results showed that appropriately weighted features could enhance the performance of machine learning models for Thai emotion classification. Chirawichitchai proposed a term-weighting scheme based on the term occurrence ratio, optimized for sentiment analysis [
5]. The scheme was formulated mathematically to calculate term importance in a way that better reflects the sentiment characteristics of Thai words. It improved classification performance by emphasizing terms with strong discriminative power for sentiment polarity.
Pasupa et al. applied SVMs to sentiment analysis of Thai children’s stories [
6]. They extended sentiment analysis beyond the domains in which it is typically performed, such as product reviews and social media. Netisopakul and Lertsuksakda employed hypothesis testing based on observations from Thai sentiment classification experiments [
7]. A more rigorous statistical approach was used to evaluate sentiment classification methods, employing hypothesis testing to validate performance differences between approaches. Haruechaiyasak et al. introduced S-Sense, which is a comprehensive framework for sentiment analysis. The framework was specifically designed for social media sensing [
8]. It integrated multiple components, including preprocessing, feature extraction, and classification, tailored to the informal and often abbreviated language used in Thai social media. Porntrakoon and Moemeng introduced the SenseComp (Sentiment Compensation) technique for multi-dimensional analysis of consumer reviews in Thai [
9]. They observed that consumer reviews often express mixed sentiment across product dimensions. As a result, a more sophisticated analysis was required than simple overall polarity classification. To address this, the proposed SenseComp technique improved sentiment analysis by evaluating multiple attributes separately. Taemung and Chirawichitchai applied SVM to analyze sentiment in Thai product reviews [
10]. They focused on e-commerce applications and addressed practical challenges in classifying Thai-language customer reviews. The study showed that SVM could achieve reasonable performance on commercial sentiment analysis tasks. However, it still struggled to handle complex or ambiguous sentiment expressions.
The second group focused on deep learning and transformer approaches. For example, Vateekul and Koomsubha conducted comprehensive studies by applying deep learning techniques on Thai Twitter data [
11]. Various deep learning architectures, including DCNN (Dynamic Convolutional Neural Network) and LSTM, were explored for sentiment analysis of Thai social media texts. The study showed that the deep learning approaches outperformed traditional machine learning methods on this task. The best model was DCNN, which achieved the highest accuracy of 75.35%. Pasupa and Seneewong also conducted a comparative study of deep learning techniques for Thai sentiment analysis [
12]. Various deep learning architectures were systematically evaluated, and the impact of different input representations, including word embeddings, POS tags, and sentiment features, was explored. The study showed that deep learning approaches are most effective for Thai sentiment analysis. The CNN architecture, combined with the three feature types, could achieve the highest F1 of 81.70%. Thong-iad and Netisopakul compared different methods for Thai sentence sentiment tagging using Thai sentiment resources [
13]. Different sentiment lexicons and tagging approaches were evaluated for the classification task. The results showed that using adverb and adjective synsets alone could achieve the highest emotion classification accuracy. Traditional machine learning methods, such as SVM and Random Forest, perform well, but deep learning models, including CNN, LSTM, and BERT-based models, often surpass them, especially when advanced techniques such as hyperparameter tuning and incorporating linguistic features (POS, sentiment values) are employed. SVMs offer strong performance among traditional methods, whereas hybrid deep learning models and transformer models achieve higher accuracy.
Recently, transformer-based models have significantly advanced natural language processing across diverse languages [14]. Prominent representatives of this family include BERT [15], mBERT [16], RoBERTa [17], ALBERT [18], XLM-RoBERTa [19], and ELECTRA [20]. Several of these models, including mBERT, ALBERT, and XLM-RoBERTa, support Thai language processing. However, prior studies have reported that multilingual models often achieve lower performance than monolingual models specifically trained and optimized for a single language, particularly in language-specific downstream tasks [21,22,23,24]. Although transformer-based pre-trained models are highly effective in capturing rich linguistic representations, they may not sufficiently emphasize sentiment-discriminative features required for fine-grained sentiment classification. Consequently, integrating transformer-based pre-trained models with deep learning architectures has been proposed as an effective strategy and has demonstrated promising results for Thai sentiment analysis [25]. Lowphansirikul et al. [
26] proposed WangchanBERTa, a RoBERTa-based model pre-trained on large-scale Thai corpora that provides powerful contextualized representations for capturing Thai-specific semantic and syntactic information. WangchanBERTa consistently outperforms established multilingual models such as mBERT and XLM-R across a variety of benchmarks, including NER (Named Entity Recognition), sentiment analysis, and POS tagging. It has since become a foundation model for numerous Thai NLP tasks, including sentiment analysis. Pasupa and Seneewong proposed hybrid deep learning models that combined multiple neural architectures for Thai sentiment analysis [
1]. The hybrid models, which integrate CNNs and LSTMs, could capture both local features and long-range dependencies and achieved higher performance than single-architecture models. Among the proposed hybrid models, BiLSTM-CNN performed best, with a macro-F1 of 74.36% on ThaiTales, 77.07% on ThaiEconTwitter, and 55.21% on WISESIGHT. Jitboonyapinit et al. investigated sentiment analysis on Thai social media using convolutional neural networks combined with long short-term memory networks [
27]. This work addressed the challenges posed by informal language and unique linguistic patterns in Thai social media posts. The CNN-LSTM model successfully extracted spatial and temporal features and achieved 85.00% accuracy on social media product reviews. Khamphakdee and Seresangtakul developed an efficient deep learning approach optimized for Thai sentiment analysis [
28]. This work focused on balancing model performance with computational efficiency. Modifications to the model architecture reduced training time while maintaining high accuracy.
Nokkaew et al. analyzed online public opinion regarding major infrastructure projects using advanced machine learning and deep learning for Thai sentiment analysis [
29]. Their research demonstrated the application of sentiment analysis to policy-relevant social issues by analyzing public discourse on the Thailand–China high-speed train and Laos–China railway projects. Comment sentiment classification was performed using six approaches: linear regression, Naive Bayes, Random Forest, BiLSTM, BERT-Base-Thai, and WangchanBERTa. The WangchanBERTa model achieved 94.57% accuracy. Suraratchai and Phoomvuthisarn proposed a hybrid method combining WangchanBERTa with CNN and BiLSTM architectures for Thai sentiment analysis [
25]. Their research was built upon the pre-trained WangchanBERTa model by adding convolutional and bidirectional recurrent layers to capture additional linguistic features. The hybrid architecture achieved competitive performance on the WISESIGHT and the Thai children’s tales datasets. The Parallel Hybrid approach, WangchanBERTa-CNN-BiLSTM, achieved the highest macro-F1, reaching 62.70% on the WISESIGHT dataset and 78.59% on the Thai Children’s Tales dataset. Satjathanakul and Siriborvornratanakul focused on improving sentiment polarity classification on the Thai product reviews dataset using modern Transformer-based architectures [
30]. The fine-tuned WangchanBERTa model achieved performance metrics ranging between 66% and 93% on the product reviews dataset. Emphan et al. enhanced the performance of the sentiment analysis model using GridSearchCV for hyperparameter optimization. This work focused on classifying sentiment in electric-vehicle discussions in Thailand [
31]. Hyperparameter tuning was used to identify optimal configurations for Thai sentiment classification models. The study demonstrated that careful hyperparameter tuning could improve performance.
Previous studies have shown that transformer-based models, particularly WangchanBERTa, provide strong contextual understanding for Thai sentiment analysis, whereas hybrid architectures that combine WangchanBERTa with CNN or recurrent networks further improve performance by capturing both global and local semantic information. However, most existing approaches that combine WangchanBERTa with CNN rely on fixed convolution kernel combinations. They assume that all sentiment classes and all input sentences require identical receptive fields. This assumption is particularly problematic for highly imbalanced datasets, such as WISESIGHT, where minority classes often require different semantic scopes than the dominant classes. Although previous studies have extensively explored hybrid architectures, the problem of dynamically selecting kernel importance based on both class characteristics and instance-level semantics remains largely underexplored in Thai sentiment analysis. Therefore, this study proposes WangchanBERTa-IC-DKS. This novel hybrid architecture integrates WangchanBERTa with multi-kernel CNN and dynamic kernel selection mechanisms at both the class and instance levels. This design enables adaptive kernel weighting to improve local feature extraction, enhance minority-class recognition, and improve balanced classification performance on the Thai sentiment datasets.
4. Experiment Results and Discussion
4.1. Experimental Setup
In the experimental setup, the input sequence length was limited to 128 tokens. The model was trained with a batch size of 32 and optimized using AdamW with a learning rate of , incorporating a weight decay of 0.01 to mitigate overfitting. A dropout rate of 0.2 was also applied as a regularization technique to improve model generalization. All models were trained for up to 20 epochs, with early stopping (patience of 5) based on validation performance to prevent overfitting. The best checkpoint for each model was the one with the highest validation macro-F1. To assess the robustness and stability of the experimental results, each model was trained and evaluated three times using different random seeds: 42, 123, and 777. The reported results are the mean and standard deviation across these three runs. All experiments were conducted on Google Colab using an NVIDIA Tesla T4 GPU (12 GB VRAM).
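The checkpoint-selection and reporting protocol described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `select_best_epoch`, the validation curve, and the three run scores are hypothetical, and the use of the sample standard deviation is an assumption. Only the 20-epoch cap, the patience of 5, the selection by validation macro-F1, and the three seeds come from the text.

```python
from statistics import mean, stdev

def select_best_epoch(val_macro_f1, max_epochs=20, patience=5):
    """Return (best_epoch, best_score) under early stopping.

    Training stops once `patience` consecutive epochs fail to improve
    on the best validation macro-F1 seen so far.
    """
    best_epoch, best_score, waited = -1, float("-inf"), 0
    for epoch, score in enumerate(val_macro_f1[:max_epochs]):
        if score > best_score:
            best_epoch, best_score, waited = epoch, score, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_score

# Hypothetical validation curve for one run (not taken from the paper).
curve = [0.50, 0.55, 0.58, 0.57, 0.60, 0.59, 0.59, 0.58, 0.60, 0.59, 0.61]
best_epoch, best_score = select_best_epoch(curve)

# Hypothetical test macro-F1 of the three runs (seeds 42, 123, 777).
runs = [0.63, 0.64, 0.65]
summary = f"{mean(runs):.4f} ± {stdev(runs):.4f}"
```

In this hypothetical curve the loop stops after five non-improving epochs, so the later value of 0.61 is never reached; this is the usual cost of a finite patience.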
4.2. Kernel Size Comparison
The kernel size comparison experiment was conducted to investigate the impact of different kernel size combinations on the performance of the proposed WangchanBERTa-IC-DKS model. Since convolutional kernel sizes determine the range of local semantic patterns captured by the CNN module, selecting an appropriate kernel combination is crucial for effective sentiment classification.
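To make the link between kernel size and receptive field concrete, the sketch below lists the token windows that a width-k one-dimensional convolution would cover. It assumes stride 1 and no padding, which the text does not specify, and the helper name `kernel_windows` and the placeholder tokens are hypothetical.

```python
def kernel_windows(tokens, k):
    """List every contiguous window of k tokens that a width-k
    1D convolution slides over (assumed: stride 1, no padding)."""
    return [tokens[i:i + k] for i in range(len(tokens) - k + 1)]

# Hypothetical tokenized sentence; real inputs would be subword tokens.
tokens = ["t1", "t2", "t3", "t4", "t5", "t6"]

# Number of windows seen by each kernel size in the candidate set.
window_counts = {k: len(kernel_windows(tokens, k)) for k in (2, 3, 4, 5)}
```

Smaller kernels see many short n-grams, while larger kernels see fewer but wider spans, which is why the choice of candidate set can change what the CNN module captures.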
As shown in Table 3, the kernel candidate [2, 3, 4, 5] provides the most effective feature extraction strategy for the proposed WangchanBERTa-IC-DKS model on the WISESIGHT dataset. It achieves the highest macro-F1 (63.93%), exceeding [2, 3, 4] (56.20%), [3, 4, 5] (57.40%), and [1, 2, 3, 4, 5] (51.27%). It also produces the best accuracy (73.58%), macro-precision (67.53%), and macro-recall (62.52%).
4.3. Ablation Experiment
Table 4 presents the ablation study of the proposed WangchanBERTa-IC-DKS model using the kernel candidate [2, 3, 4, 5] on the WISESIGHT sentiment dataset. The objective of this experiment is to verify the contribution of each dynamic kernel selection component, including fixed multi-kernel CNN, class-aware dynamic kernel selection (class-aware DKS), instance-aware dynamic kernel selection (instance-aware DKS), and the full proposed Instance-Class-Aware Dynamic Kernel Selection (IC-DKS) framework. All models use the same kernel candidates [2, 3, 4, 5]. The baseline model, WangchanBERTa-CNN with fixed multi-kernel convolution, achieved a macro-F1 of 56.25%, indicating that the combination of WangchanBERTa and a multi-kernel CNN effectively captures local semantic patterns for Thai sentiment classification. However, fixed kernels assume that all sentiment classes and all input sentences require the same receptive field, thereby limiting the model’s ability to handle semantic diversity across classes and sentence structures. When the class-aware DKS mechanism was introduced, the model achieved higher accuracy (70.58%) and macro-precision (63.59%), but slightly lower macro-recall (54.34%) than the fixed multi-kernel convolution model. The macro-F1 improved to 56.66%. This indicates that class-aware weighting helps the model learn class-specific kernel preferences. For the instance-aware DKS mechanism, the accuracy slightly improves to 70.81% compared to the baseline, as it enables the model to dynamically adjust kernel importance for each sentence based on the CLS representation. However, the macro-F1 (55.35%) is slightly lower than the baseline. This suggests that relying solely on instance-level adaptation without class-level structural guidance may lead to unstable kernel selection, particularly on imbalanced datasets, where minority classes require stronger prior information. 
The proposed WangchanBERTa-IC-DKS model, which integrates class-aware and instance-aware mechanisms, achieves the best macro-F1 score (63.93%). Compared with the fixed multi-kernel baseline, the proposed model improves macro-F1 by more than 7 percentage points, demonstrating substantial gains in balanced classification performance, especially for minority classes where macro-F1 is a more reliable metric than accuracy.
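The ablation gains quoted above can be checked with simple arithmetic. The sketch below uses only the macro-F1 values reported in the text; the dictionary keys are shorthand labels chosen for this illustration.

```python
# Macro-F1 (%) of each ablation variant, as reported in the text.
macro_f1 = {
    "fixed multi-kernel CNN": 56.25,
    "class-aware DKS": 56.66,
    "instance-aware DKS": 55.35,
    "IC-DKS (full)": 63.93,
}

baseline = macro_f1["fixed multi-kernel CNN"]

# Change in percentage points relative to the fixed-kernel baseline.
deltas = {name: round(score - baseline, 2) for name, score in macro_f1.items()}
```

The class-aware and instance-aware variants change macro-F1 by +0.41 and -0.90 points on their own, while the full model gains 7.68 points, so the combined gain is far larger than the sum of the individual changes.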
The results indicate that the class-aware and instance-aware mechanisms function in complementary rather than independent ways. The class-aware component captures relatively consistent kernel preference tendencies at the class level, while the instance-aware module dynamically refines these preferences according to the semantic characteristics of each input sentence. Their integration enables fine-grained dynamic kernel selection, thereby enabling the model to capture diverse local semantic patterns better and achieve improved classification performance on the WISESIGHT sentiment benchmark. Therefore, the performance of the proposed WangchanBERTa-IC-DKS model does not come from simply increasing model complexity, but from the effective interaction between class-level preference priors and instance-level semantic adaptation. These results support the effectiveness of the proposed framework and suggest that dynamic kernel selection may be more effective than fixed convolutional kernels for handling diverse sentiment expressions in real-world Thai social media text.
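One way to picture how a class-level prior and an instance-level adjustment could interact is to add their logits before normalizing. The sketch below is an assumption-laden illustration, not the authors' implementation: the fusion rule (logit addition), the function names, and all logit values are hypothetical, and the text in this section does not state how the two weightings are actually combined.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combine_kernel_weights(class_logits, instance_logits):
    """Add class-level and instance-level logits, then normalize.

    Adding logits is equivalent to multiplying the two softmax
    distributions and renormalizing (an assumed fusion rule).
    """
    return softmax([c + i for c, i in zip(class_logits, instance_logits)])

KERNELS = (2, 3, 4, 5)

# Hypothetical logits: a class-level prior and a sentence-level adjustment.
class_logits = [0.0, 1.0, 0.5, 0.0]
instance_logits = [0.0, 0.0, 1.0, 0.0]

weights = combine_kernel_weights(class_logits, instance_logits)
preferred_kernel = KERNELS[weights.index(max(weights))]
```

With these illustrative numbers the class prior alone would favor kernel size 3, but the instance-level adjustment shifts the preference to kernel size 4, which mirrors the idea of instance-level refinement of class-level tendencies.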
4.4. Internal Baseline Comparison
Table 5 presents the comparative performance of the proposed WangchanBERTa-IC-DKS model against the baseline WangchanBERTa and its hybrid variants on the WISESIGHT dataset. The purpose of this experiment is to evaluate whether the proposed dynamic kernel selection mechanism provides consistent performance improvements over the baseline WangchanBERTa and conventional WangchanBERTa-based hybrid architectures, including recurrent architectures such as BiLSTM and BiGRU. WangchanBERTa-IC-DKS achieves the highest macro-F1 of 63.93%, outperforming the original WangchanBERTa baseline, WangchanBERTa-BiGRU, and WangchanBERTa-BiLSTM. The WangchanBERTa-BiLSTM model achieved strong performance with an accuracy of 72.42% and a macro-F1 score of 60.09%, indicating that recurrent architectures can effectively capture sequential dependencies in Thai sentiment classification. In contrast, WangchanBERTa-BiGRU yielded lower results, with a macro-F1 of only 51.31%, suggesting that the simpler gating mechanism of GRU may be insufficient to handle the complex contextual ambiguity and informal linguistic patterns found in Thai social media text. Compared with the strongest recurrent baseline, WangchanBERTa-BiLSTM, the proposed model improves macro-F1 by 3.84 percentage points. These results demonstrate that dynamic kernel selection is more effective than both recurrent sequence modeling and static convolutional feature extraction on the WISESIGHT dataset.
4.5. Computational Efficiency and Performance Trade-Off
To further validate the practical applicability of the proposed model, we conducted an additional comparison of computational efficiency. The comparison includes total parameter size, inference time on the full test set, and macro-F1 score, as shown in Table 6. The standard WangchanBERTa model comprises the pretrained WangchanBERTa encoder (105.24 M parameters) and a simple linear classification head (0.01 M parameters), yielding a total of approximately 105.25 M parameters. In contrast, the proposed WangchanBERTa-IC-DKS model replaces the standard classifier with the IC-DKS classification head (1.38 M parameters), resulting in a total of approximately 106.62 M parameters. Similarly, the other hybrid models replace the standard classifier with alternative task-specific heads such as CNN, BiLSTM, and BiGRU. The proposed WangchanBERTa-IC-DKS achieves the highest macro-F1 while maintaining the same parameter size as the standard WangchanBERTa-CNN baseline, despite introducing both instance-aware and class-aware dynamic kernel selection mechanisms. This indicates that the performance improvement is not primarily due to increased model complexity, but rather to more effective adaptive feature extraction enabled by dynamic kernel selection. In terms of inference efficiency, averaged over three runs, the proposed WangchanBERTa-IC-DKS takes 18.85 s for full test-set inference, only slightly longer than the fixed multi-kernel CNN baseline (18.80 s). This small difference suggests that the dynamic kernel weighting mechanism introduces only modest additional computational overhead. Compared with sequential models such as BiLSTM and BiGRU, the proposed model also provides a better balance between efficiency and performance. Although WangchanBERTa-BiLSTM achieves competitive performance (60.09%), it has the longest inference time (19.19 s) and the largest parameter size (107.35 M). In contrast, WangchanBERTa-IC-DKS improves predictive performance while maintaining reasonable inference latency and parameter efficiency. As a result, the proposed framework provides a favorable balance between classification performance and computational efficiency.
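The parameter and latency figures above can be turned into relative overheads with a few lines of arithmetic. The sketch uses only numbers quoted in the text; the variable names are chosen for this illustration.

```python
# Parameter counts in millions, as quoted in the text.
encoder_m = 105.24       # pretrained WangchanBERTa encoder
linear_head_m = 0.01     # linear classification head
ic_dks_head_m = 1.38     # IC-DKS classification head

baseline_total = round(encoder_m + linear_head_m, 2)
ic_dks_total = round(encoder_m + ic_dks_head_m, 2)
param_overhead_pct = round(
    100 * (ic_dks_total - baseline_total) / baseline_total, 2
)

# Full test-set inference time in seconds (averaged over three runs).
cnn_time_s, ic_dks_time_s = 18.80, 18.85
latency_overhead_pct = round(100 * (ic_dks_time_s - cnn_time_s) / cnn_time_s, 2)
```

Under these figures, the IC-DKS head adds about 1.3% to the parameter count of the plain WangchanBERTa classifier, and inference is about 0.27% slower than the fixed multi-kernel CNN baseline.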
4.6. Comparison with Previous Studies
Table 7 presents the overall performance comparison between the proposed WangchanBERTa-IC-DKS model and previous methods on the WISESIGHT dataset. The results show that the proposed model achieves the highest macro-F1 of 63.93%. The proposed model outperformed the previous BiLSTM-CNN model (55.21%) and the state-of-the-art Parallel Hybrid model (62.70%) by 8.72 and 1.23 percentage points, respectively. Although the numerical improvement over the state-of-the-art Parallel Hybrid model appears modest, this gain is meaningful because macro-F1 is the most appropriate evaluation metric for the WISESIGHT dataset due to its severe class imbalance, particularly the very small size of the question class. Unlike accuracy, macro-F1 assigns equal weight to all classes and more accurately reflects the model’s effectiveness on minority classes.
Table 8 further provides a per-class performance comparison between the previous state-of-the-art Parallel Hybrid model and the proposed WangchanBERTa-IC-DKS model on the WISESIGHT dataset. The proposed model improves F1 in the negative and question classes, from 76.74% to 78.11% and from 42.96% to 45.62%, respectively, which indicates that it is more effective at handling minority and difficult classes, a key challenge in Thai sentiment classification on social media text. For the positive class, F1 also improves, from 52.33% to 53.83%, indicating that the model can identify more positive samples despite the strong semantic overlap between the positive and neutral classes. Although the neutral class shows a slight decrease in F1 from 78.76% to 78.16%, this reduction is small relative to the gains in the other three classes. This trade-off is desirable because macro-F1 weights all classes equally, so gains in the minority classes outweigh a small loss in the dominant class and yield more balanced classification performance.
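Because macro-F1 is the unweighted mean of the per-class F1 scores, the per-class values quoted above can be used to cross-check the reported macro-F1 figures. The helper name `macro_f1` is chosen for this sketch; the numbers are those given in the text.

```python
def macro_f1(per_class_f1):
    """Unweighted mean of per-class F1 scores."""
    return sum(per_class_f1.values()) / len(per_class_f1)

# Per-class F1 (%) on WISESIGHT, as quoted in the text.
parallel_hybrid = {
    "positive": 52.33, "neutral": 78.76, "negative": 76.74, "question": 42.96,
}
ic_dks = {
    "positive": 53.83, "neutral": 78.16, "negative": 78.11, "question": 45.62,
}

previous = macro_f1(parallel_hybrid)
proposed = macro_f1(ic_dks)
gain = proposed - previous
```

The two averages come out at roughly 62.70 and 63.93, consistent with the reported macro-F1 values, and every class contributes exactly one quarter of the score regardless of its size.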
These findings support the effectiveness of the proposed framework on the WISESIGHT dataset. This combination of class-aware and instance-aware modules enables the model to capture both global class characteristics and local sentence-specific semantic patterns more effectively. As a result, the proposed model not only improves overall performance but, more importantly, enhances robustness to difficult and underrepresented classes, a critical requirement for real-world sentiment analysis systems.
4.7. Kernel Weight Analysis on the WISESIGHT Dataset
To improve the interpretability of the proposed WangchanBERTa-IC-DKS framework, we further analyze the kernel importance distributions on the WISESIGHT dataset from both class-aware and instance-aware perspectives. Since the proposed model dynamically assigns kernel importance based on both sentiment classes and individual sentence characteristics, understanding these distributions helps explain how different kernel sizes contribute to sentiment classification performance. Among these two perspectives, class-aware analysis provides the primary evidence for model interpretability because it reflects global kernel preference patterns at the sentiment-class level. In contrast, instance-aware analysis serves as supporting evidence by illustrating local sample-specific adaptation.
Table 9 presents the mean ± standard deviation of class-aware kernel weights across the three random seeds. The class-aware kernel weight analysis reveals that overall kernel preference patterns are partially consistent across random seeds, although the degree of stability varies across sentiment classes. Although some standard deviations remain relatively large, the interpretation focuses on consistent ranking tendencies across seeds rather than exact numerical values. For the positive class, larger kernels, especially
, tend to receive the highest average importance (0.3964), suggesting that broader contextual patterns are often useful for capturing positive sentiment expressions. However, both
and
show relatively large standard deviations, indicating that this preference is not fully stable across runs. This may be because positive sentiment expressions in Thai social media are linguistically diverse, ranging from short emotional expressions such as “ดีมาก” (very good) and “ชอบมาก” (really like) to longer context-dependent statements. As a result, the model may rely on either short-range or broader contextual patterns depending on training dynamics. For the neutral class, kernels
and
receive the highest average importance, with
showing the largest mean kernel weight (0.4672), suggesting that broader contextual patterns are more useful for modeling neutral sentiment expressions. Nevertheless, the relatively large standard deviations for both kernels indicate that the dominance between these receptive fields varies across runs, reflecting the difficulty of distinguishing neutral sentiment from weak positive or weak negative expressions. For the negative class, kernel
has the smallest standard deviation (±0.0180) while maintaining consistently high importance across all seeds. This potentially indicates that medium-range local semantic patterns may be useful for identifying negative sentiment expressions. For the question class,
has the highest mean importance (0.4025), but also a very large standard deviation (±0.3546), suggesting strong instability across seeds. This likely reflects both the inherent ambiguity of the question samples and the severe class imbalance in the WISESIGHT dataset, where the question class has the fewest training examples. Consequently, kernel preference for this class becomes more sensitive to initialization and optimization dynamics.
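The mean ± standard deviation summary and the ranking-based reading described above can be sketched as follows. All weights here are hypothetical and do not correspond to Table 9; the sketch also assumes the sample standard deviation, which the text does not specify.

```python
from statistics import mean, stdev

KERNELS = (2, 3, 4, 5)

# Hypothetical class-aware kernel weights for one class, one row per seed
# (42, 123, 777). Each row sums to 1; values are illustrative only.
weights_by_seed = [
    [0.10, 0.20, 0.30, 0.40],
    [0.20, 0.20, 0.30, 0.30],
    [0.15, 0.20, 0.30, 0.35],
]

# Mean and standard deviation of each kernel's weight across seeds.
summary = {}
for idx, k in enumerate(KERNELS):
    column = [row[idx] for row in weights_by_seed]
    summary[k] = (mean(column), stdev(column))

# Rank kernels by mean weight; the interpretation relies on rankings
# that hold across seeds rather than on exact numerical values.
ranking = sorted(KERNELS, key=lambda k: summary[k][0], reverse=True)
```

A kernel whose standard deviation is large relative to its mean, as reported for the question class (0.4025 ± 0.3546), has a ranking that can change from seed to seed, which is why such preferences are treated as tendencies.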
While class-aware analysis provides global class-level tendencies, instance-aware analysis demonstrates how the proposed model dynamically adjusts kernel importance for individual samples.
Table 10 presents representative examples from different sentiment classes.
The instance-aware analysis shows that kernel selection varies across individual samples, as expected, because the proposed Instance-Aware DKS adapts kernel importance to sentence-level semantic characteristics. For example, long positive and interrogative sentences such as S5 and S8 favor
, indicating that broader contextual patterns are important for understanding such expressions. In contrast, some negative samples, such as S7, prefer
with relatively low standard deviation, suggesting medium-range local semantic patterns. These results demonstrate that kernel selection should vary at the sample level rather than remain fixed for all inputs. This supports the motivation of the proposed DKS framework. Together, the class-aware and instance-aware analyses provide complementary insights into how the proposed WangchanBERTa-IC-DKS framework adapts kernel importance across sentiment classes and individual samples, thereby improving the interpretability of the model’s behavior. Overall, the results from
Table 9 and Table 10 suggest that kernel preference distributions should be interpreted as empirical tendencies rather than definitive linguistic conclusions. While some relatively consistent patterns can be observed, particularly for the negative class, the variability across random seeds indicates that strong causal interpretations should be avoided. The class-aware and instance-aware kernel analyses therefore provide useful interpretability evidence, but their findings should be treated as reasonable hypotheses rather than empirically proven linguistic properties.
4.8. Error Analysis on the WISESIGHT Dataset
The confusion matrix in
Figure 5 and the complete error analysis in
Table 11 provide deeper insight into the classification behavior of the proposed model on the WISESIGHT test set. The confusion matrix and error analysis correspond to the run whose macro-F1 was closest to the average performance across multiple random seeds, providing a representative view of the model behavior. The results show that the most frequent confusion occurred between the positive and neutral classes. Specifically, 219 positive samples were incorrectly predicted as neutral, while 144 neutral samples were misclassified as positive. This pattern is further supported by the examples shown in
Table 11 (S12–S14), where positive texts were predicted as neutral despite high confidence scores (0.917–0.986). These examples reveal that many positive expressions in Thai social media are implicit and do not contain strong sentiment words. For instance, expressions such as successful expectations, product availability inquiries with positive intent, or weak approval statements can appear semantically similar to neutral statements. As a result, the model tends to assign them to the neutral class.
A similar issue appears between the negative and neutral classes. The confusion matrix indicates that 143 negative samples were misclassified as neutral and 117 neutral samples were misclassified as negative.
Table 11 provides representative examples of both directions. In S11, a neutral statement regarding a promotion limit (“only 6 items”) was misclassified as negative because complaint-like lexical cues triggered a negative interpretation. In contrast, S15, a polite apology message labeled as negative, was predicted as neutral because softened complaint expressions weakened the negative sentiment signal. These cases demonstrate that sentiment polarity in Thai often depends on pragmatic interpretation rather than explicit emotional words. The question class also suffers from substantial confusion with the neutral class.
Table 11 (S16–S17) shows that short interrogative expressions, such as asking product availability or requesting a car price, were predicted as neutral with high confidence. This occurs because many Thai questions resemble simple information requests and often lack explicit lexical indicators that clearly signal a question, leading the model to interpret them as neutral informational statements rather than genuine questions. An important observation from
Table 11 is that several misclassified samples were predicted with very high confidence, often above 0.90. This indicates that the errors are not caused solely by model uncertainty, but rather by semantic overlap between classes and by annotation ambiguity inherent in the dataset. Therefore, the main challenge is not only feature extraction but also the subtle boundary between sentiment categories, especially positive–neutral and question–neutral pairs.
Overall, both the confusion matrix and qualitative examples consistently indicate that the primary challenge in Thai sentiment classification lies in distinguishing semantically similar classes and handling minority classes. Although the proposed instance-class-aware dynamic kernel selection mechanism improves local semantic representation and achieves the best overall macro-F1 performance, implicit sentiment expressions and severe class imbalance remain important limitations, and they constitute promising directions for future work.
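The confusion counts quoted in this subsection can be aggregated per class pair to compare the two main error sources. The sketch below uses only the four off-diagonal counts stated in the text; the remaining cells of the confusion matrix are not reproduced here.

```python
# Off-diagonal confusion counts quoted in the text: (true, predicted) -> count.
errors = {
    ("positive", "neutral"): 219,
    ("neutral", "positive"): 144,
    ("negative", "neutral"): 143,
    ("neutral", "negative"): 117,
}

# Sum both directions of each class pair.
pair_totals = {}
for (true_label, predicted), count in errors.items():
    pair = tuple(sorted((true_label, predicted)))  # ignore direction
    pair_totals[pair] = pair_totals.get(pair, 0) + count

most_confused = max(pair_totals, key=pair_totals.get)
```

Counting both directions, the positive and neutral classes account for 363 errors and the negative and neutral classes for 260, which matches the observation that the positive–neutral boundary is the most frequent source of confusion.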
4.9. Performance of the Proposed Model on the 40 Thai Children’s Tales Dataset
Table 12 presents the performance comparison between the previous studies and the proposed WangchanBERTa-IC-DKS model with kernel candidate [2, 3, 4, 5] on the 40 Thai children’s tales dataset. The results show that the proposed model achieves a macro-F1 of 79.95%, outperforming the previous SOTA result of 78.59%. The proposed model improves the F1 for the positive class (a minority class) from 75.35% to 80.79%, at the cost of a slightly lower F1 for the neutral and negative classes (as shown in Table 13). Although the numerical improvement in macro-F1 is only 1.36 percentage points, this gain is meaningful because macro-F1 weights every class equally and is therefore better suited than overall accuracy to multi-class sentiment classification when balanced performance across all sentiment classes is required. The proposed model also achieves the highest accuracy (80.42%), macro-precision (81.38%), and macro-recall (79.61%), indicating that the improvement is not limited to a single evaluation aspect but reflects a more robust classification capability across all sentiment categories. The performance of WangchanBERTa-IC-DKS indicates that dynamic kernel selection remains effective not only for social media sentiment classification, such as WISESIGHT, but also for literary and narrative text classification in children’s tales.
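The difference between macro-F1 and accuracy can be illustrated with a small worked example. The sketch below assumes the usual definition of macro-F1 as the unweighted mean of per-class F1 scores; the function names and the toy labels are illustrative assumptions and do not correspond to the data in Tables 12 and 13.

```python
from typing import Dict, List


def per_class_f1(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """Compute F1 for each class from gold and predicted labels."""
    scores = {}
    for c in sorted(set(gold) | set(pred)):
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-class F1, so each class counts equally."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)


def accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of samples whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy imbalanced data: a classifier that always predicts the majority class.
gold = ["neu"] * 8 + ["pos"] * 2
pred = ["neu"] * 10

print(f"accuracy = {accuracy(gold, pred):.3f}")
print(f"macro-F1 = {macro_f1(gold, pred):.3f}")
```

In this toy case the always-neutral classifier keeps a high accuracy while its macro-F1 drops sharply, because the ignored minority class contributes an F1 of zero to the average.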
Figure 6 and Table 14 consistently demonstrate that the main classification difficulty in the 40 Thai children’s tales dataset lies in distinguishing between neutral, positive, and negative sentiments, particularly when emotional polarity is weak, implicit, or context-dependent. In the confusion matrix in Figure 6, substantial confusion arises between the negative and neutral classes, with 17 negative samples incorrectly predicted as neutral, and between the positive and neutral classes, with 13 positive samples misclassified as neutral. This suggests that the model exhibits a strong neutral bias, particularly when emotional signals are subtle or indirect. This observation is supported by the detailed error cases presented in Table 14. For example, sample S21 (Neu→Neg) shows that the presence of context-related negative cues may have biased the model toward the negative class. Similarly, S22 (Pos→Neu) demonstrates that weak positive sentiment and contextual ambiguity make positive meaning difficult to detect, leading the model to prefer the neutral class. In S23 (Neg→Neu), complaint-like lexical cues trigger a misleading negative interpretation despite the actual neutral intent, while S24 (Pos→Neu) shows that softened or polite expressions weaken clear emotional polarity. Another frequent issue appears in sentences with ambiguous structural patterns. For instance, S25 (Neg→Neu) shows that question-like sentence forms resemble neutral information requests, making classification difficult without strong interrogative markers. Likewise, S26 (Neu→Pos) indicates that short interrogative expressions lacking explicit question indicators confuse the classifier, causing incorrect polarity assignment. Sample S27 (Neu→Pos) further shows that mixed contextual cues and ambiguous sentiment polarity blur class boundaries and reduce prediction reliability.
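The neutral bias described above can be quantified as the share of each non-neutral class's errors that end up in the neutral class. The following is a minimal sketch under that assumption; the function name, the label strings, and the counts are purely illustrative and do not reproduce the confusion matrix in Figure 6.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def neutral_error_share(
    pairs: Iterable[Tuple[str, str]], neutral: str = "neu"
) -> Dict[str, float]:
    """For each non-neutral gold class, the fraction of its errors predicted as neutral."""
    confusion = Counter(pairs)  # maps (gold, pred) to a count
    gold_classes = {g for g, _ in confusion if g != neutral}
    shares = {}
    for gold in gold_classes:
        errors = sum(n for (g, p), n in confusion.items() if g == gold and p != gold)
        to_neutral = confusion[(gold, neutral)]
        shares[gold] = to_neutral / errors if errors else 0.0
    return shares


# Illustrative (gold, predicted) pairs; the counts are not from the paper.
pairs = (
    [("neg", "neg")] * 5 + [("neg", "neu")] * 3 + [("neg", "pos")] * 1
    + [("pos", "pos")] * 4 + [("pos", "neu")] * 2
    + [("neu", "neu")] * 6 + [("neu", "neg")] * 1
)

shares = neutral_error_share(pairs)
print(shares)
```

A share close to 1.0 for a class means that almost all of its errors are absorbed by the neutral class, which is the pattern the confusion matrix discussion attributes to weak or implicit sentiment.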
Overall, most errors arise not from strong sentiment expressions but from implicit sentiment, softened emotional language, complaint-like wording, and ambiguous contextual clues, which are common characteristics of Thai children’s tales. These findings explain why the model frequently defaults to the neutral class and highlight the limitation of relying primarily on surface lexical features. Future improvements may require stronger context-aware semantic modeling and better handling of pragmatic and implicit sentiment expressions to reduce these boundary-level classification errors.
5. Conclusions and Future Work
This study proposed WangchanBERTa with Instance-Class-Aware Dynamic Kernel Selection (WangchanBERTa-IC-DKS) for sentiment analysis on short, noisy, and highly imbalanced Thai texts. The proposed framework integrates the contextual representation capability of WangchanBERTa with a multi-kernel convolutional neural network and a dynamic kernel selection mechanism that jointly considers both instance-aware and class-aware information. Unlike conventional WangchanBERTa-CNN models that rely on fixed combinations of convolutional kernels, the proposed IC-DKS model dynamically adjusts kernel importance based on both sentence-level semantic characteristics and sentiment-class preferences, enabling more effective local feature extraction and improved minority-class recognition. Experimental results on the WISESIGHT sentiment benchmark showed that the kernel candidate [2, 3, 4, 5] achieved the best overall performance within the investigated parameter space, suggesting that kernel size selection is important for Thai sentiment classification. The observed kernel preference distributions suggest that different sentiment classes may benefit from different receptive fields, indicating that fixed kernel selection may be less effective for capturing diverse local semantic patterns across sentiment classes. The proposed WangchanBERTa-IC-DKS model achieved the highest macro-F1 of 63.93%, outperforming the previous state-of-the-art Parallel Hybrid model (62.70%) by 1.23 percentage points. Ablation studies also verified that combining both class-aware and instance-aware dynamic kernel selection improves balanced classification performance compared with fixed multi-kernel CNN baselines. The effectiveness of the proposed model was further validated on the 40 Thai children’s tales dataset, which comprises shorter, less noisy narrative texts with imbalanced classes. 
On this dataset, the proposed model achieved a macro-F1 of 79.95%, outperforming the previous state-of-the-art result (78.59%) by 1.36 percentage points. These findings suggest that dynamic kernel selection is more effective than fixed convolutional kernels for real-world Thai sentiment analysis, particularly on highly imbalanced datasets such as WISESIGHT. In addition, the proposed framework improves robustness for difficult minority classes, which remain among the most challenging categories in Thai sentiment classification.
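The idea of weighting kernel candidates by both instance-aware and class-aware information can be sketched in a few lines. This is only an illustration under stated assumptions: the additive combination of the two score vectors, the softmax normalization, and all function names are hypothetical and are not the formulation used by WangchanBERTa-IC-DKS. Only the kernel candidate [2, 3, 4, 5] is taken from the text.

```python
import math
from typing import List, Sequence

KERNEL_SIZES = [2, 3, 4, 5]  # kernel candidate reported in the text


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kernel_weights(
    instance_scores: Sequence[float], class_scores: Sequence[float]
) -> List[float]:
    """Assumed gating: add instance-aware and class-aware scores, then normalize."""
    combined = [i + c for i, c in zip(instance_scores, class_scores)]
    return softmax(combined)


def fuse(features: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Weighted sum of per-kernel pooled feature vectors."""
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


# Illustrative scores: the instance favors kernel size 3, the class prior favors 2.
instance_scores = [0.0, 1.0, 0.0, 0.0]
class_scores = [0.5, 0.0, 0.0, 0.0]
weights = kernel_weights(instance_scores, class_scores)

# One pooled 2-dimensional feature vector per kernel size (toy values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
fused = fuse(features, weights)
print(dict(zip(KERNEL_SIZES, [round(w, 3) for w in weights])))
```

With equal scores for every kernel, the weights reduce to a uniform average, which corresponds to the fixed multi-kernel baseline that the dynamic mechanism is compared against.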
Although the proposed model achieves competitive performance, several limitations remain. First, the experiments were conducted on only two Thai sentiment datasets, namely WISESIGHT and the 40 Thai children’s tales dataset. Although these datasets represent both noisy social media text and short narrative text, further evaluation on additional Thai sentiment datasets from different domains would provide stronger empirical support for the robustness of the proposed model across different classification settings. Second, severe ambiguity between the neutral and positive classes, as well as between the neutral and negative classes, remains a challenging issue. Future work may explore hierarchical two-stage classification strategies, in which the model first distinguishes between neutral and non-neutral instances and then classifies the non-neutral instances into positive, negative, and question classes. This may help reduce the strong neutral bias observed in the confusion matrix. In addition, neutral-bias calibration mechanisms, dynamic threshold-adjustment techniques, and hard example mining based on confusion-prone samples could be further investigated to improve minority-class robustness and balanced classification performance.
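The two-stage strategy and the threshold adjustment mentioned above can be sketched as a simple decision rule. The sketch assumes that a first-stage model outputs the probability that an instance is non-neutral and that a second-stage model outputs a distribution over the positive, negative, and question classes. The function name, the default threshold, and the example probabilities are hypothetical and are not results from this study.

```python
from typing import Dict


def two_stage_predict(
    p_non_neutral: float,
    stage2_probs: Dict[str, float],
    threshold: float = 0.5,
) -> str:
    """Stage 1: neutral vs. non-neutral. Stage 2: choose among non-neutral classes."""
    if p_non_neutral < threshold:
        return "neutral"
    return max(stage2_probs, key=stage2_probs.get)


# Illustrative probabilities for a weakly positive sentence.
stage2 = {"positive": 0.7, "negative": 0.2, "question": 0.1}

default_label = two_stage_predict(0.45, stage2)
adjusted_label = two_stage_predict(0.45, stage2, threshold=0.4)
print(default_label, adjusted_label)
```

Lowering the first-stage threshold moves borderline instances out of the neutral class, which is one way a threshold-adjustment technique could counteract neutral bias, at the risk of more false non-neutral predictions.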