1. Introduction
Sentiment Analysis (SA), the computational study of opinions and emotions in text, has become an essential tool for understanding public opinion across diverse domains [
1]. It enables researchers and organizations to efficiently measure how people feel about products, services, policies, and events by analyzing large volumes of textual feedback. However, traditional SA typically treats an entire document or sentence as a single unit with one overall sentiment label (e.g., positive, negative, or neutral). This approach can omit important details in texts covering multiple topics. Aspect-Based Sentiment Analysis (ABSA) addresses this limitation by determining the sentiment with respect to specific aspects or topics mentioned within a text [
2]. In other words, ABSA provides a fine-grained view of opinions by identifying what exactly each sentiment is about. For instance, a product review might praise the “battery life” of a laptop, but criticize its “screen” quality. While the review could be seen as ambivalent overall, an ABSA would reveal a positive sentiment towards the “battery life” aspect and a negative sentiment towards the “screen” aspect. This level of granularity is crucial for fully capturing complex sentiments. When the relevant aspects correspond to broader themes or subjects, this approach is often referred to as Topic-Based Sentiment Analysis (TBSA) [
3,
4]. The term “topics” is used in this study instead of “aspects” to more accurately describe the broader thematic categories addressed in the interviews, which involve complex educational dimensions rather than the simple product features typically analyzed in ABSA.
The need for TBSA becomes apparent in contexts where feedback encompasses diverse themes, such as education during crises [
5]. A prominent recent example is Emergency Remote Teaching (ERT), the rapid shift to online instruction implemented as a temporary response to emergency situations like the COVID-19 pandemic. Unlike well-planned online courses, ERT was implemented on short notice, leading to a wide range of experiences and reactions from teachers, students, and parents [
6]. Each stakeholder’s perspective on ERT touches on multiple aspects of the teaching and learning experience. For instance, a teacher might appreciate the flexibility of teaching from home, yet feel frustrated by reduced student engagement. Similarly, a student might enjoy the comfort of remote attendance, but struggle with technical difficulties. A school director might value the rapid implementation of digital tools during remote teaching, but simultaneously express concern over the lack of preparedness among staff and gaps in student performance. A single overall sentiment score for an interview or survey response in such cases would fail to capture these conflicting feelings. In contrast, a TBSA can identify sentiments tied to each issue (e.g., positive about flexibility, negative about engagement), offering a much richer understanding of the feedback. Analyzing interview transcripts with this fine-grained lens not only highlights which aspects of ERT were viewed positively or negatively, but also helps educators and policymakers identify specific areas of success or concern. However, manually extracting and evaluating sentiments on a per-topic basis from many in-depth interviews is labor-intensive and subject to human bias. This challenge underscores the importance of automated TBSA techniques to systematically and consistently interpret the sentiments expressed on each topic within large collections of textual feedback.
Recent advances in Natural Language Processing (NLP) have made it possible to perform detailed SA with high accuracy [
7]. Early approaches often relied on lexicons or traditional Machine Learning (ML) models with manually extracted features, and they struggled with the inherent complexity of natural language [
8]. The introduction of the transformer architecture in 2017 paved the way for a new generation of powerful NLP models [
9]. BERT (introduced in 2018) and its successors (e.g., RoBERTa) are able to learn rich language representations from massive text corpora and can be fine-tuned for tasks like SA with relatively few data [
10]. By capturing context and meaning through self-attention, these transformer-based models have achieved state-of-the-art performance in sentiment classification, often outperforming earlier methods. They can even detect mixed sentiments in a single sentence that contains both praise and criticism, a feat that was challenging for previous approaches. This contextual sensitivity makes transformers particularly suitable for ABSA, where distinguishing sentiments across different topics within the same text is essential.
Transformer models have indeed been applied successfully to aspect-level sentiment classification problems [
11]. By incorporating information, such as a given topic or target term, into the model’s input or by using attention mechanisms to focus on relevant portions of text, these models can determine the sentiment specific to each aspect mentioned. For instance, BERT-based architectures have been used for target-dependent SA, where the model learns to associate sentiment-laden words with the correct target entity in a sentence. This means that in a statement such as “The teacher’s feedback was great but the platform was unreliable,” the model can attribute the positive sentiment to “feedback” and the negative sentiment to “platform”. Such transformer-powered ABSA methods consistently outperform earlier approaches based on Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs) [
12]. They also eliminate the need for extensive feature engineering or separate aspect-extraction steps, since the model can implicitly learn to attend to different topics and sentiment indicators as part of its training.
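To make the input format concrete, the following minimal Python sketch encodes a (sentence, topic) pair for a BERT-style sequence classifier, so that self-attention can condition the sentiment prediction on the supplied topic. The checkpoint name is a generic multilingual placeholder with a freshly initialized two-label head, so the printed scores only illustrate the encoding scheme, not a trained ABSA system.

```python
# Minimal sketch: feeding a (sentence, topic) pair to a BERT-style classifier.
# The checkpoint and label set are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-multilingual-uncased"  # placeholder; a fine-tuned model would be used in practice
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

sentence = "The teacher's feedback was great but the platform was unreliable."
for topic in ["feedback", "platform"]:
    # The topic is passed as the second segment of the input pair.
    inputs = tokenizer(sentence, topic, return_tensors="pt", truncation=True)
    with torch.no_grad():
        scores = model(**inputs).logits.softmax(dim=-1)
    print(topic, scores.tolist())  # [P(negative), P(positive)] once the head is fine-tuned
```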
Recent research has emphasized the growing importance of developing NLP applications for low-resource languages, highlighting the lack of robust tools and annotated datasets for complex tasks such as SA [
13]. As a result, fine-grained SA remains far less explored in languages other than English. In Greek, for instance, annotated datasets for SA are limited, and only recently have transformer models become available for the language [
14]. Despite these limitations, there has been a growing interest in exploring SA for Greek, especially in domain-specific contexts like social media, product reviews, and educational discussions [
15]. In the absence of language-specific models, researchers must rely on multilingual models that include Greek in their training (such as mBERT or XLM-RoBERTa). Applying TBSA to Greek educational interviews constitutes a novel and understudied undertaking, especially given the scarcity of prior work using qualitative stakeholder data in Greek. TBSA in the educational domain has its own complexities: the language used by the educational community (teachers, students, parents) when discussing their schooling experiences differs from the language in product reviews or social media posts, which are common sources for training sentiment models. These gaps in language resource availability and domain-specific usage motivate a focused evaluation of state-of-the-art transformer models in this setting.
In this study, a comparative evaluation of transformer language models for TBSA on ERT interview data is presented. Several models, including both multilingual transformers and models pretrained specifically on Greek data, are evaluated to assess how well they capture topic-level sentiments in a low-resource educational context.
Current research addresses the following research questions:
RQ1: How effectively can transformer-based language models detect sentiment in education-related qualitative data?
RQ2: How does model performance vary across different groups of stakeholders: parents, school directors, and teachers?
RQ3: To what extent does the inclusion of topic context influence the accuracy of sentiment detection?
RQ4: Which transformer-based model achieves the highest performance in identifying both positive and negative sentiment in this domain?
This study contributes to the field of Sentiment Analysis (SA) utilizing Greek-language interview data. Specifically, (i) it introduces three original datasets reflecting diverse perspectives from various educational stakeholders, each explicitly annotated for topic and sentiment; (ii) it constitutes the first application of TBSA to interviews conducted within the Greek educational context; (iii) it provides a comparative evaluation of four transformer-based language models, namely GreekBERT, XLM-r-Greek, mBERT, and PaloBERT, focusing on both overall Sentiment Analysis and performance within specific topics; (iv) it demonstrates that models pretrained specifically on Greek outperform multilingual counterparts, especially in accurately identifying negative sentiments and interpreting responses containing nuanced emotional cues; and (v) it offers empirical evidence highlighting prevalent negative emotional responses toward ERT, thus informing potential pedagogical interventions and future emotional support frameworks. These contributions offer valuable resources and insights for advancing SA in Greek-language corpora and support the development of more accurate and context-sensitive language models.
3. Materials and Methods
The current study builds upon prior work conducted by the same research team in the field of NLP for educational data analysis in Modern Greek. Our previous study focused on ERT and its impact on students with functional diversity during the COVID-19 pandemic [
68]. In that study, the research team applied classical ML algorithms, including NB and SVM, to perform topic identification on interview transcripts collected from educational stakeholders. Using TF-IDF representations, the models successfully categorized key thematic challenges related to pedagogical barriers, emotional strain, and technological limitations. That study demonstrated the potential of traditional ML methods for structured classification of qualitative educational discourse in low-resource language contexts like Greek. Building directly on this foundation, our research team extended the methodological framework by incorporating transformer-based architectures for topic classification on a broader interview dataset [
69]. That research evaluated both multilingual and language-specific models, namely mBERT, XLM-R Greek, and GreekBERT, across a multi-class classification task aligned with the educational dimensions of ERT. Among the models tested, GreekBERT produced the highest F1 score of 0.76, significantly outperforming both classical ML models and multilingual transformers. These results confirmed the value of domain-specific fine-tuning and contextual embedding for capturing thematic content in Modern Greek educational narratives. The combined results of these prior studies form the conceptual and methodological basis of the present research, serving as a foundation for advancing from topic classification to TBSA of stakeholder perspectives on ERT.
3.1. Methodological Framework
All steps followed in this study, from dataset preparation to model evaluation and result visualization, are summarized in
Figure 1.
The SA process was implemented using the Python (version 3.12) programming language, and relied on a combination of PyTorch (version 1.0.2), Hugging Face Transformers (version 4.50.0), and Scikit-learn (version 1.6.1) to build, train, and evaluate classification models on Modern Greek text [
70]. The process began by loading and preprocessing a labeled dataset consisting of user-generated text entries annotated with binary sentiment values (0 = negative, 1 = positive). Texts were also annotated with predefined semantic topics, allowing for downstream topic-based performance evaluation. Neutral sentiments, when present, were excluded in order to maintain a binary classification setup. The impact of the neutral sentiment exclusion is analyzed in detail in the next subsection.
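As a concrete illustration of this loading and filtering step, the sketch below assumes a CSV export with hypothetical column names ("text", "sentiment", "topic"); the file name and column layout are assumptions for illustration, not the original data release.

```python
# Hedged sketch of dataset loading and neutral-class exclusion.
# File name and column names are assumptions.
import pandas as pd

df = pd.read_csv("ert_interviews.csv")                     # hypothetical export of the annotated sentences
df = df[df["sentiment"] != "neutral"].copy()               # drop neutral instances for the binary setup
df["label"] = (df["sentiment"] == "positive").astype(int)  # 0 = negative, 1 = positive
print(df.groupby("topic")["label"].value_counts(normalize=True))  # inspect class balance per topic
```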
3.2. Dataset Collection and Annotation
The current study utilizes qualitative datasets obtained through semi-structured interviews to explore the experiences of ERT among three primary groups: (i) Parents of Students with Functional Diversity, (ii) School Directors, and (iii) Teachers. Participants represent diverse demographics and come from urban, suburban, and rural settings across mainland and island regions, providing a holistic perspective. These datasets, gathered from 12 Parents of Students with Functional Diversities (PSFD dataset), 15 School Directors (SCH dataset), and 15 Teachers (TCH dataset), were meticulously constructed from interviews in the Modern Greek language. All interviews adhered to ethical standards, and participants provided informed consent in accordance with GDPR regulations. The interviews were designed to elicit detailed narratives around participants’ lived experiences with ERT. Common recurring expressions included frustration with digital platforms, emotional isolation, and satisfaction with asynchronous material availability. These lexical patterns informed the topic annotation process and justified the application of TBSA over document-level Sentiment Analysis. The datasets and research protocols are available under the Creative Commons Attribution Non-Commercial No Derivatives 4.0 International license (CC BY-NC-ND 4.0).
Each dataset was labeled into four predefined topics: (i) Material and Technical Conditions, (ii) Educational Dimension, (iii) Psychological/Emotional Dimension, and (iv) Learning Difficulties alongside ERT. The data reflect a multi-class structure and were domain-specific, capturing the detailed experiences of the participants. Each dataset was annotated at the sentence level by two independent annotators, who assigned each sentence to one of four predefined thematic categories-topics based on its semantic content. In addition to topic classification, each sentence was also labeled with a sentiment value as positive, negative, or neutral. The first annotator was an internal member of the research team with a thorough understanding of the project and the annotation protocol. The second annotator was external to the research team and had expertise in the field. This annotator followed the provided annotation guidelines to assign both topic and sentiment labels. To ensure consistency and clarity in the process, illustrative examples for each topic and sentiment category were included in the annotation instructions. The annotation process resulted in a high level of agreement between annotators, with an overall agreement rate of 93%. To account for agreement occurring by chance, inter-annotator reliability was further assessed using Cohen’s Kappa coefficient, yielding a score of
[
71]. In cases where disagreements occurred, a third experienced annotator with relevant research background was consulted to make the final decision and resolve any ambiguity.
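For reference, raw agreement and the chance-corrected Cohen’s Kappa can be computed with scikit-learn as in the sketch below; the annotator label lists are dummy values, not the study’s annotations.

```python
# Illustrative inter-annotator agreement computation (dummy labels).
from sklearn.metrics import cohen_kappa_score

annotator_1 = ["negative", "negative", "positive", "neutral", "negative"]
annotator_2 = ["negative", "positive", "positive", "neutral", "negative"]

raw_agreement = sum(a == b for a, b in zip(annotator_1, annotator_2)) / len(annotator_1)
kappa = cohen_kappa_score(annotator_1, annotator_2)  # corrects raw agreement for chance
print(f"raw agreement = {raw_agreement:.2f}, Cohen's kappa = {kappa:.2f}")
```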
3.3. Preprocessing and Class Imbalance Handling
Several studies have highlighted that including a neutral sentiment class with very few examples can hinder model performance. More specifically, Valdivia et al. [
72] reported that neutral reviews were often omitted due to their ambiguity and lack of information, treating the neutral class as noise to improve binary (positive/negative) classification results. Likewise, Shahzad et al. [
73] found that a binary sentiment classifier achieved much higher accuracy than a three-class (positive/neutral/negative) model when neutral examples were underrepresented, indicating better generalization without the neutral class. In small or imbalanced datasets where neutral instances were rare, researchers observed that standard algorithms struggle with the neutral class; in fact, removing the neutral category can yield significantly improved performance in such cases [
74]. These findings suggest that when neutral samples are limited or add ambiguity, focusing on binary sentiment polarity is often more effective. In this study, the neutral sentiment class was excluded due to its low representation and limited contribution to model learning, allowing for a more effective and focused binary sentiment classification. Preliminary experiments including the neutral class resulted in lower overall performance compared to the binary setting, confirming the benefit of its exclusion.
The combined pie charts in
Figure 2 illustrate the sentiment distribution in the dataset before and after preprocessing, providing a clear comparison of the data transformation. In the first chart, representing the dataset before preprocessing, three sentiment categories are displayed: positive, negative, and neutral. The neutral sentiment accounted for 9.9% of the dataset, reflecting instances where the sentiment was indeterminate or lacked emotional polarity. Positive and negative sentiments were dominant, constituting 28.0% and 62.1% of the data, respectively, highlighting a significant prevalence of negative sentiment at this stage.
The second chart, representing the dataset after preprocessing, excludes the neutral sentiment category as part of data cleaning, leaving only positive and negative sentiments. This refinement resulted in 31.3% positive and 68.7% negative sentiments, underscoring the dominance of the negative sentiment.
The exclusion of the neutral class not only simplifies the dataset, but also sharpens the focus of the subsequent SA. The resulting distribution aligns the data with the requirements of binary sentiment classification, particularly for transformer-based models, which typically operate more effectively with distinct sentiment labels. The high proportion of negative sentiments also signals potential challenges with class balance, emphasizing the need for strategies such as class weighting. This preprocessing step is therefore central to obtaining focused and interpretable sentiment results.
For data handling and preprocessing, the Pandas library was used to clean, organize, and split the dataset into training and test subsets using an 80/20 split ratio. Given the small size of our dataset, we adopted a repeated holdout validation strategy with five independent 80/20 train–test splits, averaging performance across them, rather than using full k-fold cross-validation, which may produce unstable or biased estimates on limited data [
75,
76]. To support transformer-based training, the Hugging Face AutoTokenizer was employed to tokenize the textual data while respecting maximum sequence lengths and applying truncation and padding. A custom Dataset class, compatible with PyTorch’s torch.utils.data.Dataset, was defined to manage the tokenized inputs and associated sentiment labels efficiently. The tokenized data were then passed to the Hugging Face Trainer API for model training.
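The tokenization and Dataset wrapper described above can be sketched as follows. The sketch continues from the loading example (`df`), uses GreekBERT’s public checkpoint name as an assumed example, and fixes an illustrative maximum sequence length rather than the study’s exact setting.

```python
# Sketch of tokenization, a torch Dataset wrapper, and one of the five 80/20 holdout splits.
import torch
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")  # assumed GreekBERT checkpoint

class SentimentDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels, max_length=128):
        # Tokenize with truncation and padding to a fixed maximum length.
        self.encodings = tokenizer(texts, truncation=True, padding="max_length", max_length=max_length)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# One of the five independent stratified 80/20 splits (vary random_state across repetitions).
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df["text"].tolist(), df["label"].tolist(), test_size=0.2, stratify=df["label"], random_state=0)
train_ds = SentimentDataset(train_texts, train_labels)
test_ds = SentimentDataset(test_texts, test_labels)
```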
The interview corpus underwent detailed preprocessing to ensure compatibility with model requirements. Categorical labels for TC and SA were converted to numerical formats. The neutral sentiment category was excluded to emphasize binary sentiment classification.
To address potential class imbalance, class weights were computed dynamically from the training set and incorporated into the training process through the model’s loss function, CrossEntropyLoss. Training arguments were specified using Hugging Face’s TrainingArguments, which allowed for the configuration of hyperparameters such as learning rate, batch size, number of epochs, weight decay, evaluation strategy, and checkpoint saving. During training, the best-performing model, according to evaluation metrics such as F1 score, was automatically selected.
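A minimal sketch of this class-weighted training setup, continuing from the previous sketches, is shown below. The hyperparameter values are illustrative placeholders rather than the study’s tuned settings (see Table A1), and subclassing the Trainer to inject a weighted CrossEntropyLoss is one common way to realize the described weighting, not necessarily the authors’ exact implementation.

```python
# Class-weighted fine-tuning with the Hugging Face Trainer (illustrative hyperparameters).
import numpy as np
import torch
from sklearn.metrics import f1_score
from sklearn.utils.class_weight import compute_class_weight
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# Weights computed dynamically from the training split.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=np.array(train_labels))
class_weights = torch.tensor(weights, dtype=torch.float)

class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = torch.nn.CrossEntropyLoss(weight=class_weights.to(outputs.logits.device))
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    return {"f1": f1_score(labels, preds, average="weighted")}

args = TrainingArguments(
    output_dir="sentiment_out", learning_rate=2e-5, num_train_epochs=4,
    per_device_train_batch_size=16, weight_decay=0.01,
    eval_strategy="epoch", save_strategy="epoch",
    load_best_model_at_end=True, metric_for_best_model="f1")

model = AutoModelForSequenceClassification.from_pretrained(
    "nlpaueb/bert-base-greek-uncased-v1", num_labels=2)
trainer = WeightedTrainer(model=model, args=args, train_dataset=train_ds,
                          eval_dataset=test_ds, compute_metrics=compute_metrics)
trainer.train()
```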
3.4. Sentiment and Topic Class Distribution Across Datasets
The PSFD dataset includes 831 sentences, the majority of them (72%) labeled as negative. The most frequent topic is Class 2, followed by Class 3 and Class 4. Regarding the SCH dataset, most sentences are related to Class 1, showing a focus on material and technical conditions. The TCH dataset contains the largest number of sentences overall, with almost equal representation of Class 1 and Class 2. In all datasets, negative sentiment appears more often than positive. This distribution of sentences, summarized in
Table 2, reveals important differences in both sentiment and topic focus across the three datasets. A visual summary of the topic distribution across datasets is provided in
Figure 3.
3.5. Model Selection & Evaluation
The models selected for this study are state-of-the-art transformer-based architectures derived from BERT, a leading approach in NLP that significantly outperforms conventional ML methods in various text analysis tasks, including SA and text classification [
77]. Large Language Models (LLMs) were not used in this study due to the small size and domain-specific nature of our dataset. Prior research has shown that LLMs often yield suboptimal or unstable results when fine-tuned on limited data [
78,
79], a finding supported by our preliminary experiments with available Greek-compatible LLMs. Models were specifically chosen because they support Modern Greek, either exclusively (GreekBERT, PaloBERT) or as part of multilingual pretraining (mBERT, XLM-r-Greek). Greek-specific models such as Greek-BERT [
80] and PaloBERT [
81], along with their sentiment-specific fine-tuned variants, PaloBERT Sentiment [
81] and Greek-BERT Sentiment [
82], provide specialized performance due to their pre-training and fine-tuning on Greek text datasets. Additionally, multilingual models based on the XLM-RoBERTa architecture [
83] have demonstrated efficiency when specifically fine-tuned for Greek SA, thus combining broad linguistic coverage with task-specific precision. Models like Multilingual BERT (mBERT) [
84] and XLM-RoBERTa Base [
83], while not fine-tuned for SA, offer a multilingual foundation and can achieve excellent results, upon fine-tuning, in Greek sentiment datasets. The flexibility of these models makes them highly suitable for conducting accurate SA research in Greek, ensuring reliable insights into the sentiments expressed in Greek textual data.
The GreekBERT model is a Greek-specific adaptation of Google’s BERT language model [
80]. It was trained using Google’s official BERT codebase and was subsequently converted for compatibility with PyTorch and TensorFlow using Hugging Face’s conversion scripts. This model comprises 12 layers, 768 hidden units, and 12 attention heads, totaling approximately 110 million parameters. Training involved 1 million steps with batch sizes of 256 sequences of length 512, using a learning rate of
on Google Cloud TPU v3-8 hardware provided by TensorFlow Research Cloud (TFRC) and GCP research credits.
The XLM-r-Greek model is a specialized Cross-Encoder designed for Greek Natural Language Inference (NLI) tasks and zero-shot classification [
85]. Developed jointly by the Hellenic Army Academy and the Technical University of Crete, it utilizes the SentenceTransformers’ Cross-Encoder class. The model was trained on the multilingual AllNLI dataset, incorporating Greek data generated through EN2EL neural machine translation. It outputs classification scores indicating “contradiction,” “entailment,” and “neutral” categories, and is also applicable in zero-shot classification scenarios, assessing the likelihood that sentences belong to provided labels or topics.
PaloBERT is a RoBERTa-based Greek language model specifically trained on social media content [
81]. Its training corpus comprised 458,293 documents sourced from various Greek social media accounts. Additionally, a GPT-2 tokenizer was trained from scratch using the same dataset.
These models were selected primarily because they are pretrained specifically on Greek language data, ensuring high linguistic relevance and accuracy for processing Greek text. Specifically, GreekBERT provides high accuracy for topic classification and SA tasks, while XLM-r-Greek demonstrates versatility and efficacy in inference and unsupervised classification scenarios. PaloBERT, trained explicitly on Greek social media content, offers exceptional suitability for SA and information extraction from social media platforms.
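For orientation, the sketch below lists plausible public Hugging Face checkpoints for the compared model families. These identifiers are our assumptions rather than names confirmed by the original experiments, and the XLM-r-Greek cross-encoder would substitute its own Greek fine-tuned checkpoint for the generic XLM-RoBERTa base shown here.

```python
# Assumed Hugging Face checkpoints for the compared model families (not confirmed by the paper).
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoints = {
    "GreekBERT": "nlpaueb/bert-base-greek-uncased-v1",        # assumed identifier
    "mBERT": "bert-base-multilingual-uncased",
    "XLM-RoBERTa Base": "xlm-roberta-base",                    # XLM-r-Greek would use its Greek fine-tuned cross-encoder instead
    "PaloBERT": "gealexandri/palobert-base-greek-uncased-v1",  # assumed identifier
}

for name, ckpt in checkpoints.items():
    tok = AutoTokenizer.from_pretrained(ckpt)
    mdl = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2, ignore_mismatched_sizes=True)
    # ...each model is then fine-tuned and evaluated with the same pipeline...
    print(name, sum(p.numel() for p in mdl.parameters()), "parameters")
```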
All model-specific training parameters are detailed in
Table A1. After training, each model was evaluated on the test set. Predictions were generated and compared against the sentiment labels to compute performance metrics using Scikit-learn, specifically the classification_report, confusion_matrix, and precision_recall_fscore_support functions. Additionally, ROC-AUC curves were plotted, while the confusion matrix was visualized using the seaborn heatmap utility [
86]. All evaluation metrics, including F1 score, precision, and recall (per class), were exported for documentation and further analysis.
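The evaluation step can be sketched as below, continuing from the training sketch (`trainer`, `test_ds`); plotting details are simplified relative to the figures reported in the paper.

```python
# Evaluation sketch: scikit-learn metrics, seaborn confusion-matrix heatmap, ROC curve.
import matplotlib.pyplot as plt
import seaborn as sns
import torch
from sklearn.metrics import (classification_report, confusion_matrix,
                             precision_recall_fscore_support, roc_auc_score, roc_curve)

pred_output = trainer.predict(test_ds)
y_true = pred_output.label_ids
y_pred = pred_output.predictions.argmax(axis=-1)
pos_probs = torch.softmax(torch.tensor(pred_output.predictions), dim=-1)[:, 1].numpy()

print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))
print(precision_recall_fscore_support(y_true, y_pred, average=None))  # per-class precision/recall/F1

sns.heatmap(confusion_matrix(y_true, y_pred), annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted"); plt.ylabel("Actual")

fpr, tpr, _ = roc_curve(y_true, pos_probs)
plt.figure()
plt.plot(fpr, tpr, label=f"ROC AUC = {roc_auc_score(y_true, pos_probs):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False positive rate"); plt.ylabel("True positive rate")
plt.legend(); plt.show()
```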
5. Discussion
5.1. Findings
The results from
Table 3 highlight the consistent superiority of GreekBERT in the PSFD sentiment classification task. Its strong performance across both classes, and particularly in the negative class, demonstrates its capacity to generalize well to the language patterns present in this dataset. A key observation is that all models performed significantly better on negative sentiment detection compared to positive sentiment. Negative class metrics, especially recall and F1 score, were consistently higher across models. This suggests that negative expressions in the PSFD dataset are more distinct, possibly due to clearer lexical cues or more consistent syntactic patterns, whereas positive expressions may be more subtle or diverse, leading to reduced recall and F1 scores. These findings point to a general challenge in the detection of positive sentiment, indicating the need for further research into data balancing or class-specific optimization strategies.
As shown in
Table 4, GreekBERT consistently outperformed the other models in the SCH dataset. Its nearly equal and high scores in both sentiment classes suggest strong generalization ability. Interestingly, unlike the PSFD dataset, where performance on the negative class was noticeably higher, the SCH dataset results are more balanced. GreekBERT, XLM-r-Greek, and even mBERT maintained solid scores for both positive and negative predictions. However, Palobert lagged behind, especially in recall for the positive class, which limited its overall performance. The relatively balanced results in SCH may reflect the dataset’s structure or linguistic clarity across sentiment types. This suggests that the nature of the dataset plays a critical role in sentiment model performance and highlights the consistency of GreekBERT across different textual domains.
The results in
Table 5 show that both GreekBERT and XLM-r-Greek are highly effective on the TCH dataset, although they display different behavior across sentiment classes. GreekBERT was particularly strong in detecting negative sentiment, achieving the highest recall and F1 score in that class. However, its lower recall in the positive class (0.60) reduced its balance. In contrast, XLM-r-Greek maintained more consistent performance across both classes, suggesting greater class balance sensitivity. As observed in the PSFD dataset, all models performed better on the negative class than on the positive class, especially in terms of recall. This recurring trend suggests that expressions of negative sentiment may be more linguistically distinct or more consistently annotated across datasets. The TCH results reinforce the conclusion that GreekBERT and XLM-r-Greek are reliable models, although further work may be needed to enhance performance in positive sentiment detection.
These findings have practical implications for educational policy. For instance, the consistent detectability of emotional distress suggests that automated tools could support mental health monitoring in educational settings, helping educators identify emotional challenges in student or teacher narratives. The results in
Table 7 reveal that topic class plays a critical role in sentiment classification performance. Notably, Class 3 (Psychological/Emotional Dimension) yielded the highest scores in both accuracy and F1 metrics, across multiple models. This trend suggests that sentiment polarity is more easily distinguishable in emotionally charged content. On the other hand, Class 4 (Learning Difficulties and ERT) showed lower F1 scores for positive sentiment, particularly for Palobert (16.7%), indicating greater difficulty in identifying positive expressions within that context. These findings highlight the need for topic-sensitive evaluation in SA and suggest that model performance cannot be fully understood without considering the thematic content of the text. These topics include less frequently discussed or highly specialized concepts, such as “parallel support teacher” or “attention deficit”, which are less common in general-purpose language model pretraining. Additionally, Learning Difficulties is the third most frequent category out of the four, making it relatively underrepresented in the dataset. This imbalance may limit the model’s exposure to sufficient examples during training. The difficulty in predicting specific classes may also stem from the limited number of annotated examples in those categories, which restricts the model’s ability to generalize effectively. Transformer-based models tend to perform more reliably with larger and more balanced datasets, and performance may be affected when topic-specific data are sparse or linguistically complex.
As shown in
Table 8, the SCH dataset reveals clear differences in model behavior across thematic categories. GreekBERT excelled in identifying both positive and negative sentiments in technical and educational contexts (Classes 1 and 2), while mBERT showed notable performance in Class 4, suggesting its strength in detecting sentiment related to learning challenges. The highest F1 scores were observed in more structured or emotionally salient categories, whereas performance dropped in categories with more ambiguity, such as Psychological/Emotional content (Class 3). These results support the view that topic structure and emotional clarity affect sentiment detection and highlight the importance of evaluating model performance through a topic-based lens.
The topic-specific results on the TCH dataset (
Table 9) reinforce the trend seen in other datasets: sentiment in psychologically or emotionally loaded topics (Class 3) is more reliably classified by BERT-based models. GreekBERT and mBERT excelled in this class, achieving the highest accuracy and F1 scores across all categories. In contrast, Class 2 (Educational Dimension) was more challenging, particularly for Palobert and mBERT in terms of positive sentiment identification. These findings suggest that model performance is highly sensitive to the thematic domain and support the integration of topic-aware evaluation in SA pipelines.
The F1 score comparison highlights a common trend across multilingual BERT-based models. GreekBERT’s strong performance, especially in negative sentiment detection, suggests that it captures polarity features more effectively in Greek-language content. XLM-r-Greek also performed reliably, indicating its consistency across domains. In contrast, PaloBERT’s relatively poor results may be attributed to limitations in pretraining or domain mismatch. One limitation of this study is the relatively small number of annotated interview samples, which may restrict the generalizability of the findings to broader educational populations. Moreover, the exclusive use of binary sentiment labels (positive/negative) excludes neutral responses that may carry valuable nuance, particularly in emotionally mixed or context-sensitive statements. Future work could explore more granular sentiment categories or continuous sentiment scales to better capture the subtleties in stakeholder feedback. The higher F1 scores in negative sentiment compared to positive across most models reinforce previous observations about the asymmetry in sentiment expression in the datasets. Additionally, the heatmap presented in
Figure 4 further illustrates these differences, showing that GreekBERT achieved the highest F1 scores across both sentiment classes and overall performance. These findings highlight the importance of using domain-specific or language-adapted models for sentiment classification tasks in underrepresented languages such as Greek.
The ROC AUC analysis (
Figure 5) reinforces earlier performance trends observed in precision, recall, and F1 scores. GreekBERT’s consistently high AUC demonstrates its effective generalization across datasets, likely due to its training on Greek-specific corpora. XLM-r-Greek also exhibits reliable performance, benefiting from multilingual contextualization. In contrast, mBERT’s moderate AUC suggests less effective representation for Greek, while PaloBERT’s lower score highlights potential issues in handling sentiment polarity. These differences may stem from model architecture, training data diversity, or domain mismatch. Although PaloBERT is pre-trained on Greek social media data, its performance was comparatively lower. This can be attributed to a linguistic mismatch, as the dataset consists of spoken interview transcripts that differ substantially in style, structure, and register from the informal and often fragmented language typically found in social media. Consequently, the representations learned by PaloBERT may not generalize effectively to the more formal, discourse-rich nature of the data, which likely contributes to its reduced effectiveness in the classification tasks.
It is also worth noting that ROC AUC values are consistently higher than the corresponding overall F1 scores for each model. This is expected, as AUC measures the model’s ability to discriminate between classes across all classification thresholds, while the F1 score is bound to a single decision point and is more sensitive to class imbalances and specific misclassifications. Including both metrics provides a more comprehensive evaluation: F1 reflects real-world performance at a specific threshold, and AUC reveals the overall classification potential. This distinction further validates GreekBERT’s superior performance, as it maintains high scores in both metrics.
The highest performance was observed when sentiment classification was applied without reference to specific topics, indicating that the models perform better when sentiment is analyzed independently of thematic topic categories. In addition, combining the datasets as one input led to improved results because the larger and more diverse set of training examples helped the models generalize more effectively across sentiment classes.
5.2. Error Analysis
This section presents illustrative examples of classification errors related to both sentiment and topic predictions. While many model outputs aligned well with human annotations, several misclassifications reveal limitations in capturing contextual nuances and semantic subtleties.
As shown in
Table 11, the model performs well when lexical cues are unambiguous and align with the overall tone or context of the sentence.
The first two examples reflect successful predictions in both the sentiment and topic dimensions, demonstrating the model’s ability to handle clear lexical cues and a consistent tone. The following two misclassified examples are examined to identify the linguistic and contextual factors that may have led to the model’s errors.
Example 1: «Δεν σας κρύβω πως πανικοβλήθηκα.» Translation: “I won’t hide from you that I panicked.” This sentence was correctly classified as having negative sentiment, but was misclassified in terms of topic. Although it expresses emotional distress, the model assigned it to the topic of Material and Technical Conditions, likely due to the presence of the word «πανικός» (panic), which may co-occur with technical issues in other examples. This indicates semantic overlap between thematic categories and highlights challenges in fine-grained topic differentiation.
Example 2: «Ήταν μια πολύτιμη εμπειρία που μας έδειξε πόσο τεχνολογικά απροετοίμαστο ήταν το σύστημα και πόσο μόνοι μας ήμασταν τελικά.» Translation: “It was a valuable experience that showed us how technologically unprepared the system was and how alone we truly were.” Although the sentence begins with the positive phrase «πολύτιμη εμπειρία» (valuable experience), its overall tone is clearly negative. The model misclassified it as positive sentiment, likely influenced by the surface-level lexical cue. This example highlights the model’s difficulty in recognizing irony, contradictory expressions, or shifts in emotional tone within the same sentence.
The observed misclassifications are primarily attributed to the following:
Overlapping contextual features across topic categories.
The complexity of emotional expression, particularly in cases involving mixed or implicit sentiments.
The influence of misleading keywords that override deeper semantic interpretation.
These examples emphasize the importance of enriching the training dataset with more contextually complex instances in order to improve the model’s sensitivity to subtle linguistic cues and discourse-level meaning.
5.3. Comparison with Relevant Research
While
Section 2.3 and
Section 2.4 provided a broader overview of relevant studies, the five works selected for
Table 12 represent a focused subset chosen for their strong relevance and close alignment with our study in terms of language, model architecture, and task design.
Michailidis [
15] applied GreekBERT to product reviews from a Greek e-commerce platform, achieving an F1 of 0.96 and surpassing both traditional and neural baselines. The dataset included binary sentiment annotations, and the results showed the power of BERT fine-tuning even in comparison to Large Language Models such as GPT-4. GreekBERT’s clear advantage over traditional Machine Learning models in Greek product review Sentiment Analysis underscores the suitability of transformer-based architectures for morphologically rich languages.
Chatzimina et al. [
45] focused on clinical Greek dialogues, classifying utterances into three sentiment categories. BERT outperformed other transformers (e.g., RoBERTa, XLNet), reaching a macro-F1 of 0.95, highlighting BERT’s capacity to capture emotion in health contexts.
Bilianos [
53] evaluated Greek product reviews using Greek BERT with an SVM classifier, reporting an F1 of 0.97, despite a small dataset (480 reviews). This demonstrated BERT’s effectiveness even with minimal data and highlighted the impact of transformer-based models in consumer review analysis.
Patsiouras et al. [
55] worked on Greek political tweets using GreekBERT with data augmentation. They classified tweets into three sentiment classes and reached a 0.83 F1, showing reliability in a domain with subjectivity and class imbalance. Through extensive evaluation with Deep Neural Networks and data augmentation, the study identified strategies for each sentiment category, offering a benchmark for future research in Greek political SA.
Katika et al. [
56] combined topic modeling and SA on Greek tweets about Long COVID. A fine-tuned GreekBERT model reached 94% accuracy and aligned well with manual annotations. The study revealed that domain-tuned models like Greek-BERT can effectively capture public health concerns from social media.
In our study (2025), several BERT-based models were evaluated on topic-specific sentiment classification. GreekBERT performed best, achieving an F1 score of 0.91 in the best case. The analysis confirmed that topic type significantly affects performance, with Class 3 (Psychological/Emotional) being the most predictable. Compared to prior work [
15,
45,
53,
55,
56], our study differs in its focus on a multi-topic domain-specific dataset derived from the Greek educational sector, whereas previous studies predominantly relied on product reviews, tweets, or clinical dialogues. Furthermore, our analysis includes a systematic comparison across topic classes, providing insights into how different themes influence sentiment prediction performance. Despite these methodological differences, our best F1 score (0.91) is comparable to those reported in prior studies, highlighting that BERT-based models can achieve competitive performance even in complex and diversified domains such as education.
This study makes several contributions to the field of educational NLP, specifically in the context of Emergency Remote Teaching. First, this analysis applied transformer-based architectures, with an emphasis on Greek-specific models, enabling the handling of sentiment in a low-resource language. These models were evaluated in a multi-class, domain-specific setting across three manually annotated datasets, offering new insights into performance under educational and linguistic constraints (RQ1). The evaluation further revealed performance differences across stakeholder groups such as parents, school directors, and teachers (RQ2). Second, a TBSA approach was implemented on Modern Greek interview corpora, capturing both the thematic and emotional aspects of the Emergency Remote Teaching experience. The Sentiment Analysis was topic-level, aligned with predefined thematic classes. However, the inclusion of topic context did not improve classification accuracy, as models performed better in sentiment-only settings due to the limited number of topic-specific examples and the data demands of transformer-based models (RQ3). GreekBERT achieved the highest accuracy in identifying both positive and negative sentiment within the domain (RQ4). Third, this study contributes a Greek corpus of manually segmented and labeled interviews, annotated for both sentiment and topic, enabling future research on low-resource, education-oriented Sentiment Analysis. Finally, the findings expose the emotional dynamics experienced by teachers during Emergency Remote Teaching, providing empirical grounding for pedagogical planning and support systems in similar crisis-driven contexts.
5.4. Limitations
While the results of this study are promising, several limitations should be acknowledged. The Greek datasets used reflect authentic feedback from educational stakeholders within the context of ERT. Although the dataset size may seem limited for typical ML benchmarks, it offers unique value, as it comprises real-world domain-specific content collected under natural conditions. The scope of this study remains constrained by the limited size of the available dataset, which reflects the challenges involved in collecting large-scale and high-quality Greek-language data in specialized educational domains. Nonetheless, this type of data ensures rich linguistic context and authentic sentiment expression, which are often absent in large-scale generic corpora.
At the same time, the limited availability of high-quality Greek-language datasets with varying themes and writing styles hinders the broader development and evaluation of general-purpose text-classification models. Another important limitation is the reliance on existing pre-trained models. While models such as GreekBERT have shown strong results, their performance could potentially improve if they were fine-tuned on larger and more thematically varied corpora that better reflect the linguistic patterns and terminology of the Greek educational context.
GreekBERT and similar transformer-based models are typically pre-trained on general-domain corpora like Wikipedia, the European Parliament Proceedings, and OSCAR. As a result, they may struggle to fully grasp the details and terminology specific to the educational sector. Additionally, this study’s findings are inherently linked to the quality and representativeness of the datasets used, which could impact the generalizability of the results. Finally, given the limited availability of Greek-specific language models, further validation with new datasets and future pre-trained Greek transformers will be necessary to establish the robustness and reliability of these findings across broader applications.
6. Conclusions and Future Work
This study explored the application of transformer models for TBSA in Greek-language interview data within an educational context. By focusing on topic-aware sentiment classification, it addressed how model performance varies across different stakeholder groups and thematic content. The results demonstrate that GreekBERT consistently outperformed the other models, particularly in identifying negative sentiments and processing emotionally sensitive content. The introduction of three original datasets and the comparative evaluation of four multilingual and Greek-specific models provided new resources and insights for the field. These findings highlight the value of using context-specific models and carefully designed datasets when analyzing sentiment in low-resource languages. Overall, this study contributes to the development of more effective SA tools for socially and linguistically complex domains.
Future work could explore several directions to build on the findings of this study. One possibility is to expand the dataset by including more interviews, especially from other groups such as students or education policymakers, to capture a wider range of views. Another promising direction is to move beyond simple positive and negative labels by using more detailed sentiment categories, such as neutral or mixed emotions, or even by scoring the intensity of each response. It may also be valuable to examine how sentiment changes throughout the course of an interview, especially in response to emotionally charged topics. Adding contextual details, such as the background of the speaker or the timing and structure of the interview, could help models better understand what influences emotional expression. Finally, future studies might examine how these methods can be adapted for other low-resource languages or used in multilingual settings through transfer learning techniques. The integration of instruction-tuned or adapter-based LLMs is also part of our planned research agenda, to be pursued once current resource limitations are addressed. We also plan to experiment with ensemble methods to improve overall prediction robustness and model stability.