Article

A Comparative Evaluation of Transformer-Based Language Models for Topic-Based Sentiment Analysis

by Spyridon Tzimiris, Stefanos Nikiforos *, Maria Nefeli Nikiforos, Despoina Mouratidis and Katia Lida Kermanidis
Humanistic and Social Informatics Laboratory, Department of Informatics, Ionian University, 49100 Corfu, Greece
* Author to whom correspondence should be addressed.
Electronics 2025, 14(15), 2957; https://doi.org/10.3390/electronics14152957
Submission received: 12 June 2025 / Revised: 16 July 2025 / Accepted: 23 July 2025 / Published: 24 July 2025

Abstract

This research investigates topic-based sentiment classification in Greek education-related data using transformer-based language models. A comparative evaluation is conducted on GreekBERT, XLM-r-Greek, mBERT, and Palobert using three original sentiment-annotated datasets representing parents of students with functional diversity, school directors, and teachers, each capturing diverse educational perspectives. The analysis examines both overall sentiment performance and topic-specific evaluations across four thematic classes: (i) Material and Technical Conditions, (ii) Educational Dimension, (iii) Psychological/Emotional Dimension, and (iv) Learning Difficulties and Emergency Remote Teaching. Results indicate that GreekBERT consistently outperforms other models, achieving the highest overall F1 score (0.91), particularly excelling in negative sentiment detection (F1 = 0.95) and showing robust performance for positive sentiment classification. The Psychological/Emotional Dimension emerged as the most reliably classified category, with GreekBERT and mBERT demonstrating notably high accuracy and F1 scores. Conversely, Learning Difficulties and Emergency Remote Teaching presented significant classification challenges, especially for Palobert. This study contributes significantly to the field of sentiment analysis with Greek-language data by introducing original annotated datasets, pioneering the application of topic-based sentiment analysis within the Greek educational context, and offering a comparative evaluation of transformer models. Additionally, it highlights the superior performance of Greek-pretrained models in capturing emotional detail, and provides empirical evidence of the negative emotional responses toward Emergency Remote Teaching.

1. Introduction

Sentiment Analysis (SA), the computational study of opinions and emotions in text, has become an essential tool for understanding public opinion across diverse domains [1]. It enables researchers and organizations to efficiently measure how people feel about products, services, policies, and events by analyzing large volumes of textual feedback. However, traditional SA typically treats an entire document or sentence as a single unit with one overall sentiment label (e.g., positive, negative, or neutral). This approach can omit important details in texts covering multiple topics. Aspect-Based Sentiment Analysis (ABSA) addresses this limitation by determining the sentiment with respect to specific aspects or topics mentioned within a text [2]. In other words, ABSA provides a fine-grained view of opinions by identifying what exactly each sentiment is about. For instance, a product review might praise the “battery life” of a laptop, but criticize its “screen” quality. While the review could be seen as ambivalent overall, an ABSA would reveal a positive sentiment towards the “battery life” aspect and a negative sentiment towards the “screen” aspect. This level of granularity is crucial for fully capturing complex sentiments. When the relevant aspects correspond to broader themes or subjects, this approach is often referred to as Topic-Based Sentiment Analysis (TBSA) [3,4]. The term “topics” is used in this study instead of “aspects” to more accurately describe the broader thematic categories addressed in the interviews, which involve complex educational dimensions rather than the simple product features typically analyzed in ABSA.
The need for TBSA becomes apparent in contexts where feedback encompasses diverse themes, such as education during crises [5]. A prominent recent example is Emergency Remote Teaching (ERT), the rapid shift to online instruction implemented as a temporary response to emergency situations like the COVID-19 pandemic. Unlike well-planned online courses, ERT was implemented on short notice, leading to a wide range of experiences and reactions from teachers, students, and parents [6]. Each stakeholder’s perspective on ERT touches on multiple aspects of the teaching and learning experience. For instance, a teacher might appreciate the flexibility of teaching from home, yet feel frustrated by reduced student engagement. Similarly, a student might enjoy the comfort of remote attendance, but struggle with technical difficulties. A school director might value the rapid implementation of digital tools during remote teaching, but simultaneously express concern over the lack of preparedness among staff and gaps in student performance. A single overall sentiment score for an interview or survey response in such cases would fail to capture these conflicting feelings. In contrast, a TBSA can identify sentiments tied to each issue (e.g., positive about flexibility, negative about engagement), offering a much richer understanding of the feedback. Analyzing interview transcripts with this fine-grained lens not only highlights which aspects of ERT were viewed positively or negatively, but also helps educators and policymakers identify specific areas of success or concern. However, manually extracting and evaluating sentiments on a per-topic basis from many in-depth interviews is labor-intensive and subject to human bias. This challenge underscores the importance of automated TBSA techniques to systematically and consistently interpret the sentiments expressed on each topic within large collections of textual feedback.
Recent advances in Natural Language Processing (NLP) have made it possible to perform detailed SA with high accuracy [7]. Early approaches often relied on lexicons or traditional Machine Learning (ML) models with manually extracted features, and they struggled with the inherent complexity of natural language [8]. The introduction of the transformer architecture in 2017 paved the way for a new generation of powerful NLP models [9]. BERT (introduced in 2018) and its successors (e.g., RoBERTa) are able to learn rich language representations from massive text corpora and can be fine-tuned for tasks like SA with relatively few data [10]. By capturing context and meaning through self-attention, these transformer-based models have achieved state-of-the-art performance in sentiment classification, often outperforming earlier methods. They can even detect mixed sentiments in a single sentence that contains both praise and criticism, a feat that was challenging for previous approaches. This contextual sensitivity makes transformers particularly suitable for ABSA, where distinguishing sentiments across different topics within the same text is essential.
Transformer models have indeed been applied successfully to aspect-level sentiment classification problems [11]. By incorporating information, such as a given topic or target term, into the model’s input or by using attention mechanisms to focus on relevant portions of text, these models can determine the sentiment specific to each aspect mentioned. For instance, BERT-based architectures have been used for target-dependent SA, where the model learns to associate sentiment-laden words with the correct target entity in a sentence. This means that in a statement such as “The teacher’s feedback was great but the platform was unreliable,” the model can attribute the positive sentiment to “feedback” and the negative sentiment to “platform”. Such transformer-powered ABSA methods consistently outperform earlier approaches based on Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs) [12]. They also eliminate the need for extensive feature engineering or separate aspect-extraction steps, since the model can implicitly learn to attend to different topics and sentiment indicators as part of its training.
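To make this concrete, the following minimal sketch (not part of the present pipeline) encodes a sentence together with its topic as a BERT-style sentence pair using a Hugging Face tokenizer, which is one common way of conditioning a classification head on a given target; the model name, example sentence, and sequence length are illustrative assumptions.

```python
# Illustrative sketch: encoding a sentence together with its topic as a
# BERT-style sentence pair, so a classification head can condition its
# sentiment prediction on the given target. Model name and text are examples.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

sentence = "The teacher's feedback was great but the platform was unreliable."
topic = "platform"  # the target whose sentiment is of interest

# Builds the "[CLS] sentence [SEP] topic [SEP]" input used by BERT-style
# sentence-pair classifiers
encoded = tokenizer(
    sentence,
    topic,
    truncation=True,
    padding="max_length",
    max_length=64,
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # -> torch.Size([1, 64])
```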
Recent research has emphasized the growing importance of developing NLP applications for low-resource languages, highlighting the lack of robust tools and annotated datasets for complex tasks such as SA [13]. As a result, fine-grained SA remains far less explored beyond English. In Greek, for instance, annotated datasets for SA are limited, and only recently have transformer models become available for the language [14]. Despite these limitations, there has been a growing interest in exploring SA for Greek, especially in domain-specific contexts like social media, product reviews, and educational discussions [15]. In the absence of language-specific models, researchers must rely on multilingual models that include Greek in their training (such as mBERT or XLM-RoBERTa). Applying TBSA to Greek educational interviews constitutes a novel and understudied undertaking, especially given the scarcity of prior work using qualitative stakeholder data in Greek. TBSA in the educational domain has its own complexities: the language used by the educational community (teachers, students, parents) when discussing their schooling experiences differs from the language in product reviews or social media posts, which are common sources for training sentiment models. These gaps in language resource availability and domain-specific usage motivate a focused evaluation of state-of-the-art transformer models in this setting.
In this study, a comparative evaluation of transformer language models for TBSA on ERT interview data is presented. Several models, including both multilingual transformers and models pretrained specifically on Greek data, are evaluated to assess how well they capture topic-level sentiments in a low-resource educational context.
Current research addresses the following research questions:
  • RQ1: How effectively can transformer-based language models detect sentiment in education-related qualitative data?
  • RQ2: How does model performance vary across different groups of stakeholders: parents, school directors, and teachers?
  • RQ3: To what extent does the inclusion of topic context influence the accuracy of sentiment detection?
  • RQ4: Which transformer-based model achieves the highest performance in identifying both positive and negative sentiment in this domain?
This study contributes to the field of SA utilizing Greek-language interview data. Specifically, (i) it introduces three original datasets reflecting diverse perspectives from various educational stakeholders, each explicitly annotated for topic and sentiment; (ii) it constitutes the first application of TBSA to interviews conducted within the Greek educational context; (iii) it provides a comparative evaluation of four transformer-based language models, namely GreekBERT, XLM-r-Greek, mBERT, and Palobert, focusing on both overall SA performance and performance within specific topics; (iv) it demonstrates that models pretrained specifically on Greek outperform multilingual counterparts, especially in accurately identifying negative sentiments and interpreting responses containing nuanced emotional cues; and (v) it offers empirical evidence highlighting prevalent negative emotional responses toward ERT, thus informing potential pedagogical interventions and future emotional support frameworks. These contributions offer valuable resources and insights for advancing SA in Greek-language corpora and support the development of more accurate and context-sensitive language models.

2. Related Work

2.1. Emergency Remote Teaching

The outbreak of the COVID-19 pandemic disrupted education on a global scale, prompting over 90% of students worldwide (approximately 1.6 billion learners) to experience school closures [16]. To ensure educational continuity, most institutions transitioned rapidly to ERT, a temporary shift to online instruction implemented as a crisis response rather than a well-planned digital education strategy [6,17]. While ERT enabled the continuation of the learning process, it also exposed systemic gaps in digital readiness, pedagogy, and infrastructure [18,19].
International research highlighted unequal access to digital tools, insufficient training for educators, and widespread social and emotional challenges for students [20,21,22,23]. Particularly in households with limited technological or parental support, the shift often deepened existing educational inequalities [24]. Vulnerable groups, such as students with disabilities and functional diversity, faced significant barriers in participation and support [25,26,27].
In the case of Greece, the transition to ERT was abrupt and largely improvised, as no prior national infrastructure for large-scale distance education existed at the K–12 level [28,29]. The Ministry of Education issued emergency directives and deployed tools such as Cisco Webex, asynchronous platforms like Moodle, and educational television to reach students nationwide [30]. Despite those efforts, teachers reported limited preparedness, lack of pedagogical training, and technical challenges, particularly in remote areas [26,31,32]. Students often lacked devices or stable internet access, and many families struggled to support at-home learning, exacerbating social inequalities [22,24].
Overall, the Greek experience reflected broader international trends, revealing both the adaptive capacity of schools and the urgent need for systematic investment in digital education. ERT served as a temporary solution under crisis, but its limitations emphasized the importance of digital equity, teacher training, and inclusive strategies in future educational planning [19,33]. Given the wide-ranging impact of ERT, collecting and analyzing qualitative data, such as interviews with parents, teachers, and school directors, are essential for understanding the challenges faced during this period. These insights form a valuable basis for SA, helping to reveal key concerns, attitudes, and areas for improvement in future educational planning.

2.2. Sentiment Analysis

SA is a crucial field within NLP that focuses on identifying the sentiments or opinions expressed by individuals through various forms of communication, including interviews. As highlighted by Tan et al. [34], this discipline is becoming increasingly important due to the vast amount of user-generated content available across digital platforms, including in text, images, audio, and video formats. Their survey also emphasizes the diversity of datasets that support SA tasks across different domains and data types, which is essential for advancing research and practical applications.
SA addresses the task of analyzing sentiment conveyed in text, ranging from simple polarity classification (positive, negative, neutral) to more detailed emotion detection and opinion mining. The main types of SA include
  • Document-level SA, which determines the overall sentiment of an entire document;
  • Sentence-level SA, which classifies sentiment at the sentence level;
  • Aspect-Based Sentiment Analysis (ABSA), which identifies sentiment towards specific entities or aspects mentioned in the text;
  • Emotion detection, which goes beyond polarity classification to detect emotions such as happiness, anger, or sadness;
  • Comparative SA, which tracks sentiment trends over time;
  • Multilingual SA, which enables sentiment classification across different languages.
Three main approaches to SA can be distinguished:
  • Lexicon-based methods, which rely on sentiment dictionaries such as SentiWordNet, VADER, and AFINN to assign predefined sentiment scores to words and phrases;
  • ML-based methods, which use supervised and unsupervised learning algorithms like Naïve Bayes (NB), Support Vector Machine (SVM), Long Short-Term Memory (LSTM), and transformer-based models such as BERT and RoBERTa to classify sentiment based on learned patterns;
  • Hybrid approaches, which combine lexicon-based and ML methods to improve performance, particularly in domains with limited labeled data.
Supervised learning-based SA heavily depends on labeled and annotated datasets, where human annotators classify text samples with sentiment labels, enabling models to learn sentiment patterns from examples; popular datasets include IMDB, SST, and Semantic Evaluation (SemEval) for English, while domain-specific datasets exist for finance, healthcare, and social media. Additionally, advanced learning paradigms enhance sentiment classification: zero-shot learning (ZSL) enables models to classify text without any prior labeled examples, leveraging pre-trained models and contextual understanding; few-shot learning (FSL) enhances adaptability by training on a small number of labeled samples, making it effective in low-resource settings; and one-shot learning (OSL) requires only a single labeled example per class to generalize sentiment detection. The adoption of Deep Learning (DL) techniques, particularly transformer models, has significantly improved SA performance across multiple domains, allowing for more accurate and context-aware sentiment classification in real-world applications. SA can be performed using various techniques and methodologies. These approaches, tailored to specific requirements and resources, include lexicon-based methods, ML-based models, and DL-based or ensemble learning frameworks [34].
Lexicon-based SA relies on predefined sentiment lexicons or dictionaries containing words annotated with their sentiment polarity (e.g., positive, negative, neutral). By matching the words in the text to these lexicons, a sentiment score is calculated. A sentiment lexicon has been utilized to analyze Greek tweets and hashtags, potentially yielding notable results [35]. Although simple, this method often struggles with understanding context, negations, or idiomatic expressions.
ML-based SA also uses supervised learning techniques to classify text into sentiment categories based on labeled training data. Common algorithms such as SVM, NB, and Decision Trees (DT) employ features like Bag-of-Words (BoW), Term Frequency–Inverse Document Frequency (TF-IDF), or n-grams. ML models have been applied effectively for Greek sentiment classification, particularly in the context of social media [36]. DL-based SA leverages neural network architectures like CNNs and RNNs, which automatically extract hierarchical features from raw text. These methods excel in capturing complex relationships, such as context and dependencies, that were difficult for traditional ML approaches to handle. State-of-the-art models, including BERT, have demonstrated superior performance in sentiment classification tasks [14,37].
ABSA offers a more granular approach by identifying sentiments related to specific aspects or attributes within a text, such as “price” or “quality” in product reviews [38]. ABSA combines sentiment classification with aspect detection techniques like topic modeling [39] or attention mechanisms [40]. Hybrid methods combine lexicon-based approaches with ML or DL models to mitigate the limitations of individual methodologies. For example, lexicon-derived features can complement feature extraction for ML or DL classifiers, enhancing their performance. These diverse methodologies underscore the adaptability of SA across various tasks and datasets, highlighting its potential for analyzing text in multiple domains and languages.

2.3. Sentiment Analysis in Greek Texts

Modern Greek, a low-resource language, has been of growing interest in SA research [14,36]. Table 1 provides a detailed overview of key studies in SA for the Greek language, including the datasets, methodologies and models used, and main findings and contributions.
Those works illustrate the progression of SA in Greek texts, setting the foundation for more detailed approaches such as TBSA, which is discussed in the following section.

2.4. Topic-Based Sentiment Analysis

TBSA focuses on identifying sentiments associated with broad themes or general topics, while ABSA targets sentiments toward specific attributes or features of entities. As mentioned above, the term “topics” is used in this study instead of “aspects” to more accurately reflect the broader thematic categories addressed in the interviews, which involve complex educational dimensions rather than the specific product features typically analyzed in ABSA. One key distinction between TBSA and ABSA lies in the abstraction level: while ABSA targets sentiment toward concrete entities (e.g., “battery life”), TBSA captures opinions on abstract or systemic topics (e.g., “emotional strain during online learning”). Recent research has also explored cross-domain ABSA by leveraging domain-invariant semantic features to improve generalization across contexts [59]. This makes TBSA more suitable for interpreting rich qualitative data, such as interview transcripts, where sentiment is dispersed across broader themes rather than isolated attributes.
Bordoloi and Biswas [60] presented a comprehensive survey on Sentiment Analysis, underlining the role of hybrid and transformer-based methods, and emphasized the future importance of combining topic modeling with SA across application domains. Moreover, Xiao et al. [61] proposed a novel framework that incorporates both cognitive and aesthetic sentiment causality to improve multimodal ABSA by aligning visual and textual content and reasoning about the underlying emotional drivers of sentiment expressions.
TBSA has been widely applied in the social media domain. TBSA on COVID-19-related tweets was conducted using Latent Dirichlet Allocation (LDA) for topic modeling and several Machine Learning (ML) classifiers, including SVM, Naive Bayes, and Decision Trees, with SVM performing best in capturing topic-specific sentiment trends [62]. A related framework for Twitter data showed that incorporating LDA-based topic information significantly improved classification accuracy compared to traditional models [63]. The combination of BiLSTM networks with LDA further demonstrated the effectiveness of deep learning in modeling topic–sentiment relationships within social media content [64]. In the educational domain, tweets about online learning during the COVID-19 pandemic were classified using LDA and ML methods, with Random Forest achieving the highest topic-level sentiment prediction accuracy [65].
In the political domain, a hybrid TBSA model was employed to predict election outcomes using Twitter data by integrating LDA with both lexicon-based and ML-based Sentiment Analysis, resulting in improved forecasting performance [66].
In public health, TBSA has been used to analyze Polish Facebook comments related to vaccine misinformation, where LDA and Lexicon-Based Sentiment Analysis revealed strong negative sentiment trends tied to specific misinformation topics [67].
TBSA methods commonly combine LDA for topic modeling with ML classifiers for SA. Abdulaziz et al. [62], Ficamos and Liu [63], and Mujahid et al. [65] used LDA with classifiers such as SVM, NB, DT, and RF, showing that TBSA improves performance. Bansal and Srivastava [66] followed a similar approach, integrating LDA with both lexicon-based techniques and ML models for election prediction, while Klimiuk et al. [67] applied LDA with Lexicon-Based SA to analyze sentiment around vaccine misinformation. Pathak et al. [64] extended this line of work by combining LDA with a BiLSTM model, demonstrating the potential of DL methods in capturing topic–sentiment dependencies.
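For readers unfamiliar with this pipeline, the following toy sketch illustrates the generic LDA-plus-classifier TBSA approach used in the works cited above (it is not the approach of the present study); the documents, labels, and topic count are placeholders.

```python
# Generic sketch of the LDA + classifier TBSA pipeline: LDA assigns each
# document a dominant topic, a sentiment classifier is trained separately,
# and predictions are then inspected per topic. Toy data for illustration.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import LinearSVC

docs = [
    "online classes were exhausting and stressful",
    "the platform worked smoothly during the lesson",
    "students struggled with unstable internet connections",
    "asynchronous material was easy to access",
]
sentiments = [0, 1, 0, 1]  # 0 = negative, 1 = positive (toy labels)

# Step 1: unsupervised topic assignment with LDA
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
dominant_topic = lda.fit_transform(counts).argmax(axis=1)

# Step 2: supervised sentiment classification (a single SVM here; per-topic
# models or topic-derived features are common variants in the cited studies)
features = TfidfVectorizer().fit_transform(docs)
clf = LinearSVC().fit(features, sentiments)

# Step 3: inspect sentiment predictions grouped by LDA topic
for topic_id in set(dominant_topic):
    idx = [i for i, t in enumerate(dominant_topic) if t == topic_id]
    preds = clf.predict(features[idx])
    print(f"topic {topic_id}: predicted sentiments {list(preds)}")
```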
A review of the existing literature reveals that the majority of TBSA research focuses on user-generated content, primarily social media posts such as tweets and Facebook comments, as well as product reviews [39]. However, this focus has left other data sources, such as interview transcripts, unexplored, particularly in low-resource languages like Greek. The lack of studies applying TBSA to Greek-language interviews, especially in the educational domain, highlights a significant research gap that this study aims to address.

3. Materials and Methods

The current study builds upon prior work conducted by the same research team in the field of NLP for educational data analysis in Modern Greek. Our previous study focused on ERT and its impact on students with functional diversity during the COVID-19 pandemic [68]. In this work, the research team applied classical ML algorithms, including NB and SVM, to perform topic identification on interview transcripts collected from educational stakeholders. Using TF-IDF representations, the model successfully categorized key thematic challenges related to pedagogical barriers, emotional strain, and technological limitations. That study demonstrated the potential of traditional ML methods for structured classification of qualitative educational discourse in low-resource language contexts like Greek. Building directly on this foundation, our research team extended the methodological framework by incorporating transformer-based architectures for topic classification on a broader interview dataset [69]. That research evaluated both multilingual and language-specific models, namely mBERT, XLM-R Greek, and GreekBERT, across a multi-class classification task aligned with the educational dimensions of ERT. Among the models tested, GreekBERT produced the highest F1 score of 0.76, significantly outperforming both classical ML models and multilingual transformers. These results confirmed the value of domain-specific fine-tuning and contextual embedding for capturing thematic content in Modern Greek educational narratives. The combined results of these prior studies form the conceptual and methodological basis of the present research, serving as a foundation for advancing from topic classification to TBSA of stakeholder perspectives on ERT.

3.1. Methodological Framework

All steps followed in this study, from dataset preparation to model evaluation and result visualization, are summarized in Figure 1.
The SA process was implemented using the Python (version 3.12) programming language, and relied on a combination of PyTorch (version 1.0.2), Hugging Face Transformers (version 4.50.0), and Scikit-learn (version 1.6.1) to build, train, and evaluate classification models on Modern Greek text [70]. The process began by loading and preprocessing a labeled dataset consisting of user-generated text entries annotated with binary sentiment values (0 = negative, 1 = positive). Texts were also annotated with predefined semantic topics, allowing for downstream topic-based performance evaluation. Neutral sentiments, when present, were excluded in order to maintain a binary classification setup. The impact of the neutral sentiment exclusion is analyzed in detail in the next subsection.
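For illustration, the loading and filtering step described above can be sketched as follows; the file name, column names, and label encoding (including the assumed neutral code of 2) are hypothetical placeholders rather than the exact format of our datasets.

```python
# Minimal sketch of dataset loading and neutral-class exclusion, assuming a
# CSV with hypothetical columns "text", "sentiment" (0 = negative,
# 1 = positive, 2 = neutral) and "topic".
import pandas as pd

df = pd.read_csv("interviews_annotated.csv")

# Exclude neutral sentences to keep a binary classification setup
df = df[df["sentiment"] != 2].reset_index(drop=True)

texts = df["text"].tolist()
labels = df["sentiment"].tolist()   # 0 = negative, 1 = positive
topics = df["topic"].tolist()       # used later for topic-based evaluation
```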

3.2. Dataset Collection and Annotation

The current study utilizes qualitative datasets obtained through semi-structured interviews to explore the experiences of ERT among three primary groups: (i) Parents of Students with Functional Diversity, (ii) School Directors, and (iii) Teachers. The participants represent diverse demographics and come from urban, suburban, and rural settings across mainland and island regions, providing a holistic perspective. These datasets, gathered from 12 Parents of Students with Functional Diversity (PSFD dataset), 15 School Directors (SCH dataset), and 15 Teachers (TCH dataset), were meticulously constructed from interviews in the Modern Greek language. All interviews adhered to ethical standards, and participants provided informed consent in accordance with GDPR regulations. The interviews were designed to elicit detailed narratives around participants' lived experiences with ERT. Common recurring expressions included frustration with digital platforms, emotional isolation, and satisfaction with asynchronous material availability. These lexical patterns informed the topic annotation process and justified the application of TBSA over document-level Sentiment Analysis. Datasets and research protocols are available under the Creative Commons Attribution Non-Commercial No Derivatives 4.0 International license (CC BY-NC-ND 4.0).
Each dataset was labeled into four predefined topics: (i) Material and Technical Conditions, (ii) Educational Dimension, (iii) Psychological/Emotional Dimension, and (iv) Learning Difficulties and ERT. The data reflect a multi-class structure and are domain-specific, capturing the detailed experiences of the participants. Each dataset was annotated at the sentence level by two independent annotators, who assigned each sentence to one of the four predefined thematic categories (topics) based on its semantic content. In addition to topic classification, each sentence was also labeled with a sentiment value as positive, negative, or neutral. The first annotator was an internal member of the research team with a thorough understanding of the project and the annotation protocol. The second annotator was external to the research team and had expertise in the field. This annotator followed the provided annotation guidelines to assign both topic and sentiment labels. To ensure consistency and clarity in the process, illustrative examples for each topic and sentiment category were included in the annotation instructions. The annotation process resulted in a high level of agreement between annotators, with an overall agreement rate of 93%. To account for agreement occurring by chance, inter-annotator reliability was further assessed using Cohen's Kappa coefficient, yielding a score of κ = 0.90 [71]. In cases where disagreements occurred, a third experienced annotator with relevant research background was consulted to make the final decision and resolve any ambiguity.
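As an illustration of the agreement computation, the following sketch applies Scikit-learn's cohen_kappa_score to two hypothetical annotator label sequences; the example labels are placeholders and are not drawn from the actual annotations.

```python
# Sketch of chance-corrected inter-annotator agreement with Cohen's kappa;
# the two label lists are hypothetical stand-ins for the annotators' decisions.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["neg", "neg", "pos", "neu", "pos"]
annotator_b = ["neg", "neg", "pos", "neg", "pos"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```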

3.3. Preprocessing and Class Imbalance Handling

Several studies have highlighted that including a neutral sentiment class with very few examples can hinder model performance. More specifically, Valdivia et al. [72] reported that neutral reviews were often omitted due to their ambiguity and lack of information, treating the neutral class as noise to improve binary (positive/negative) classification results. Likewise, Shahzad et al. [73] found that a binary sentiment classifier achieved much higher accuracy than a three-class (positive/neutral/negative) model when neutral examples were underrepresented, indicating better generalization without the neutral class. In small or imbalanced datasets where neutral instances are rare, researchers have observed that standard algorithms struggle with the neutral class; in fact, removing the neutral category can yield significantly improved performance in such cases [74]. These findings suggest that when neutral samples are limited or add ambiguity, focusing on binary sentiment polarity is often more effective. In this study, the neutral sentiment class was excluded due to its low representation and limited contribution to model learning, allowing for a more effective and focused binary sentiment classification. Preliminary experiments including the neutral class resulted in lower overall performance compared to the binary setting, confirming the benefit of its exclusion.
The combined pie charts in Figure 2 illustrate the sentiment distribution in the dataset before and after preprocessing, providing a clear comparison of the data transformation. In the first chart, representing the dataset before preprocessing, three sentiment categories are displayed: positive, negative, and neutral. The neutral sentiment accounted for 9.9% of the dataset, reflecting instances where the sentiment was indeterminate or lacked emotional polarity. Positive and negative sentiments were dominant, constituting 28.0% and 62.1% of the data, respectively, highlighting a significant prevalence of negative sentiment at this stage.
The second chart, representing the dataset after preprocessing, excludes the neutral sentiment category as part of data cleaning, leaving only positive and negative sentiments. This refinement resulted in 31.3% positive and 68.7% negative sentiments, underscoring the dominance of the negative sentiment.
The exclusion of the neutral class not only simplifies the dataset, but also enhances the clarity and focus of the subsequent SA. The change in distribution reflects the dataset’s alignment with binary sentiment classification needs, particularly for tasks leveraging transformer-based models, which typically operate more effectively with distinct sentiment labels. This transformation underscores the critical importance of preprocessing in ensuring that the dataset aligns with the methodological requirements of sentiment classification. Furthermore, the results demonstrate the high proportion of negative sentiments in the data, suggesting potential challenges in classification balance and emphasizing the need for strategies, like class weighting. This preprocessing step is vital for SA and allows for more focused and interpretable results.
For data handling and preprocessing, the Pandas library was used to clean, organize, and split the dataset into training and test subsets using an 80/20 split ratio. Given the small size of our dataset, we adopted a repeated holdout validation strategy with five independent 80/20 train–test splits and averaged the performance across them, rather than using full k-fold cross-validation, which may produce unstable or biased estimates on limited data [75,76]. To support transformer-based training, the Hugging Face AutoTokenizer was employed to tokenize the textual data while respecting maximum sequence lengths and applying truncation and padding. A custom Dataset class, compatible with PyTorch's torch.utils.data.Dataset, was defined to manage the tokenized inputs and associated sentiment labels efficiently. The tokenized data were then passed to the Hugging Face Trainer API for model training.
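The repeated-holdout tokenization setup can be sketched as follows, assuming the texts and labels lists from the loading sketch above; the checkpoint name, random seeds, and maximum sequence length are illustrative choices rather than the exact configuration used.

```python
# Sketch of repeated holdout splitting, tokenization, and a torch Dataset
# wrapper compatible with the Hugging Face Trainer.
import torch
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")

class SentimentDataset(torch.utils.data.Dataset):
    """Wraps tokenized encodings and sentiment labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

# Five independent 80/20 splits (repeated holdout); results are averaged later
splits = []
for seed in range(5):
    tr_x, te_x, tr_y, te_y = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=seed
    )
    tr_enc = tokenizer(tr_x, truncation=True, padding=True, max_length=256)
    te_enc = tokenizer(te_x, truncation=True, padding=True, max_length=256)
    splits.append((SentimentDataset(tr_enc, tr_y), SentimentDataset(te_enc, te_y)))
```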
The interview corpus underwent detailed preprocessing to ensure compatibility with model requirements. Categorical labels for topic classification (TC) and SA were converted to numerical formats. The neutral sentiment category was excluded to emphasize binary sentiment classification.
To address potential class imbalance, class weights were computed dynamically from the training set and incorporated into the training process through the model’s loss function, CrossEntropyLoss. Training arguments were specified using Hugging Face’s TrainingArguments, which allowed for the configuration of hyperparameters such as learning rate, batch size, number of epochs, weight decay, evaluation strategy, and checkpoint saving. During training, the best-performing model, according to evaluation metrics such as F1 score, was automatically selected.
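A minimal sketch of this class-weighted training setup is given below, continuing from the previous sketch; the hyperparameter values, the checkpoint name, and the macro-averaged F1 used for model selection are illustrative assumptions, with the actual settings reported in Table A1.

```python
# Sketch of class weighting and Trainer configuration; values are placeholders.
import numpy as np
import torch
from sklearn.metrics import f1_score
from sklearn.utils.class_weight import compute_class_weight
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

train_ds, test_ds = splits[0]                 # one of the five holdout splits
train_labels = np.array(train_ds.labels)

# Class weights inversely proportional to class frequency in the training split
weights = compute_class_weight("balanced",
                               classes=np.unique(train_labels), y=train_labels)
class_weights = torch.tensor(weights, dtype=torch.float)

class WeightedTrainer(Trainer):
    """Trainer variant that injects class weights into CrossEntropyLoss."""
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=class_weights.to(outputs.logits.device))
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    return {"f1": f1_score(labels, preds, average="macro")}

model = AutoModelForSequenceClassification.from_pretrained(
    "nlpaueb/bert-base-greek-uncased-v1", num_labels=2)

args = TrainingArguments(
    output_dir="out", num_train_epochs=4, per_device_train_batch_size=16,
    learning_rate=2e-5, weight_decay=0.01,
    eval_strategy="epoch",                    # evaluation_strategy on older releases
    save_strategy="epoch", load_best_model_at_end=True,
    metric_for_best_model="f1",
)
trainer = WeightedTrainer(model=model, args=args, train_dataset=train_ds,
                          eval_dataset=test_ds, compute_metrics=compute_metrics)
# trainer.train()  # fine-tunes the model with the weighted loss
```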

3.4. Sentiment and Topic Class Distribution Across Datasets

The PSFD dataset includes 831 sentences, the majority of them (72%) labeled as negative. The most frequent topic is Class 2, followed by Class 3 and Class 4. Regarding the SCH dataset, most sentences are related to Class 1, showing a focus on material and technical conditions. The TCH dataset contains the largest number of sentences overall, with almost equal representation of Class 1 and Class 2. In all datasets, negative sentiment appears more often than positive. This distribution of sentences, summarized in Table 2, reveals important differences in both sentiment and topic focus across the three datasets. A visual summary of the topic distribution across datasets is provided in Figure 3.

3.5. Model Selection & Evaluation

The models selected for this study are state-of-the-art transformer-based architectures derived from BERT, a leading approach in NLP that significantly outperforms conventional ML methods in various text analysis tasks, including SA and text classification [77]. Large Language Models (LLMs) were not used in this study due to the small size and domain-specific nature of our dataset. Prior research has shown that LLMs often yield suboptimal or unstable results when fine-tuned on limited data [78,79], a finding supported by our preliminary experiments with available Greek-compatible LLMs. Models were specifically chosen because they support the Modern Greek language, either exclusively (GreekBERT, PaloBERT) or as part of multilingual training (mBERT, XLM-r-Greek). Greek-specific models such as Greek-BERT [80] and PaloBERT [81], along with their sentiment-specific fine-tuned variants, PaloBERT Sentiment [81] and Greek-BERT Sentiment [82], provide specialized performance due to their pre-training and fine-tuning on Greek text datasets. Additionally, multilingual models based on the XLM-RoBERTa architecture [83] have demonstrated efficiency when specifically fine-tuned for Greek SA, thus combining broad linguistic coverage with task-specific precision. Models like Multilingual BERT (mBERT) [84] and XLM-RoBERTa Base [83], while not fine-tuned for SA, offer a multilingual foundation and can achieve excellent results, upon fine-tuning, on Greek sentiment datasets. The flexibility of these models makes them highly suitable for conducting accurate SA research in Greek, ensuring reliable insights into the sentiments expressed in Greek textual data.
The GreekBERT model is a Greek-specific adaptation of Google's BERT language model [80]. It was trained using Google's official BERT codebase and was subsequently converted for compatibility with PyTorch and TensorFlow using Hugging Face's conversion scripts. The model comprises 12 layers, 768 hidden units, and 12 attention heads, totaling approximately 110 million parameters. Training involved 1 million steps with batch sizes of 256 sequences of length 512, using a learning rate of 1 × 10−4 on Google Cloud TPU v3-8 hardware provided by TensorFlow Research Cloud (TFRC) and GCP research credits.
The XLM-r-Greek model is a specialized Cross-Encoder designed for Greek Natural Language Inference (NLI) tasks and zero-shot classification [85]. Developed jointly by the Hellenic Army Academy and the Technical University of Crete, it utilizes the SentenceTransformers’ Cross-Encoder class. The model was trained on the multilingual AllNLI dataset, incorporating Greek data generated through EN2EL neural machine translation. It outputs classification scores indicating “contradiction,” “entailment,” and “neutral” categories, and is also applicable in zero-shot classification scenarios, assessing the likelihood that sentences belong to provided labels or topics.
PaloBERT is a RoBERTa-based Greek language model specifically trained on social media content [81]. Its training corpus comprised 458,293 documents sourced from various Greek social media accounts. Additionally, a GPT-2 tokenizer was trained from scratch using the same dataset.
These models were selected primarily because they are pretrained specifically on Greek language data, ensuring high linguistic relevance and accuracy for processing Greek text. Specifically, GreekBERT provides high accuracy for topic classification and SA tasks, while XLM-r-Greek demonstrates versatility and efficacy in inference and unsupervised classification scenarios. PaloBERT, trained explicitly on Greek social media content, offers exceptional suitability for SA and information extraction from social media platforms.
All model-specific training parameters are detailed in Table A1. After training, each model was evaluated on the test set. Predictions were generated and compared against the sentiment labels to compute performance metrics using Scikit-learn, specifically the classification report, confusion matrix, and precision/recall/F1-support functions. Additionally, ROC-AUC curves were plotted, while the confusion matrix was visualized using the seaborn heatmap utility [86]. All evaluation metrics, including F1 score, precision, and recall (per class), were exported for documentation and further analysis.
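The evaluation step can be sketched as follows, reusing the trainer and test split from the previous sketches; the plotting details are illustrative and do not reproduce the exact figures reported below.

```python
# Sketch of test-set evaluation: classification report, confusion-matrix
# heatmap, and ROC curve with AUC, assuming `trainer` and `test_ds` exist.
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.special import softmax
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, roc_curve)

pred_output = trainer.predict(test_ds)
logits, y_true = pred_output.predictions, pred_output.label_ids
y_pred = logits.argmax(axis=-1)
y_score = softmax(logits, axis=-1)[:, 1]   # probability of the positive class

print(classification_report(y_true, y_pred,
                            target_names=["negative", "positive"]))

# Confusion matrix heatmap
sns.heatmap(confusion_matrix(y_true, y_pred), annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted"); plt.ylabel("Actual"); plt.show()

# ROC curve and AUC
fpr, tpr, _ = roc_curve(y_true, y_score)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_true, y_score):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False positive rate"); plt.ylabel("True positive rate")
plt.legend(); plt.show()
```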

4. Results

4.1. Model-Based Sentiment Classification Comparison

This subsection presents a comparison of the models used for sentiment classification in each dataset separately, without incorporating topic-based distinctions. The performance of the models on the PSFD dataset is presented in Table 3. GreekBERT achieved the highest overall F1 score (0.79), indicating its superior ability to classify both positive and negative sentiment instances. It performed especially well on the negative class, achieving a precision of 0.87, recall of 0.93, and an F1 score of 0.90. The positive class performance was lower, particularly in terms of recall (0.63), although still leading among the models evaluated. XLM-r-Greek followed, with a balanced performance and an overall F1 score of 0.76. mBERT showed high precision in the positive class (0.86), but suffered from low recall (0.39), resulting in a lower F1 score and a reduced overall score (0.71). Palobert consistently ranked lowest, with the weakest results in both classes and an overall F1 score of 0.65.
The performance results of the BERT-based models on the SCH dataset are summarized in Table 4. GreekBERT achieved the best overall performance with an F1 score of 0.91, showing strong consistency across both sentiment classes. It reached 0.90 F1 score in the positive class and 0.91 in the negative class, with high precision and recall in both. XLM-r-Greek followed with a balanced performance, scoring 0.85 in the positive class and 0.88 in the negative, leading to an overall F1 score of 0.86. mBERT had comparable precision (0.88) in the positive class, but lower recall (0.65), which resulted in an F1 score of 0.75. Its performance in the negative class was stronger (F1 score: 0.84), leading to an overall score of 0.79. Palobert demonstrated the weakest performance among the four models, with lower recall and F1 score values in both classes and an overall F1 score of 0.73.
The performance of the evaluated models on the TCH dataset is reported in Table 5. GreekBERT and XLM-r-Greek both achieved the highest overall F1 score of 0.78. GreekBERT performed particularly well on the negative class, reaching a precision of 0.81, recall of 0.94, and an F1 score of 0.87. Its positive class performance was more modest, with an F1 score of 0.70. XLM-r-Greek showed slightly better balance across classes, scoring 0.72 on the positive and 0.84 on the negative class. mBERT performed moderately, with a positive class F1 score of 0.63 and a negative class score of 0.78, resulting in an overall F1 score of 0.70. Palobert yielded the weakest results, particularly in the positive class, where it reached an F1 score of only 0.52 and an overall score of 0.66. Overall, GreekBERT and XLM-r-Greek led in performance, but with slightly different strengths.
Following model training and evaluation, we assessed the classification performance of the multilingual BERT (mBERT) model on the test set across different semantic topics using the accuracy metric. The model demonstrated the highest accuracy on the “Psychological/Emotional Dimension” topic (88.57%), suggesting strong predictive capacity in capturing affective language and emotion-laden content. The “Educational Dimension” topic followed with an accuracy of 78.33%, while “Learning Difficulties and ERT” achieved 72.34%. The lowest performance was observed in the “Material and Technical Conditions” topic, with an accuracy of 66.67%, indicating greater challenges in detecting sentiment in content related to infrastructure or logistics. These results highlight how sentiment detection performance varies depending on the semantic domain of the text.
Table 6 presents the sentiment classification performance of the four BERT-based models using the combined PSFD, SCH, and TCH datasets as a single input. GreekBERT achieved the best overall performance, with the highest F1 scores in both the positive (0.79) and negative (0.88) classes, as well as the highest overall F1 score (0.84). XLM-r-Greek followed closely, particularly in the negative class, with an F1 score of 0.86. mBERT demonstrated more moderate performance, while Palobert consistently underperformed across all metrics. The results confirm the generalization ability of GreekBERT across datasets and sentiment polarities.
Figure 4 presents a heatmap illustrating the positive, negative, and overall F1 scores achieved by the four BERT-based models with the combined PSFD, SCH, and TCH datasets as one input. Darker shades indicate higher F1 scores. GreekBERT consistently outperformed the other models across all categories, followed closely by XLM-r-Greek. In contrast, PaloBERT showed lower performance, particularly in the positive class. The heatmap provides a clear visual summary of the comparative strengths and weaknesses of each model.
Receiver Operating Characteristic (ROC) curves visualize the trade-off between true positive rate and false positive rate across thresholds. Figure 5 presents the ROC curves for all four evaluated models: GreekBERT, XLM-r-Greek, mBERT, and PaloBERT, based on sentiment classification using the combined PSFD, SCH, and TCH datasets as a single input. The Area Under the Curve (AUC) values provide a reliable measure of each model’s discriminative ability. GreekBERT achieved the highest AUC, confirming its strong ability to distinguish between positive and negative sentiments. XLM-r-Greek also performed competitively, followed by mBERT, while PaloBERT demonstrated the lowest AUC. These findings align with previous metrics such as F1 score, reaffirming GreekBERT’s overall superior classification performance across datasets.

4.2. Topic-Based Sentiment Classification Performance

This subsection presents a comparison of the models used for TBSA, both within each dataset separately and across the combined datasets as one input, focusing on sentiment predictions within topics. Table 7 presents the results of topic-based sentiment classification for the PSFD dataset. The table compares the performance of four BERT-based models, GreekBERT, XLM-r-Greek, mBERT, and Palobert, across four thematic topic classes. Each model’s performance was evaluated using accuracy, F1 score for the negative class, and F1 score for the positive class. Palobert achieved the highest accuracy overall (91.4%) in Class 3 (Psychological/Emotional Dimension), also obtaining the highest F1 score for the negative class (95.2%). The highest F1 score for the positive class (93.3%) was obtained by XLM-r-Greek, also in Class 3. Among the top five metrics across the entire table, Class 3 consistently stands out across models. This suggests that the Psychological/Emotional Dimension is the most predictable category in terms of sentiment classification, regardless of the model used.
Table 8 reports the performance of BERT-based models on topic-based sentiment classification using the SCH dataset. GreekBERT achieved the highest accuracy (91.3%) and F1 score for the negative class (93.0%) in Class 1 (Material and Technical Conditions), along with the highest positive-class F1 score (92.0%) in Class 2. mBERT also performed strongly, especially in Class 4 (Learning Difficulties and ERT), where it achieved an F1 negative score of 91.0% and high overall accuracy (83.33%). The top-performing metrics are concentrated in GreekBERT and mBERT, indicating their robustness across topics. In contrast, Palobert showed lower and less consistent performance across topic classes, particularly in Class 3, where it scored only 33.0% in F1-positive.
Table 9 presents the topic-based sentiment classification results for the TCH dataset. GreekBERT achieved the highest accuracy (92.31%) and F1 negative score (96.00%) in Class 3 (Psychological/Emotional Dimension). mBERT also performed strongly in the same class, reaching an F1 positive score of 88.00%, while XLM-r-Greek and Palobert both scored 85.00% for F1 positive in Class 3. These values rank among the top five metrics across all models and topics. Notably, Class 3 consistently emerged as the most predictive topic, while Class 2 and Class 4 exhibited more moderate results. Palobert and mBERT showed weaker performance in Class 2 (Educational Dimension), particularly in detecting positive sentiment.
Table 10 presents the performance of BERT-based models on topic-based sentiment classification aggregated with the combined datasets as one input. GreekBERT achieved strong results in Class 3 (Psychological/Emotional Dimension), with an F1 negative score of 93.88% and F1 positive score of 87.00%. XLM-r-Greek performed best in Class 2, achieving the highest accuracy (82.73%) and F1 positive score (83.19%). mBERT showed excellent performance in Class 3, obtaining the highest overall accuracy (92.11%) and F1 negative score (95.52%). In contrast, Palobert underperformed across most classes, particularly in Class 4. These results highlight that Class 3 consistently yields high model performance, while Class 4 remains more challenging across all models. Regarding the accuracy of the GreekBERT, XLM-r-Greek, mBERT, and Palobert models across the four defined topic classes, the values represent average accuracy scores aggregated across the PSFD, SCH, and TCH datasets. GreekBERT consistently demonstrates the highest performance across all classes, particularly in Class 3 (Psychological/Emotional Dimension) and Class 2 (Educational Dimension). In contrast, Palobert yields lower accuracy in most classes. All models perform best in Class 3, suggesting that emotionally oriented text may provide more distinct sentiment signals. These trends reinforce the importance of topic-aware evaluation in sentiment classification tasks.
The visualizations further clarify performance trends across topic classes. The heatmaps (Figure 6) reveal that GreekBERT and XLM-r-Greek consistently achieve high F1 scores for positive sentiment, particularly in Class 3 (Psychological/Emotional Dimension) and Class 4 (Learning Difficulties and ERT). In contrast, Palobert shows a substantial decline in Class 3, indicating difficulty in recognizing positive sentiment in emotionally complex content. These trends highlight the influence of both model architecture and topic type on sentiment classification performance. This difficulty in predicting specific classes may also stem from the limited number of annotated examples in those categories, which restricts the models’ ability to generalize effectively.
Figure 6 presents heatmaps illustrating the F1 scores for both positive and negative sentiment classes across all topic classes and BERT-based models. Figure 6a displays the F1 scores for the positive sentiment class, while Figure 6b focuses on the negative sentiment class. The results show that models tend to perform more consistently on negative sentiments, with notably higher scores across all topic classes. GreekBERT and XLM-r-Greek generally exhibit superior performance, especially in Class 3 (Psychological/Emotional Dimension), where GreekBERT achieves an F1 score of 87.00% for the positive class and 93.88% for the negative class. Conversely, performance on Class 1 (Material and Technical Conditions) is lower for most models, particularly for the positive sentiment class. These patterns suggest that sentiment polarity is more distinguishable in emotionally expressive content, such as that found in Class 3, while technical or neutral content poses greater challenges for classification models.

5. Discussion

5.1. Findings

The results from Table 3 highlight the consistent superiority of GreekBERT in the PSFD sentiment classification task. Its strong performance across both classes, and particularly in the negative class, demonstrates its capacity to generalize well to the language patterns present in this dataset. A key observation is that all models performed significantly better on negative sentiment detection compared to positive sentiment. Negative class metrics, especially recall and F1 score, were consistently higher across models. This suggests that negative expressions in the PSFD dataset are more distinct, possibly due to clearer lexical cues or more consistent syntactic patterns, whereas positive expressions may be more subtle or diverse, leading to reduced recall and F1 scores. These findings point to a general challenge in the detection of positive sentiment, indicating the need for further research into data balancing or class-specific optimization strategies.
As shown in Table 4, GreekBERT consistently outperformed the other models in the SCH dataset. Its nearly equal and high scores in both sentiment classes suggest strong generalization ability. Interestingly, unlike the PSFD dataset, where performance on the negative class was noticeably higher, the SCH dataset results are more balanced. GreekBERT, XLM-r-Greek, and even mBERT maintained solid scores for both positive and negative predictions. However, Palobert lagged behind, especially in recall for the positive class, which limited its overall performance. The relatively balanced results in SCH may reflect the dataset’s structure or linguistic clarity across sentiment types. This suggests that the nature of the dataset plays a critical role in sentiment model performance and highlights the consistency of GreekBERT across different textual domains.
The results in Table 5 show that both GreekBERT and XLM-r-Greek are highly effective on the TCH dataset, although they display different behavior across sentiment classes. GreekBERT was particularly strong in detecting negative sentiment, achieving the highest recall and F1 score in that class. However, its lower recall in the positive class (0.60) reduced its balance. In contrast, XLM-r-Greek maintained more consistent performance across both classes, suggesting greater class balance sensitivity. As observed in the PSFD dataset, all models performed better on the negative class than on the positive class, especially in terms of recall. This recurring trend suggests that expressions of negative sentiment may be more linguistically distinct or more consistently annotated across datasets. The TCH results reinforce the conclusion that GreekBERT and XLM-r-Greek are reliable models, although further work may be needed to enhance performance in positive sentiment detection.
These findings have practical implications for educational policy. For instance, the consistent detectability of emotional distress suggests that automated tools could support mental health monitoring in educational settings, helping educators identify emotional challenges in student or teacher narratives. The results in Table 7 reveal that topic class plays a critical role in sentiment classification performance. Notably, Class 3 (Psychological/Emotional Dimension) yielded the highest scores in both accuracy and F1 metrics, across multiple models. This trend suggests that sentiment polarity is more easily distinguishable in emotionally charged content. On the other hand, Class 4 (Learning Difficulties and ERT) showed lower F1 scores for positive sentiment, particularly for Palobert (16.7%), indicating greater difficulty in identifying positive expressions within that context. These findings highlight the need for topic-sensitive evaluation in SA and suggest that model performance cannot be fully understood without considering the thematic content of the text. These topics include less frequently discussed or highly specialized concepts, such as “parallel support teacher” or “attention deficit”, which are less common in general-purpose language model pretraining. Additionally, Learning Difficulties is the third most frequent category out of the four, making it relatively underrepresented in the dataset. This imbalance may limit the model’s exposure to sufficient examples during training. The difficulty in predicting specific classes may also stem from the limited number of annotated examples in those categories, which restricts the model’s ability to generalize effectively. Transformer-based models tend to perform more reliably with larger and more balanced datasets, and performance may be affected when topic-specific data are sparse or linguistically complex.
As shown in Table 8, the SCH dataset reveals clear differences in model behavior across thematic categories. GreekBERT excelled in identifying both positive and negative sentiments in technical and educational contexts (Classes 1 and 2), while mBERT showed notable performance in Class 4, suggesting its strength in detecting sentiment related to learning challenges. The highest F1 scores were observed in more structured or emotionally salient categories, whereas performance dropped in categories with more ambiguity, such as Psychological/Emotional content (Class 3). These results support the view that topic structure and emotional clarity affect sentiment detection and highlight the importance of evaluating model performance through a topic-based lens.
The topic-specific results on the TCH dataset (Table 9) reinforce the trend seen in other datasets: sentiment in psychologically or emotionally loaded topics (Class 3) is more reliably classified by BERT-based models. GreekBERT and mBERT excelled in this class, achieving the highest accuracy and F1 scores across all categories. In contrast, Class 2 (Educational Dimension) was more challenging, particularly for Palobert and mBERT in terms of positive sentiment identification. These findings suggest that model performance is highly sensitive to the thematic domain and support the integration of topic-aware evaluation in SA pipelines.
The F1 score comparison highlights a consistent trend across the BERT-based models. GreekBERT’s strong performance, especially in negative sentiment detection, suggests that it captures polarity cues more effectively in Greek-language content. XLM-r-Greek also performed reliably, indicating consistency across domains. In contrast, PaloBERT’s comparatively poor results may be attributed to limitations in its pretraining or to domain mismatch. One limitation of this study is the relatively small number of annotated interview samples, which may restrict the generalizability of the findings to broader educational populations. Moreover, the exclusive use of binary sentiment labels (positive/negative) excludes neutral responses that may carry valuable nuance, particularly in emotionally mixed or context-sensitive statements. Future work could explore more granular sentiment categories or continuous sentiment scales to better capture the subtleties of stakeholder feedback. The higher F1 scores for negative sentiment compared to positive sentiment across most models reinforce earlier observations about the asymmetry of sentiment expression in the datasets. The heatmap presented in Figure 4 further illustrates these differences, showing that GreekBERT achieved the highest F1 scores for both sentiment classes and overall. These findings highlight the importance of using domain-specific or language-adapted models for sentiment classification tasks in underrepresented languages such as Greek.
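As an illustration of how such a comparison can be visualized, the sketch below rebuilds a Figure 4-style heatmap from the combined-dataset F1 scores in Table 6 using seaborn [86]; the colormap, value range, and title are presentation choices assumed here, not the exact settings of the published figure.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# F1 scores on the combined datasets, as reported in Table 6
scores = pd.DataFrame(
    {"Positive F1": [0.79, 0.76, 0.66, 0.56],
     "Negative F1": [0.88, 0.86, 0.83, 0.76],
     "Overall F1":  [0.84, 0.81, 0.74, 0.66]},
    index=["GreekBERT", "XLM-r-Greek", "mBERT", "PaloBERT"])

# Annotated heatmap: rows are models, columns are sentiment-class F1 scores
ax = sns.heatmap(scores, annot=True, fmt=".2f", cmap="YlGnBu", vmin=0.5, vmax=1.0)
ax.set_title("F1 scores per sentiment class and model (combined datasets)")
plt.tight_layout()
plt.show()
```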
The ROC AUC analysis (Figure 5) reinforces the performance trends observed in precision, recall, and F1 scores. GreekBERT’s consistently high AUC demonstrates effective generalization across datasets, likely due to its pretraining on Greek-specific corpora. XLM-r-Greek also exhibits reliable performance, benefiting from multilingual contextualization. By contrast, mBERT’s moderate AUC suggests less effective representations for Greek, while PaloBERT’s lower score points to difficulties in handling sentiment polarity. These differences may stem from model architecture, training data diversity, or domain mismatch. Although Palobert is pre-trained on Greek social media data, its performance was comparatively low. This can be attributed to a linguistic mismatch: the dataset consists of verbal interview transcripts that differ substantially in style, structure, and register from the informal and often fragmented language typically found in social media. Consequently, the representations learned by Palobert may not generalize effectively to the more formal, discourse-rich nature of the data, which likely contributes to its reduced effectiveness in the classification tasks.
It is also worth noting that ROC AUC values are consistently higher than the corresponding overall F1 scores for each model. This is expected, as AUC measures the model’s ability to discriminate between classes across all classification thresholds, while the F1 score is bound to a single decision point and is more sensitive to class imbalances and specific misclassifications. Including both metrics provides a more comprehensive evaluation: F1 reflects real-world performance at a specific threshold, and AUC reveals the overall classification potential. This distinction further validates GreekBERT’s superior performance, as it maintains high scores in both metrics.
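To make this distinction concrete, the short sketch below computes both metrics with scikit-learn [70] for a toy set of gold labels and predicted positive-class probabilities (illustrative values only, not taken from the experiments): the positive example scored at 0.45 is lost at the conventional 0.5 threshold and lowers F1, yet it barely affects AUC because the overall ranking of positives above negatives remains largely correct.

```python
from sklearn.metrics import f1_score, roc_auc_score

# Illustrative gold labels and predicted probabilities for the positive class
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.80, 0.35, 0.55, 0.45, 0.20, 0.48, 0.90, 0.10]

# F1 is computed at a single decision threshold (here 0.5)
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]
f1 = f1_score(y_true, y_pred)

# ROC AUC aggregates ranking quality over all possible thresholds
auc = roc_auc_score(y_true, y_prob)

print(f"F1 at threshold 0.5: {f1:.2f}")  # ~0.86, penalized by the positive at 0.45
print(f"ROC AUC: {auc:.2f}")             # ~0.94, the ranking is mostly correct
```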
The highest performance was observed when sentiment classification was applied without reference to specific topics, indicating that the models perform better when sentiment is analyzed independently of the thematic categories. In addition, combining the three datasets into a single input led to improved results, as the larger and more diverse set of training examples helped the models generalize more effectively across sentiment classes.

5.2. Error Analysis

This section presents illustrative examples of classification errors related to both sentiment and topic predictions. While many model outputs aligned well with human annotations, several misclassifications reveal limitations in capturing contextual nuances and semantic subtleties.
As shown in Table 11, the model performs well when lexical cues are unambiguous and align with the overall tone or context of the sentence.
The first two examples reflect successful predictions in both the sentiment and topic dimensions, demonstrating the model’s ability to handle clear lexical cues and a consistent tone. The following two misclassified examples are examined to identify the linguistic and contextual factors that may have led to the model’s errors.
  • Example 1: «Δεν σας κρύβω πως πανικοβλήθηκα.» Translation: “I won’t hide from you that I panicked.” This sentence was correctly classified as having negative sentiment, but was misclassified in terms of topic. Although it expresses emotional distress, the model assigned it to the topic of Material and Technical Conditions, likely due to the presence of panic-related vocabulary (the verb «πανικοβλήθηκα», from «πανικός», panic), which may co-occur with technical issues in other examples. This indicates semantic overlap between thematic categories and highlights challenges in fine-grained topic differentiation.
  • Example 2: «Ήταν μια πολύτιμη εμπειρία που μας έδειξε πόσο τεχνολογικά απροετοίμαστο ήταν το σύστημα και πόσο μόνοι μας ήμασταν τελικά.» Translation: “It was a valuable experience that showed us how technologically unprepared the system was and how alone we truly were.” Although the sentence begins with the positive phrase «πολύτιμη εμπειρία» (valuable experience), its overall tone is clearly negative. The model misclassified it as positive sentiment, likely influenced by the surface-level lexical cue. This example highlights the model’s difficulty in recognizing irony, contradictory expressions, or shifts in emotional tone within the same sentence.
The observed misclassifications are primarily attributed to the following:
  • Overlapping contextual features across topic categories.
  • The complexity of emotional expression, particularly in cases involving mixed or implicit sentiments.
  • The influence of misleading keywords that override deeper semantic interpretation.
These examples emphasize the importance of enriching the training dataset with more contextually complex instances in order to improve the model’s sensitivity to subtle linguistic cues and discourse-level meaning.

5.3. Comparison with Relevant Research

While Section 2.3 and Section 2.4 provided a broader overview of relevant studies, the five works selected for Table 12 represent a focused subset chosen for their strong relevance and close alignment with our study in terms of language, model architecture, and task design.
Michailidis [15] applied GreekBERT to product reviews from a Greek e-commerce platform, achieving an F1 of 0.96 and surpassing both traditional and neural baselines. The dataset included binary sentiment annotations and showed the power of BERT fine-tuning even in comparison to Large Language Models like GPT-4. GreekBERT achieved significantly higher classification performance than traditional Machine Learning models in Greek product review Sentiment Analysis, underscoring the advantages of transformer-based architectures for morphologically rich languages.
Chatzimina et al. [45] focused on clinical Greek dialogues, classifying utterances into three sentiment categories. BERT outperformed other transformers (e.g., RoBERTa, XLNet), reaching a macro-F1 of 0.95, highlighting BERT’s capacity to capture emotion in health contexts.
Bilianos [53] evaluated Greek product reviews using GreekBERT with an SVM classifier, reporting an F1 of 0.97 despite a small dataset (480 reviews). This demonstrated BERT’s effectiveness even with minimal data and highlighted the impact of transformer-based models in consumer review analysis.
Patsiouras et al. [55] worked on Greek political tweets using GreekBERT with data augmentation. They classified tweets into three sentiment classes and reached a 0.83 F1, showing reliability in a domain with subjectivity and class imbalance. Through extensive evaluation with Deep Neural Networks and data augmentation, the study identified strategies for each sentiment category, offering a benchmark for future research in Greek political SA.
Katika et al. [56] combined topic modeling and SA on Greek tweets about Long COVID. A fine-tuned GreekBERT model achieved 94% accuracy and aligned well with manual annotations. The study revealed that domain-tuned models like GreekBERT can effectively capture public health concerns from social media.
In our study (2025), several BERT-based models were evaluated on topic-specific sentiment classification. GreekBERT performed best, achieving an F1 score of 0.91 in the best case. The analysis confirmed that topic type significantly affects performance, with Class 3 (Psychological/Emotional) being the most predictable. Compared to prior work [15,45,53,55,56], our study differs in its focus on a multi-topic domain-specific dataset derived from the Greek educational sector, whereas previous studies predominantly relied on product reviews, tweets, or clinical dialogues. Furthermore, our analysis includes a systematic comparison across topic classes, providing insights into how different themes influence sentiment prediction performance. Despite these methodological differences, our best F1 score (0.91) is comparable to those reported in prior studies, highlighting that BERT-based models can achieve competitive performance even in complex and diversified domains such as education.
This study makes several contributions to the field of educational NLP, specifically in the context of Emergency Remote Teaching. First, transformer-based architectures, with an emphasis on Greek-specific models, were applied to sentiment classification in a low-resource language. These models were evaluated in a multi-class, domain-specific setting across three manually annotated datasets, offering new insights into performance under educational and linguistic constraints (RQ1). The evaluation further revealed performance differences across stakeholder groups, namely parents, school directors, and teachers (RQ2). Second, a TBSA approach was implemented on Modern Greek interview corpora, capturing both the thematic and emotional aspects of the Emergency Remote Teaching experience, with Sentiment Analysis performed at topic level and aligned with predefined thematic classes. However, the inclusion of topic context did not improve classification accuracy, as models performed better in sentiment-only settings due to the limited number of topic-specific examples and the data demands of transformer-based models (RQ3). GreekBERT achieved the highest accuracy in identifying both positive and negative sentiment within the domain (RQ4). Third, this study contributes a Greek corpus of manually segmented and labeled interviews, annotated for both sentiment and topic, enabling future research on low-resource, education-oriented Sentiment Analysis. Finally, the findings expose the emotional dynamics experienced by teachers during Emergency Remote Teaching, providing empirical grounding for pedagogical planning and support systems in similar crisis-driven contexts.

5.4. Limitations

While the results of this study are promising, several limitations should be acknowledged. The Greek datasets used reflect authentic feedback from educational stakeholders within the context of ERT. Although their size may seem limited for typical ML benchmarks, they offer unique value, as they comprise real-world, domain-specific content collected under natural conditions; at the same time, this limited size reflects the challenges of collecting large-scale, high-quality Greek-language data in specialized educational domains and constrains the scope of the study. Nonetheless, this type of data ensures rich linguistic context and authentic sentiment expression, which are often absent in large-scale generic corpora.
In addition, the limited availability of high-quality Greek-language datasets with varying themes and writing styles hinders the broader development and evaluation of general-purpose text-classification models. Another important limitation is the reliance on existing pre-trained models. While models such as GreekBERT have shown strong results, their performance could potentially improve if they were fine-tuned on larger and more thematically varied corpora that better reflect the linguistic patterns and terminology of the Greek educational context.
GreekBERT and similar transformer-based models are typically pre-trained on general-domain corpora like Wikipedia, the European Parliament Proceedings, and OSCAR. As a result, they may struggle to fully grasp the details and terminology specific to the educational sector. Additionally, this study’s findings are inherently linked to the quality and representativeness of the datasets used, which could impact the generalizability of the results. Finally, given the limited availability of Greek-specific language models, further validation with new datasets and future pre-trained Greek transformers will be necessary to establish the robustness and reliability of these findings across broader applications.

6. Conclusions and Future Work

This study explored the application of transformer models for TBSA in Greek-language interview data within an educational context. By focusing on topic-aware sentiment classification, it addressed how model performance varies across different stakeholder groups and thematic content. The results demonstrate that GreekBERT consistently outperformed the other models, particularly in identifying negative sentiments and processing emotionally sensitive content. The introduction of three original datasets and the comparative evaluation of four multilingual and Greek-specific models provided new resources and insights for the field. These findings highlight the value of using context-specific models and carefully designed datasets when analyzing sentiment in low-resource languages. Overall, this study contributes to the development of more effective SA tools for socially and linguistically complex domains.
Future work could explore several directions to build on the findings of this study. One possibility is to expand the dataset by including more interviews, especially from other groups such as students or education policymakers, to capture a wider range of views. Another promising direction is to move beyond simple positive and negative labels by using more detailed sentiment categories, such as neutral or mixed emotions, or even scoring the intensity of each response. It may also be valuable to examine how sentiment changes throughout the course of an interview, especially in response to emotionally charged topics. Adding contextual details, such as the background of the speaker or the timing and structure of the interview, could help models better understand what influences emotional expression. Finally, future studies might investigate how these methods can be adapted to other low-resource languages or applied in multilingual settings through transfer learning techniques. The integration of instruction-tuned or adapter-based LLMs is also part of our planned research agenda, to be pursued once current resource limitations are addressed. We also plan to experiment with ensemble methods to improve overall prediction robustness and model stability.

Author Contributions

Conceptualization, S.T.; methodology, D.M., M.N.N., S.N. and S.T.; software, S.T.; validation, S.N.; writing—original draft preparation, S.T.; writing—review and editing, S.N. and M.N.N.; visualization, S.T.; supervision, K.L.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This research required ethical approval and was conducted in accordance with institutional ethical guidelines and approvals. It was approved by the Ionian University Ethics Committee under protocol code 869/07-03-2023, on 7 March 2023. All ethical standards were strictly followed to ensure participant confidentiality and the responsible handling of data throughout the research process.

Informed Consent Statement

Informed consent was obtained from all participants involved in the study. Written informed consent forms outlined the purpose of data collection, the secure storage of personal information, and exclusive research use, in compliance with ethical standards.

Data Availability Statement

The datasets generated and analyzed in this study are publicly available at https://hilab.di.ionio.gr/index.php/en/datasets/ (accessed on 10 May 2025). This repository ensures open access to the data, allowing for researchers and practitioners to explore and build upon the findings presented in this study. The dataset repository includes metadata, data files, and all associated protocols, in compliance with the Creative Commons Attribution Non-Commercial No Derivatives 4.0 International License, accessible at https://creativecommons.org/licenses/by-nc-nd/4.0/ (accessed on 10 May 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Training Parameters by Model.
Model | Parameters
GreekBERT | Learning Rate: 2 × 10⁻⁵; Batch Size: 16; Maximum Sequence Length: 128; Number of Epochs: 10; Weight Decay: 0.05; Loss Function: Weighted CrossEntropyLoss; Evaluation Strategy: Per Epoch; Save Strategy: Per Epoch; Early Stopping: Enabled; Test Size: 0.2
XLM-r-Greek | Learning Rate: 2 × 10⁻⁵; Batch Size: 16; Maximum Sequence Length: 128; Number of Epochs: 10; Weight Decay: 0.05; Loss Function: Weighted CrossEntropyLoss; Evaluation Strategy: Per Epoch; Save Strategy: Per Epoch; Early Stopping: Enabled; Test Size: 0.2
Palobert | Learning Rate: 2 × 10⁻⁵; Batch Size: 16; Maximum Sequence Length: 128; Number of Epochs: 10; Weight Decay: 0.05; Loss Function: Weighted CrossEntropyLoss; Evaluation Strategy: Per Epoch; Save Strategy: Per Epoch; Early Stopping: Enabled; Test Size: 0.2
mBERT | Learning Rate: 2 × 10⁻⁵; Batch Size: 16; Maximum Sequence Length: 128; Number of Epochs: 10; Weight Decay: 0.05; Loss Function: Weighted CrossEntropyLoss; Evaluation Strategy: Per Epoch; Save Strategy: Per Epoch; Early Stopping: Enabled; Test Size: 0.2
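For reference, the snippet below is a minimal sketch of how the parameters in Table A1 could be combined into a single fine-tuning run with the Hugging Face Trainer for GreekBERT; it is an illustration under stated assumptions rather than the exact training script used in this study. The load_sentences() loader and the output directory name are hypothetical, the early-stopping patience value is an assumption (it is not listed in Table A1), and argument names may vary slightly across transformers versions.

```python
import torch
from torch import nn
from sklearn.model_selection import train_test_split
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

MODEL_NAME = "nlpaueb/bert-base-greek-uncased-v1"  # GreekBERT [80]

texts, labels = load_sentences()  # hypothetical loader: sentences and 0/1 labels
train_x, test_x, train_y, test_y = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)  # Test Size: 0.2

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def encode(batch):
    # Maximum Sequence Length: 128
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_ds = Dataset.from_dict({"text": train_x, "label": train_y}).map(encode, batched=True)
test_ds = Dataset.from_dict({"text": test_x, "label": test_y}).map(encode, batched=True)

# Class weights for the weighted cross-entropy loss (inverse label frequency)
counts = torch.bincount(torch.tensor(train_y), minlength=2).float()
class_weights = counts.sum() / (2.0 * counts)

class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fn = nn.CrossEntropyLoss(weight=class_weights.to(outputs.logits.device))
        loss = loss_fn(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

args = TrainingArguments(
    output_dir="greekbert-sentiment",   # hypothetical output directory
    learning_rate=2e-5,                 # Learning Rate: 2 × 10⁻⁵
    per_device_train_batch_size=16,     # Batch Size: 16
    num_train_epochs=10,                # Number of Epochs: 10
    weight_decay=0.05,                  # Weight Decay: 0.05
    evaluation_strategy="epoch",        # 'eval_strategy' in newer library versions
    save_strategy="epoch",
    load_best_model_at_end=True,        # required by the early-stopping callback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = WeightedTrainer(
    model=model, args=args, train_dataset=train_ds, eval_dataset=test_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],  # Early Stopping: Enabled
)
trainer.train()
```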

References

  1. Liu, B. Sentiment Analysis and Opinion Mining; Springer Nature: Berlin/Heidelberg, Germany, 2022. [Google Scholar]
  2. Zhang, W.; Li, X.; Deng, Y.; Bing, L.; Lam, W. A survey on aspect-based sentiment analysis: Tasks, methods, and challenges. IEEE Trans. Knowl. Data Eng. 2022, 35, 11019–11038. [Google Scholar] [CrossRef]
  3. Lin, C.; He, Y. Joint sentiment/topic model for sentiment analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, Hong Kong, China, 2–6 November 2009; pp. 375–384. [Google Scholar]
  4. Xianghua, F.; Guo, L.; Yanyan, G.; Zhiqiang, W. Multi-aspect sentiment analysis for Chinese online social reviews based on topic modeling and HowNet lexicon. Knowl.-Based Syst. 2013, 37, 186–195. [Google Scholar] [CrossRef]
  5. Rodríguez-Ibáñez, M.; Casánez-Ventura, A.; Castejón-Mateos, F.; Cuenca-Jiménez, P. A review on sentiment analysis from social media platforms. Expert Syst. Appl. 2023, 223, 119862. [Google Scholar] [CrossRef]
  6. Hodges, C.; Moore, S.; Lockee, B.; Trust, T.; Bond, A. The difference between emergency remote teaching and online learning. Educ. Rev. 2020, 27, 1–9. [Google Scholar]
  7. Patwardhan, N.; Marrone, S.; Sansone, C. Transformers in the real world: A survey on NLP applications. Information 2023, 14, 242. [Google Scholar] [CrossRef]
  8. Rahali, A.; Akhloufi, M.A. End-to-end transformer-based models in textual-based NLP. AI 2023, 4, 54–110. [Google Scholar] [CrossRef]
  9. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  10. Islam, S.; Elmekki, H.; Elsebai, A.; Bentahar, J.; Drawel, N.; Rjoub, G.; Pedrycz, W. A comprehensive survey on applications of transformers for deep learning tasks. Expert Syst. Appl. 2023, 241, 122666. [Google Scholar] [CrossRef]
  11. Chauhan, G.S.; Nahta, R.; Meena, Y.K.; Gopalani, D. Aspect based sentiment analysis using deep learning approaches: A survey. Comput. Sci. Rev. 2023, 49, 100576. [Google Scholar] [CrossRef]
  12. Rana, M.R.R.; Nawaz, A.; Ali, T.; Alattas, A.S.; AbdElminaam, D.S. Sentiment Analysis of Product Reviews Using Transformer Enhanced 1D-CNN and BiLSTM. Cybern. Inf. Technol. 2024, 24, 112–131. [Google Scholar] [CrossRef]
  13. Pakray, P.; Gelbukh, A.; Bandyopadhyay, S. Natural Language Processing Applications for Low-Resource Languages. Nat. Lang. Process. 2025, 31, 183–197. [Google Scholar] [CrossRef]
  14. Alexandridis, G.; Varlamis, I.; Korovesis, K.; Caridakis, G.; Tsantilas, P. A survey on sentiment analysis and opinion mining in greek social media. Information 2021, 12, 331. [Google Scholar] [CrossRef]
  15. Michailidis, P.D. A Comparative Study of Sentiment Classification Models for Greek Reviews. Big Data Cogn. Comput. 2024, 8, 107. [Google Scholar] [CrossRef]
  16. UNESCO. Education: From School Closure to Recovery. 2020. Available online: https://www.unesco.org/en/covid-19/education-response (accessed on 24 March 2023).
  17. Bozkurt, A.; Jung, I.; Xiao, J.; Vladimirschi, V.; Schuwer, R.; Egorov, G.; Lambert, S.; Al-Freih, M.; Pete, J.; Olcott, D., Jr. A global outlook to the interruption of education due to COVID-19 pandemic: Navigating in a time of uncertainty and crisis. Asian J. Distance Educ. 2020, 15, 1–126. [Google Scholar]
  18. Cone, L.; Brøgger, K.; Berghmans, M.; Decuypere, M.; Förschler, A.; Grimaldi, E.; Hartong, S.; Hillman, T.; Ideland, M.; Landri, P. Pandemic acceleration: COVID-19 and the emergency digitalization of European education. Eur. Educ. Res. J. 2022, 21, 845–868. [Google Scholar] [CrossRef]
  19. Ferri, F.; Grifoni, P.; Guzzo, T. Online learning and emergency remote teaching: Opportunities and challenges in emergency situations. Societies 2020, 10, 86. [Google Scholar] [CrossRef]
  20. Ewing, L.A.; Cooper, H.B. Technology-enabled remote learning during COVID-19: Perspectives of Australian teachers, students and parents. Technol. Pedagog. Educ. 2021, 30, 41–57. [Google Scholar] [CrossRef]
  21. Kelly, P.; Hofbauer, S.; Gross, B. Renegotiating the public good: Responding to the first wave of COVID-19 in England, Germany, and Italy. Eur. Educ. Res. J. 2021, 20, 584–609. [Google Scholar] [CrossRef]
  22. Misirli, O.; Ergulec, F. Emergency remote teaching during the COVID-19 pandemic: Parents’ experiences and perspectives. Educ. Inf. Technol. 2021, 26, 6699–6718. [Google Scholar] [CrossRef] [PubMed]
  23. Lampropoulos, G.; Admiraal, W. Comparing Emergency Remote Learning with Traditional Learning in Primary Education: Primary School Student Perspectives. Open Educ. Stud. 2024, 6, 20220215. [Google Scholar] [CrossRef]
  24. Pihlainen, K.; Turunen, S.; Melasalmi, A.; Koskela, T. Parents’ competence, autonomy, and relatedness in supporting children with special educational needs in emergency remote teaching during COVID-19 lockdown. Eur. J. Spec. Needs Educ. 2023, 38, 704–716. [Google Scholar] [CrossRef]
  25. Fichten, C.; Olenik-Shemesh, D.; Asuncion, J.; Jorgensen, M.; Colwell, C. Higher education, information and communication technologies and students with disabilities: An overview of the current situation. In Improving Accessible Digital Practices in Higher Education; Palgrave Pivot: Cham, Switzerland, 2020; pp. 21–44. [Google Scholar]
  26. Tzimiris, S.; Nikiforos, S.; Kermanidis, K.L. Post-pandemic pedagogy: Emergency remote teaching impact on students with functional diversity. Educ. Inf. Technol. 2023, 28, 10285–10328. [Google Scholar] [CrossRef] [PubMed]
  27. Nikiforos, S.; Anastasopoulou, E.; Pappa, A.; Tzanavaris, S.; Kermanidis, K.L. Motives and barriers in Emergency Remote Teaching: Insights from the Greek experience. Discov. Educ. 2024, 3, 281. [Google Scholar] [CrossRef]
  28. Organisation for Economic Co-operation and Development. Education Policy Outlook: Greece—Snapshot of the Responses to the COVID-19 Crisis. Technical Report, Organisation for Economic Co-operation and Development. 2020. Available online: https://www.oecd.org/education/policy-outlook/country-profile-Greece-2020.pdf (accessed on 26 April 2025).
  29. Miyah, Y.; Benjelloun, M.; Lairini, S.; Lahrichi, A. COVID-19 impact on public health, environment, human psychology, global socioeconomy, and education. Sci. World J. 2022, 2022, 5578284. [Google Scholar] [CrossRef] [PubMed]
  30. Jimoyiannis, A.; Koukis, N.; Tsiotakis, P. Shifting to emergency remote teaching due to the COVID-19 pandemic: An investigation of Greek teachers’ beliefs and experiences. In Technology and Innovation in Learning, Teaching and Education: Second International Conference, TECH-EDU 2020, Vila Real, Portugal, 2–4 December 2020, Proceedings 2; Springer: Berlin/Heidelberg, Germany, 2021; pp. 320–329. [Google Scholar]
  31. Kostas, A.; Paraschou, V.; Spanos, D.; Sofos, A. Emergency Remote Teaching in K-12 Education During COVID-19 Pandemic: A Systematic Review of Empirical Research in Greece. In Research on E-Learning and ICT in Education: Technological, Pedagogical, and Instructional Perspectives; Springer: Berlin/Heidelberg, Germany, 2023; pp. 235–260. [Google Scholar]
  32. Tzimiris, S.; Nikiforos, M.N.; Nikiforos, S.; Kermanidis, K.L. Challenges and Opportunities of Emergency Remote Teaching: Linguistic Analysis on School Directors’ Interviews. Eur. J. Eng. Technol. Res. 2023, 1, 53–60. [Google Scholar] [CrossRef]
  33. Tzimiris, S.; Nikiforos, S.; Kermanidis, K.L. The Effect of Emergency Remote Teaching on Students with Special Educational Needs and/or Disability during the COVID-19 Pandemic: The Parents ‘View. Int. J. Inf. Educ. Technol. 2023, 13, 684–689. [Google Scholar] [CrossRef]
  34. Tan, K.L.; Lee, C.P.; Lim, K.M. A survey of sentiment analysis: Approaches, datasets, and future research. Appl. Sci. 2023, 13, 4550. [Google Scholar] [CrossRef]
  35. Kalamatianos, G.; Mallis, D.; Symeonidis, S.; Arampatzis, A. Sentiment analysis of greek tweets and hashtags using a sentiment lexicon. In Proceedings of the 19th Panhellenic Conference on Informatics, Athens, Greece, 1–3 October 2015; pp. 63–68. [Google Scholar]
  36. Spatiotis, N.; Mporas, I.; Paraskevas, M.; Perikos, I. Sentiment analysis for the Greek language. In Proceedings of the 20th Pan-Hellenic Conference on Informatics, Patras, Greece, 10–12 November 2016; pp. 1–4. [Google Scholar]
  37. Tsakalidis, A.; Papadopoulos, S.; Voskaki, R.; Ioannidou, K.; Boididou, C.; Cristea, A.I.; Liakata, M.; Kompatsiaris, Y. Building and evaluating resources for sentiment analysis in the Greek language. Lang. Resour. Eval. 2018, 52, 1021–1044. [Google Scholar] [CrossRef] [PubMed]
  38. Pontiki, M.; Galanis, D.; Papageorgiou, H.; Androutsopoulos, I.; Manandhar, S.; Al-Smadi, M.; Al-Ayyoub, M.; Zhao, Y.; Qin, B.; De Clercq, O.; et al. Semeval-2016 task 5: Aspect based sentiment analysis. In Proceedings of the International Workshop on Semantic Evaluation, San Diego, CA, USA, 16–17 June 2016; pp. 19–30. [Google Scholar]
  39. Pavlopoulos, I. Aspect Based Sentiment Analysis. Ph.D. Thesis, Athens University of Economics and Business, Athens, Greece, 2014. [Google Scholar]
  40. Wang, Y.; Huang, M.; Zhu, X.; Zhao, L. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–6 November 2016; pp. 606–615. [Google Scholar]
  41. Rokkou, G.; Spatiotis, N.; Triantafyllou, V.; Paraskevas, M. A new approach applying Sentiment Analysis in Greek Language. In Proceedings of the 25th Pan-Hellenic Conference on Informatics, Volos, Greece, 26–28 November 2021; pp. 235–241. [Google Scholar]
  42. Outsios, S.; Karatsalos, C.; Skianis, K.; Vazirgiannis, M. Evaluation of greek word embeddings. arXiv 2019, arXiv:1904.04032. [Google Scholar]
  43. Koutsikakis, J.; Chalkidis, I.; Malakasiotis, P.; Androutsopoulos, I. Greek-bert: The greeks visiting sesame street. In Proceedings of the 11th Hellenic Conference on Artificial Intelligence, Athens, Greece, 2–4 September 2020; pp. 110–117. [Google Scholar]
  44. Karavangeli, E.A.; Pantazi, D.A.; Iliakis, M. DistilGREEK-BERT: A Distilled Version of the GREEK-BERT Model. Bachelor’s Thesis, National and Kapodistrian University of Athens, Department of Informatics and Telecommunications, Athens, Greece, 2023. [Google Scholar]
  45. Chatzimina, M.E.; Papadaki, H.A.; Pontikoglou, C.; Tsiknakis, M. A comparative sentiment analysis of Greek clinical conversations using BERT, RoBERTa, GPT-2, and XLNet. Bioengineering 2024, 11, 521. [Google Scholar] [CrossRef] [PubMed]
  46. Kapoteli, E.; Koukaras, P.; Tjortjis, C. Social media sentiment analysis related to COVID-19 vaccines: Case studies in English and Greek language. In Proceedings of the IFIP International Conference on Artificial Intelligence Applications and Innovations, León, Spain, 14–17 June 2022; pp. 360–372. [Google Scholar]
  47. Markopoulos, G.; Mikros, G.; Iliadi, A.; Liontos, M. Sentiment analysis of hotel reviews in Greek: A comparison of unigram features. In Proceedings of the Cultural Tourism in a Digital Era: First International Conference IACuDiT, Athens, Greece, 30 May–1 June 2014; pp. 373–383. [Google Scholar]
  48. Charalampakis, B.; Spathis, D.; Kouslis, E.; Kermanidis, K. Detecting irony on greek political tweets: A text mining approach. In Proceedings of the 16th International Conference on Engineering Applications of Neural Networks (INNS), Rhodes Island, Greece, 25–28 September 2015; pp. 1–5. [Google Scholar]
  49. Athanasiou, V.; Maragoudakis, M. A novel, gradient boosting framework for sentiment analysis in languages where NLP resources are not plentiful: A case study for modern Greek. Algorithms 2017, 10, 34. [Google Scholar] [CrossRef]
  50. Giatsoglou, M.; Vozalis, M.G.; Diamantaras, K.; Vakali, A.; Sarigiannidis, G.; Chatzisavvas, K.C. Sentiment analysis leveraging emotions and word embeddings. Expert Syst. Appl. 2017, 69, 214–224. [Google Scholar] [CrossRef]
  51. Kotsifakou, K.M.; Sotiropoulos, D.N. Greek political speech classification using BERT. In Proceedings of the 2023 14th International Conference on Information, Intelligence, Systems & Applications (IISA), Volos, Greece, 10–12 July 2023; pp. 1–7. [Google Scholar]
  52. Chantavaridou, E.; Kapidakis, S. Lemmatization and sentiment analysis of Greek political tweets during the pre-election period of 2023. J. Integr. Inf. Manag. 2023, 8, 37–43. [Google Scholar]
  53. Bilianos, D. Experiments in text classification: Analyzing the sentiment of electronic product reviews in greek. J. Quant. Linguist. 2022, 29, 374–386. [Google Scholar] [CrossRef]
  54. Dontaki, C.; Koukaras, P.; Tjortjis, C. Sentiment analysis on english and greek twitter data regarding vaccinations. In Proceedings of the 2023 14th International Conference on Information, Intelligence, Systems & Applications (IISA), Volos, Greece, 10–12 July 2023; pp. 1–8. [Google Scholar]
  55. Patsiouras, E.; Koroni, I.; Mademlis, I.; Pitas, I. Greekpolitics: Sentiment analysis on greek politically charged tweets. In Proceedings of the 2023 31st European Signal Processing Conference (EUSIPCO), Helsinki, Finland, 4–8 September 2023; pp. 1320–1324. [Google Scholar]
  56. Katika, A.; Zoulias, E.; Koufi, V.; Malamateniou, F. Mining Greek tweets on long COVID using sentiment analysis and topic modeling. In Healthcare Transformation with Informatics and Artificial Intelligence; IOS Press: Amsterdam, The Netherlands, 2023; pp. 545–548. [Google Scholar]
  57. Aivatoglou, G.; Fytili, A.; Arampatzis, G.; Zaikis, D.; Stylianou, N.; Vlahavas, I. End-to-end aspect extraction and aspect-based sentiment analysis framework for low-resource languages. In Proceedings of the Intelligent Systems Conference, Amsterdam, The Netherlands, 7–8 September 2023; Springer Nature: Cham, Switzerland, 2023; pp. 841–858. [Google Scholar]
  58. Aivatoglou, G. Aspect-Based Sentiment Analysis in Greek Data. Ph.D. Thesis, Aristotle University of Thessaloniki, Thessaloniki, Greece, 2022. [Google Scholar]
  59. Zhang, B.; Fu, X.; Luo, C.; Ye, Y.; Li, X.; Jing, L. Cross-domain aspect-based sentiment classification by exploiting domain-invariant semantic-primary feature. IEEE Trans. Affect. Comput. 2023, 14, 3106–3119. [Google Scholar] [CrossRef]
  60. Bordoloi, M.; Biswas, S.K. Sentiment analysis: A survey on design framework, applications and future scopes. Artif. Intell. Rev. 2023, 56, 12505–12560. [Google Scholar] [CrossRef] [PubMed]
  61. Xiao, L.; Mao, R.; Zhao, S.; Lin, Q.; Jia, Y.; He, L.; Cambria, E. Exploring Cognitive and Aesthetic Causality for Multimodal Aspect-Based Sentiment Analysis. IEEE Trans. Affect. Comput. 2025. Available online: https://arxiv.org/pdf/2504.15848 (accessed on 23 April 2025). [CrossRef]
  62. Abdulaziz, M.; Alotaibi, A.; Alsolamy, M.; Alabbas, A. Topic Based Sentiment Analysis for COVID-19 Tweets. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 443–449. [Google Scholar] [CrossRef]
  63. Ficamos, P.; Liu, Y. A Topic Based Approach for Sentiment Analysis on Twitter Data. Int. J. Adv. Comput. Sci. Appl. 2016, 7, 12. [Google Scholar] [CrossRef]
  64. Pathak, A.R.; Pandey, M.; Rautaray, S. Topic-level sentiment analysis of social media data using deep learning. Appl. Soft Comput. 2021, 108, 107440. [Google Scholar] [CrossRef]
  65. Mujahid, M.; Lee, E.; Rustam, F.; Washington, P.B.; Ullah, S.; Reshi, A.A.; Ashraf, I. Sentiment analysis and topic modeling on tweets about online education during COVID-19. Appl. Sci. 2021, 11, 8438. [Google Scholar] [CrossRef]
  66. Bansal, B.; Srivastava, S. On predicting elections with hybrid topic based sentiment analysis of tweets. Procedia Comput. Sci. 2018, 135, 346–353. [Google Scholar] [CrossRef]
  67. Klimiuk, K.; Czoska, A.; Biernacka, K.; Balwicki, Ł. Vaccine misinformation on social media–topic-based content and sentiment analysis of Polish vaccine-deniers’ comments on Facebook. Hum. Vaccines Immunother. 2021, 17, 2026–2035. [Google Scholar] [CrossRef] [PubMed]
  68. Kermanidis, K.L.; Tzimiris, S.; Nikiforos, S.; Nikiforos, M.N.; Mouratidis, D. ICT Adoption in Education: Unveiling Emergency Remote Teaching Challenges for Students with Functional Diversity Through Topic Identification in Modern Greek Data. Appl. Sci. 2025, 15, 4667. [Google Scholar] [CrossRef]
  69. Tzimiris, S.; Nikiforos, S.; Nikiforos, M.N.; Mouratidis, D.; Kermanidis, K.L. Topic Classification of Interviews on Emergency Remote Teaching. Information 2025, 16, 253. [Google Scholar] [CrossRef]
  70. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  71. Landis, J.R.; Koch, G.G. An Application of Hierarchical Kappa-Type Statistics in the Assessment of Majority Agreement among Multiple Observers. Biometrics 1977, 33, 363–374. [Google Scholar] [CrossRef] [PubMed]
  72. Valdivia, A.; Luzón, M.V.; Cambria, E.; Herrera, F. Consensus vote models for detecting and filtering neutrality in sentiment analysis. Inf. Fusion 2018, 44, 126–135. [Google Scholar] [CrossRef]
  73. Shahzad, M.; Freeman, C.; Rahimi, M.; Alhoori, H. Predicting Facebook sentiments towards research. Nat. Lang. Process. J. 2023, 3, 100010. [Google Scholar] [CrossRef]
  74. Lazrig, I.; Humpherys, S.L. Using Machine Learning Sentiment Analysis to Evaluate Learning Impact. Inf. Syst. Educ. J. 2022, 20, 13–21. [Google Scholar]
  75. Mosbach, M.; Andriushchenko, M.; Klakow, D. On the stability of fine-tuning bert: Misconceptions, explanations, and strong baselines. arXiv 2020, arXiv:2006.04884. [Google Scholar]
  76. Vabalas, A.; Gowen, E.; Poliakoff, E.; Casson, A.J. Machine learning algorithm validation with a limited sample size. PLoS ONE 2019, 14, e0224365. [Google Scholar] [CrossRef] [PubMed]
  77. Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  78. Zhang, T.; Wu, F.; Katiyar, A.; Weinberger, K.Q.; Artzi, Y. Revisiting few-sample BERT fine-tuning. arXiv 2020, arXiv:2006.05987. [Google Scholar]
  79. Tinn, R.; Cheng, H.; Gu, Y.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; Poon, H. Fine-tuning large neural language models for biomedical natural language processing. Patterns 2023, 4, 100729. [Google Scholar] [CrossRef] [PubMed]
  80. NLPAUEB. Greek-BERT: Bert-Base-Greek-Uncased-v1. 2021. Available online: https://huggingface.co/nlpaueb/bert-base-greek-uncased-v1 (accessed on 22 March 2025).
  81. PaloServices. PaloBERT: Palobert-Base-Greek-Social-Media-Sentiment-v2. 2022. Available online: https://huggingface.co/pchatz/palobert-base-greek-social-media-sentiment-v2 (accessed on 22 March 2024).
  82. Karampatakis, S. Greek Sentiment Analysis. 2022. Available online: https://huggingface.co/sotiris-karampatakis/greek-sentiment-analysis (accessed on 22 March 2025).
  83. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. 2020. Available online: https://huggingface.co/xlm-roberta-base (accessed on 22 March 2024).
  84. Google Research. Multilingual BERT. 2020. Available online: https://huggingface.co/bert-base-multilingual-cased (accessed on 22 March 2025).
  85. Hugging Face. NLI-XLM-R-Greek. 2023. Available online: https://huggingface.co/lighteternal/nli-xlm-r-greek (accessed on 25 March 2025).
  86. Waskom, M.L. Seaborn: Statistical data visualization. J. Open Source Softw. 2021, 6, 3021. [Google Scholar] [CrossRef]
Figure 1. Pipeline of the sentiment classification process.
Figure 2. Sentiment distribution before and after preprocessing. The left chart shows the distribution of sentiments before preprocessing, including positive, negative, and neutral. The right chart represents the distribution after preprocessing, where the neutral sentiment has been removed.
Figure 3. Pie chart showing the distribution of topic classes across all datasets.
Figure 4. Heatmap illustrating the positive, negative, and overall F1 scores of the four BERT-based models on the combined datasets.
Figure 5. ROC curve comparison of all BERT-based models on the combined datasets.
Figure 6. F1 score heatmaps showing sentiment classification performance across topic classes and models.
Table 1. Summary of Greek Sentiment Analysis and ABSA studies.
Paper | Dataset | Method/Model | Findings/Contribution
Tsakalidis et al. [37] | Greek tweets | Sentiment lexicon & annotated corpus | Foundational resources for Greek SA & benchmarking ML models
Pontiki et al. [38] | SemEval 2016 ABSA dataset | Multilingual ABSA task using ML & lexicons | Greek part included aspect annotations; standard benchmark for ABSA
Michailidis [15] | Greek product reviews | Comparative analysis of GreekBERT vs. traditional ML | GreekBERT significantly outperformed traditional classifiers in sentiment classification tasks on Greek language reviews
Pavlopoulos [39] | Restaurant, hotel & laptop reviews | 3-stage ABSA pipeline with rule-based & supervised ML | Benchmark datasets for ATE, aggregation, polarity estimation; unsupervised ATE methods using word vectors; SEMEVAL tasks with competitive results
Rokkou et al. [41] | Greek corpus (unspecified) | Tensor decomposition + ML classifiers (J48, k-NN) | Effective SA using multidimensional analysis
Outsios et al. [42] | Greek Web corpus, Greek Word Analogy, WordSim353 datasets | Evaluation of Greek word2vec in SA | Better embeddings led to improved classifier performance
Koutsikakis et al. [43] | GreekBERT pretrained corpus | GreekBERT Transformer model | Improved downstream NLP performance; foundation for Greek SA
Karavangeli et al. [44] | Large-scale Greek corpus | Distilled BERT for Greek with fewer parameters | Competitive performance for Greek SA with faster inference
Chatzimina et al. [45] | Greek clinical dialogues | Compared BERT, RoBERTa, GPT-2, XLNet | GreekBERT & mBERT outperformed others on healthcare SA
Kapoteli et al. [46] | Greek COVID-19 tweets | Fine-tuned GreekBERT for vaccine sentiment | Strong performance on pandemic-related SA
Markopoulos et al. [47] | 1800 TripAdvisor reviews | SVM with TF–IDF | High accuracy (95.8%) in binary classification of reviews
Charalampakis et al. [48] | 126 political tweets | NB, RF, functional trees for irony detection | Best model achieved 82.4% accuracy; political SA exploration
Athanasiou and Maragoudakis [49] | 740 Proto Thema comments | Gradient Boosting + bilingual features | Hybrid English–Greek features outperformed baseline ML
Giatsoglou et al. [50] | MOBILE-PAR and MOBILE-SEN Greek mobile product reviews | Word2Vec + lexicon + SVM | 83.6% accuracy; improved performance using hybrid features
Alexandridis et al. [14] | 44k social media posts | Fine-tuned GreekBERT and GPT-2 | Binary SA accuracy ≈ 99%, three-class ≈ 80%; showed transformer advantages
Kotsifakou and Sotiropoulos [51] | Greek parliamentary speeches | Fine-tuned BERT | Outperformed traditional ML in thematic classification
Chantavaridou and Kapidakis [52] | Greek political tweets | ML with/without lemmatization | Lemmatization improved SA accuracy for morphologically rich Greek
Bilianos [53] | Skroutz reviews | GreekBERT vs. NB/SVM | GreekBERT achieved 97% accuracy, outperforming traditional methods
Dontaki et al. [54] | 61,109 vaccine tweets | ML models + TextBlob features | Decision Tree with TextBlob achieved 99.97% accuracy
Patsiouras et al. [55] | GreekPolitics dataset | GreekBERT for polarity/aggressiveness | 88% accuracy in three-way sentiment + political dimensions
Katika et al. [56] | Greek Long COVID tweets | GreekBERT + topic modeling | Achieved 94% sentiment accuracy; public health insight
Aivatoglou et al. [57] | Greek ABSA corpus | End-to-end multilingual ABSA with BERT | High precision/recall; tailored for low-resource languages
Aivatoglou [58] | Greek social media ABSA data | BERT pipeline for aspects and sentiments | Adapted ABSA to Greek sociocultural context; domain-specific results
Table 2. Distribution of sentiment labels and topics (classes) across datasets.
Label/Class | PSFD | SCH | TCH
Positive | 230 | 343 | 339
Negative | 601 | 573 | 806
Class 1 | 149 | 589 | 363
Class 2 | 330 | 203 | 366
Class 3 | 197 | 62 | 140
Class 4 | 155 | 62 | 276
Number of sentences labeled as positive or negative, and their distribution across the four classes: Class 1 = Material and Technical Conditions, Class 2 = Educational Dimension, Class 3 = Psychological/Emotional Dimension, Class 4 = Learning Difficulties and Emergency Remote Teaching.
Table 3. Performance of models on the PSFD dataset.
Model | Positive Class (P / R / F1) | Negative Class (P / R / F1) | Overall F1
GreekBERT | 0.76 / 0.63 / 0.69 | 0.87 / 0.93 / 0.90 | 0.79
XLM-r-Greek | 0.65 / 0.67 / 0.66 | 0.87 / 0.86 / 0.87 | 0.76
mBERT | 0.86 / 0.39 / 0.54 | 0.81 / 0.98 / 0.88 | 0.71
Palobert | 0.64 / 0.35 / 0.45 | 0.79 / 0.93 / 0.85 | 0.65
Bold values indicate the best performance across the models.
Table 4. Performance of models on the SCH dataset.
Model | Positive Class (P / R / F1) | Negative Class (P / R / F1) | Overall F1
GreekBERT | 0.89 / 0.90 / 0.90 | 0.92 / 0.91 / 0.91 | 0.91
XLM-r-Greek | 0.85 / 0.84 / 0.85 | 0.87 / 0.88 / 0.88 | 0.86
mBERT | 0.88 / 0.65 / 0.75 | 0.76 / 0.93 / 0.84 | 0.79
Palobert | 0.77 / 0.60 / 0.67 | 0.72 / 0.85 / 0.78 | 0.73
Bold values indicate the best performance across the models.
Table 5. Performance of models on the TCH dataset.
Model | Positive Class (P / R / F1) | Negative Class (P / R / F1) | Overall F1
GreekBERT | 0.84 / 0.60 / 0.70 | 0.81 / 0.94 / 0.87 | 0.78
XLM-r-Greek | 0.73 / 0.71 / 0.72 | 0.84 / 0.85 / 0.84 | 0.78
mBERT | 0.60 / 0.66 / 0.63 | 0.80 / 0.76 / 0.78 | 0.70
Palobert | 0.66 / 0.43 / 0.52 | 0.73 / 0.88 / 0.80 | 0.66
Bold values indicate the best performance across the models.
Table 6. Sentiment classification performance of BERT-based models on the combined datasets.
Model | Positive Class (P / R / F1) | Negative Class (P / R / F1) | Overall F1 | AUC
GreekBERT | 0.76 / 0.83 / 0.79 | 0.90 / 0.85 / 0.88 | 0.84 | 0.91
XLM-r-Greek | 0.72 / 0.81 / 0.76 | 0.89 / 0.82 / 0.86 | 0.81 | 0.86
mBERT | 0.69 / 0.63 / 0.66 | 0.81 / 0.85 / 0.83 | 0.74 | 0.81
PaloBERT | 0.55 / 0.57 / 0.56 | 0.76 / 0.75 / 0.76 | 0.66 | 0.71
Bold values indicate the best performance across the models.
Table 7. Performance of models on the PSFD dataset across topic classes (topic-based sentiment classification).
Model | Topic Class | Accuracy (%) | F1 Negative (%) | F1 Positive (%)
GreekBERT | Class 1 | 75.0 | 76.9 | 72.7
GreekBERT | Class 2 | 85.0 | 88.3 | 79.1
GreekBERT | Class 3 | 85.7 | 91.8 | 74.4
GreekBERT | Class 4 | 85.1 | 90.9 | 68.8
XLM-r-Greek | Class 1 | 83.3 | 80.0 | 85.7
XLM-r-Greek | Class 2 | 78.3 | 71.1 | 82.7
XLM-r-Greek | Class 3 | 88.6 | 60.0 | 93.3
XLM-r-Greek | Class 4 | 76.6 | 42.1 | 85.3
mBERT | Class 1 | 81.6 | 88.5 | 50.0
mBERT | Class 2 | 79.3 | 88.0 | 60.9
mBERT | Class 3 | 81.7 | 86.7 | 56.5
mBERT | Class 4 | 82.9 | 89.2 | 61.5
Palobert | Class 1 | 83.3 | 86.7 | 77.8
Palobert | Class 2 | 63.3 | 74.4 | 55.3
Palobert | Class 3 | 91.4 | 95.2 | 57.1
Palobert | Class 4 | 78.7 | 87.8 | 76.7
Class 1: Material and Technical Conditions; Class 2: Educational Dimension; Class 3: Psychological/Emotional Dimension; Class 4: Learning Difficulties and ERT. Bold values indicate the highest scores.
Table 8. Performance of models on the SCH dataset across topic classes (topic-based sentiment classification).
Model | Topic Class | Accuracy (%) | F1 Negative (%) | F1 Positive (%)
GreekBERT | Class 1 | 91.3 | 93.0 | 88.0
GreekBERT | Class 2 | 89.2 | 82.0 | 92.0
GreekBERT | Class 3 | 84.6 | 90.0 | 67.0
GreekBERT | Class 4 | 75.0 | 78.0 | 60.0
XLM-r-Greek | Class 1 | 83.0 | 85.0 | 78.0
XLM-r-Greek | Class 2 | 71.0 | 84.0 | 71.0
XLM-r-Greek | Class 3 | 76.0 | 88.0 | 74.0
XLM-r-Greek | Class 4 | 74.0 | 72.0 | 80.0
mBERT | Class 1 | 81.75 | 71.00 | 87.00
mBERT | Class 2 | 72.97 | 78.00 | 64.00
mBERT | Class 3 | 84.62 | 67.00 | 90.00
mBERT | Class 4 | 83.33 | 91.00 | 79.00
Palobert | Class 1 | 72.22 | 80.00 | 57.00
Palobert | Class 2 | 81.08 | 72.00 | 86.00
Palobert | Class 3 | 69.23 | 80.00 | 73.00
Palobert | Class 4 | 66.67 | 70.00 | 80.00
Class 1: Material and Technical Conditions; Class 2: Educational Dimension; Class 3: Psychological/Emotional Dimension; Class 4: Learning Difficulties and ERT. Bold values indicate the highest scores.
Table 9. Performance of models on the TCH dataset across topic classes (topic-based sentiment classification).
Model | Topic Class | Accuracy (%) | F1 Negative (%) | F1 Positive (%)
GreekBERT | Class 1 | 73.91 | 78.57 | 66.67
GreekBERT | Class 2 | 80.95 | 86.05 | 70.00
GreekBERT | Class 3 | 92.31 | 96.00 | 80.00
GreekBERT | Class 4 | 85.92 | 89.80 | 77.27
XLM-r-Greek | Class 1 | 72.46 | 75.32 | 68.85
XLM-r-Greek | Class 2 | 79.37 | 83.12 | 73.47
XLM-r-Greek | Class 3 | 84.62 | 91.65 | 85.00
XLM-r-Greek | Class 4 | 85.92 | 89.36 | 79.17
mBERT | Class 1 | 60.87 | 64.00 | 57.14
mBERT | Class 2 | 69.84 | 74.67 | 62.75
mBERT | Class 3 | 88.46 | 93.88 | 88.00
mBERT | Class 4 | 78.87 | 82.76 | 72.73
Palobert | Class 1 | 69.57 | 74.70 | 61.82
Palobert | Class 2 | 66.67 | 76.40 | 63.24
Palobert | Class 3 | 84.62 | 91.67 | 85.00
Palobert | Class 4 | 73.24 | 81.55 | 61.28
Class 1: Material and Technical Conditions; Class 2: Educational Dimension; Class 3: Psychological/Emotional Dimension; Class 4: Learning Difficulties and ERT. Bold values indicate the highest scores.
Table 10. Performance of BERT-based models across all topic classes in all datasets.
Model | Topic Class | Accuracy (%) | F1 Negative (%) | F1 Positive (%)
GreekBERT | Class 1 | 60.87 | 64.00 | 57.14
GreekBERT | Class 2 | 69.84 | 74.67 | 62.75
GreekBERT | Class 3 | 88.46 | 93.88 | 87.00
GreekBERT | Class 4 | 78.87 | 82.76 | 72.73
XLM-r-Greek | Class 1 | 81.49 | 86.52 | 70.49
XLM-r-Greek | Class 2 | 82.73 | 82.24 | 83.19
XLM-r-Greek | Class 3 | 89.47 | 93.94 | 60.00
XLM-r-Greek | Class 4 | 77.50 | 68.18 | 86.96
mBERT | Class 1 | 79.69 | 85.92 | 63.59
mBERT | Class 2 | 69.09 | 71.19 | 66.67
mBERT | Class 3 | 92.11 | 95.52 | 66.67
mBERT | Class 4 | 60.00 | 61.11 | 74.19
PaloBERT | Class 1 | 70.44 | 78.50 | 52.67
PaloBERT | Class 2 | 66.36 | 68.91 | 63.37
PaloBERT | Class 3 | 78.95 | 87.50 | 63.33
PaloBERT | Class 4 | 47.50 | 55.04 | 64.41
Class 1: Material and Technical Conditions; Class 2: Educational Dimension; Class 3: Psychological/Emotional Dimension; Class 4: Learning Difficulties and ERT. Bold values indicate the highest scores.
Table 11. Examples of correct and incorrect predictions.
Text | Actual Sentiment | Predicted Sentiment | Actual Topic | Predicted Topic
Aισθανόμουν μεγάλο φόρτο. (I felt a heavy burden.) | Negative | Negative | Class 3 | Class 3
Υπήρξε μεγάλη αλληλοβοήθεια για τα τεχνικά και συνεργασία από όλους. (There was great mutual help with the technical aspects and collaboration from everyone.) | Positive | Positive | Class 1 | Class 1
Δεν σας κρύβω πως πανικοβλήθηκα. (I won’t hide from you that I panicked.) | Negative | Negative | Class 3 | Class 1
Ήταν μια πολύτιμη εμπειρία που μας έδειξε πόσο τεχνολογικά απροετοίμαστο ήταν το σύστημα και πόσο μόνοι μας ήμασταν τελικά. (It was a valuable experience that showed us how technologically unprepared the system was and how alone we truly were.) | Negative | Positive | Class 1 | Class 1
Green color indicates correct model predictions, while red highlights incorrect ones. Class 1: Material and Technical Conditions; Class 2: Educational Dimension; Class 3: Psychological/Emotional Dimension.
Table 12. Comparison with relevant research.
Research | Data | Classes | Model | Results
Michailidis [15] | Greek product reviews (e-commerce) | 2 (Pos/Neg) | GreekBERT | F1 = 0.96
Chatzimina et al. [45] | Greek clinical conversations | 3 (Pos/Neu/Neg) | BERT | F1 = 0.95
Bilianos [53] | Skroutz reviews (480 entries) | 2 (Pos/Neg) | GreekBERT+SVM | F1 = 0.97
Patsiouras et al. [55] | GreekPolitics tweets (2.5k tweets) | 3 (Pos/Neu/Neg) | GreekBERT | F1 = 0.83
Katika et al. [56] | Greek Long COVID tweets (937 tweets) | 2 (Pos/Neg) | GreekBERT | Accuracy = 0.94
Our Study (2025) | Greek Interviews | 2 (Pos/Neg) | GreekBERT | F1 = 0.91