Article

LLMs in Education: Evaluating GPT and BERT Models in Student Comment Classification

Departamento de Ingeniería de Sistemas Telemáticos, Escuela Técnica Superior de Ingenieros de Telecomunicación, Universidad Politécnica de Madrid, 28040 Madrid, Spain
*
Author to whom correspondence should be addressed.
Multimodal Technol. Interact. 2025, 9(5), 44; https://doi.org/10.3390/mti9050044
Submission received: 17 March 2025 / Revised: 5 May 2025 / Accepted: 9 May 2025 / Published: 12 May 2025

Abstract

The incorporation of artificial intelligence in educational contexts has significantly transformed the support provided to students facing learning difficulties, facilitating both the management of their educational process and their emotions. Additionally, online comments play a vital role in understanding student feelings. Analyzing comments on social media platforms can help identify students in vulnerable situations so that timely interventions can be implemented. However, manually analyzing student-generated content on social media platforms is challenging due to the large amount of data and the frequency with which it is posted. In this sense, the recent revolution in artificial intelligence, marked by the implementation of powerful large language models (LLMs), may contribute to the classification of student comments. This study compared the effectiveness of a supervised learning approach using five different LLMs: bert-base-uncased, roberta-base, gpt-4o-mini-2024-07-18, gpt-3.5-turbo-0125, and gpt-neo-125m. The evaluation was carried out after fine-tuning them specifically to classify student comments on social media platforms with anxiety/depression or neutral labels. The results obtained were as follows: gpt-4o-mini-2024-07-18 and gpt-3.5-turbo-0125 obtained 98.93%, roberta-base 98.14%, bert-base-uncased 97.13%, and gpt-neo-125m 96.43%. Therefore, when comparing the effectiveness of these models, it was determined that all LLMs performed well in this classification task.

1. Introduction

Artificial intelligence (AI) enhances accessibility and learning experiences by reducing educational barriers, fostering inclusive environments, and promoting academic success. It also enables educators to personalize study plans, provide immediate feedback, and support students’ academic and emotional development [1].
Social media applications help users around the world communicate and are spaces for sharing ideas, information, knowledge, and other data [2]. Some of the most widely used social media applications are Facebook, Twitter, and Instagram [3]. Nowadays, it is possible to extract opinions, comments, or reviews left by students on them, data that are very useful for identifying emotions.
In the educational field, students’ emotions play a crucial role in the learning process because they can enhance or undermine students’ ability to learn or remember what they have learned [4]. Furthermore, one in seven young people between the ages of 10 and 19 suffers from a mental disorder, with depression and anxiety being the main causes [5]. Therefore, early identification of feelings of anxiety or depression could contribute to students’ emotional well-being, academic success, and overall development as individuals [6].
One of the tasks of natural language processing (NLP) is text classification, which consists of the categorization and organization of texts [7]. Therefore, students’ comments on social media platforms can be categorized. However, classifying every review left by students on social media platforms with a large user base is a difficult, time-consuming, and expensive task.
Advanced machine learning models have been applied for text classification tasks, such as logistic regression [8], transfer learning models [9], and naïve Bayes [10]. However, these models have limitations in their ability to capture complex and contextual relationships in text data. The deep-learning approach, based on convolutional neural networks (CNNs) and long short-term memory (LSTM) networks, has been applied to various NLP tasks, including text classification [11]. This architecture has demonstrated remarkable performance. However, its effectiveness depends heavily on data availability [12]. In other words, deep learning-based classification models achieve higher levels of performance as the volume of training data increases. Consequently, this presents a significant challenge because the availability of labeled datasets, which are essential for training effective models, remains a considerable obstacle.
However, in recent years, large language models (LLMs) have been efficiently applied to text classification tasks. Among the various LLMs [13], the most popular are the bidirectional encoder representations from transformers (BERT) family and the generative pretrained transformer (GPT) family. Both families use the most advanced methods for various NLP tasks such as text classification [14], text generation [15], or text summarization [16].
The increasing use of social networks among students has opened new opportunities to monitor their emotional well-being using sentiment analysis techniques. However, this task presents significant technical and ethical challenges. From the inherent ambiguity of informal language [17] to privacy and consent concerns [18], applying artificial intelligence in this context requires methods that are both accurate and respectful of individual rights. This study sought to evaluate the effectiveness of large language models (LLMs) in student sentiment classification, contributing to a deeper understanding of their capabilities and limitations in education. Therefore, this study investigated the performance of different large language models in accurately classifying student comments on social media platforms as depressive/anxious (1) or neutral (0). Given the novelty of LLMs and their diverse architectures, a systematic comparison is essential for evaluating their suitability for this sensitive task. Accurate classification is critical, as it provides valuable insights for educators, enabling them to offer timely and personalized support. This ensures that technology serves as a powerful tool for enhancing students’ well-being and academic success. This study sought to address the following research questions (RQs).
RQ1: What is the performance of LLMs in classifying student comments on social media into categories of anxiety, depression, or neutral sentiment?
RQ2: How can a large language model be effectively integrated into digital educational platforms to support automated detection of emotional risk signals in students?
In this work, the LLMs from the GPT family applied were gpt-4o-mini-2024-07-18 [19], gpt-3.5-turbo-0125 [20], and the open-source gpt-neo-125m from EleutherAI [21]. On the other hand, the bert-base-uncased [22] and roberta-base [23] models from the BERT family were utilized. To address this challenge, the source data were preprocessed, the models were fine-tuned on the same training dataset of student comments previously labeled with sentiment, and the best-performing model was determined by making predictions on the test set and assessing different metrics.
The remainder of this paper is organized as follows. Section 2 provides a brief review of the relevant literature. Section 3 describes the methodology used to facilitate the present investigation, including data collection from student reviews, dataset cleaning, fine-tuning, and evaluation metrics. Section 4 presents the results of the research. Section 5 discusses the implications, highlighting the novelty and contributions of the proposed approach. Lastly, Section 6 includes the conclusions and suggestions for future work.

2. Literature Review

2.1. Related Works

In text classification, the goal is to assign a label, category, or tag to a text body, which can be a comment, sentence, paragraph, or document. As automated text classification has become more widespread, it has been applied to tasks such as sentiment analysis, news classification, topic labeling, emotion detection, and offensive language labeling [24].
Classification of opinions is a multidisciplinary field that can be adapted to different educational domain applications, such as course evaluation, understanding student participation, educational infrastructure constraints, and educational policy decision-making [25,26]. In this sense, when information is extracted from reviews left by students, it can improve teaching and learning practices [27]. Yan et al. [28] and Du [29] conducted studies to analyze student feedback, revealing several critical factors that influence student satisfaction in virtual learning environments. These factors include course content, technical elements, difficulty level, instructor proficiency, video resources, course organization, and workload.
Recent scholarly focus has gravitated towards social media due to its expansive and diverse user base, providing individuals with a daily platform to express opinions on a myriad of topics. These platforms function as open forums in which students actively express their sentiments and viewpoints in educational contexts [30]. Certain studies have explored comments relevant to topics that influence students’ vocational training decisions. For example, Fouad et al. [31] explored the perceptions of women in STEM fields. Similarly, another study scrutinized posts on Reddit related to the radiation profession [32]. These collective findings suggest that predominantly positive feedback can significantly influence student enrollment decisions. However, contrary to these optimistic findings, several studies have unearthed critical perspectives on educational issues. Zhou and Mou [33] highlighted the high expectations of online learners, emphasizing the need for self-discipline and regulation. Students expressed discomfort with prolonged screen time during extended online sessions, as evidenced by sentiments characterized by keywords such as “drowsy” and “anxious.” Similarly, another study that examined student tweets revealed instances of stress experienced by students in online sessions during the COVID-19 pandemic [34].
The anxiety [35] and depressive [36] disorders mentioned above are closely related to students, as they face substantial pressures both at school and within their families, leading to significant psychological stress that can culminate in severe mental health disorders. Consequently, the academic performance, physical well-being, and mental health of students can deteriorate significantly throughout their academic journey and may only become discernible at an advanced stage [37]. Anxiety can lead to feelings of apprehension about academic tasks, which ultimately affects student academic performance [38]. Furthermore, studies such as those by Sad et al. [39], Namkung et al. [40], Barroso et al. [41], and Caviola et al. [42] have consistently demonstrated negative and statistically significant correlations between mathematics anxiety and mathematics achievement. Abdel Latif [43] highlighted anxieties related to the use of the English language by non-native speakers. Moreover, students who participate in computer programming activities often experience anxiety [44] due to the perceived difficulty of programming, which demands high levels of precision. Fear or apprehension regarding programming has been shown to impede students’ skill acquisition and academic performance. Therefore, there is a close relationship between students’ feelings of stress, anxiety, or depression and their academic performance.

2.2. Transformers for Large Language Models

Transformers and large language models are closely related in the field of natural language processing (NLP), which is a branch of artificial intelligence that serves as a bridge between human communication and machine understanding [45]. Transformers are the most widely adopted cutting-edge deep-learning architectures, particularly prevalent in NLP tasks [46]. The success of transformers lies in the fact that they are a neural network architecture that incorporates self-attention mechanisms and position-wise fully connected layers, which allows high parallelization and a consequent reduction in computational costs [47]. Recent research on neural machine translation has shown that optimizing attention and better handling long-range dependencies using techniques such as Content-Adaptive Recurrent Units (CARUs) can significantly improve model performance [48]. These advances are also relevant for tasks such as sentiment classification, in which the correct interpretation of linguistic nuances is crucial.
Building on the foundation laid by transformers, numerous recent NLP models, known as large language models (LLMs), have emerged. Introduced in 2018, LLMs are built upon the transformer architecture [49]. These models, such as OpenAI’s GPT or Google’s BERT, are pretrained on extensive textual datasets. LLMs are distinguished by their deep-learning architecture, which entails a significant number of parameters trained in an unsupervised manner on large volumes of text [49]. It is well established that scaling language models by increasing training data, computational resources, and the number of parameters can enhance their performance and sample efficiency across a wide range of downstream NLP tasks [20,22].
In essence, the relationship between transformers and LLMs is fundamental. LLMs, such as GPT and BERT, are constructed using transformer architecture. These models utilize multiple layers of self-attention and fully connected layers to efficiently process text sequences and capture the long-term relationships between words. This architectural framework enables LLMs to acquire high-quality representations of natural language from large amounts of unlabeled textual data, resulting in significant advances in NLP tasks such as sentiment analysis [50], question classification [51], text classification [52], machine translation, and information extraction. The practical applications of NLP vary, encompassing the creation of chatbots, virtual assistants, language translation services, content summarization tools, and sentiment analysis within social media contexts [53]. By harnessing the power of language, NLP facilitates the development of intelligent systems that can understand and interact with humans in a manner that feels natural and intuitive. This is the case in a study by Parker et al. [54], who evaluated the use of large language models, such as GPT-4 and GPT-3.5, to analyze educational surveys and apply them to tasks such as classification, information extraction, thematic analysis, and sentiment analysis. It is important to note that academic opinions are extracted when surveys are conducted. However, the focus of the present work is on classifying student comments in a more informal and spontaneous environment, such as social networks.

2.2.1. Generative Pretrained Transformer (GPT) Model

Generative pretrained transformers (GPTs) represent a significant leap in the field of natural language processing (NLP). Developed by OpenAI and introduced in 2018, these models have revolutionized the way machines understand and generate human language [55]. Furthermore, training on large amounts of textual data allows GPT models to gain a deep understanding of language patterns and relationships [45]. These types of models are unidirectional, meaning that they only process context by looking backward in a sentence and not forward. Pretraining is autoregressive, wherein the model predicts the next word in a text sequence given a previous sequence of words. This enables the model to generate text in a coherent and relevant manner [56]. Moreover, GPT models exhibit versatility, as they can undergo fine-tuning for specific NLP tasks, thus amplifying their efficacy and adaptability across diverse domains. These models demonstrate exceptional proficiency across a spectrum of NLP applications, encompassing text generation, language translation, content summarization, and text completion [56]. Taking advantage of the capabilities inherent in GPT models, NLP systems have achieved groundbreaking advances in language comprehension, text generation precision, and the provision of customized and contextually relevant results to users [57].
The GPT family of models developed by OpenAI includes both paid and open-source versions, catering to a wide range of users and applications. GPT models are designed using transformer architectures that rely on self-attention mechanisms to generate high-quality text based on large-scale pretraining [47]. For example, GPT 4o-mini and GPT-3.5 are available through a paid API offered by OpenAI, allowing users to integrate advanced language generation into their applications. This commercial version is widely used for content creation, customer service, and coding. On the open-source side, models such as GPT-2 have been released with free access, enabling developers and researchers to experiment with large language models in various applications. Other open-source alternatives inspired by GPT include the GPT-Neo and GPT-J models, which offer comparable performances without the constraints of a proprietary API [21]. These models, whether commercial or open-source, have opened the door to a wide range of applications in artificial intelligence, although they also raise ethical concerns regarding potential misuse.

2.2.2. Bidirectional Encoder Representations from Transformers (BERT)

The BERT model, released by Google in 2018, was designed for deep bidirectional training based on the transformer architecture [22]. One of the primary reasons for the success of BERT is its context-based integration, which allows it to process the context of words by examining the preceding and subsequent words in a sentence [46]. Furthermore, as a self-supervised, masked pretraining model, BERT undergoes pretraining using the task of masking words in a sentence and subsequently predicting them. This model has yielded promising results for various NLP tasks, including sentiment classification, intent detection, and sentence classification [58]. Since its introduction, BERT has inspired several variations to improve the efficiency, scalability, and performance of specific tasks. RoBERTa (robustly optimized BERT approach) improves upon BERT by optimizing training strategies, including longer training on more data and removing the next-sentence-prediction task, which leads to better performance on a range of NLP benchmarks [23]. ALBERT reduces the model’s size and memory usage by sharing parameters across layers and factorizing embeddings, making it more efficient without sacrificing too much accuracy [59]. DistilBERT is a distilled version of BERT that is smaller, faster, and more lightweight while retaining 97% of BERT’s performance, making it useful for applications where speed and resource constraints are critical [60]. Other variants, such as TinyBERT [61] and MiniLM [62], are designed for mobile or edge devices, adapting BERT’s capabilities to environments with limited computing power. These models, which focus on optimization, efficiency, and speed, continue to expand BERT’s applicability across both research and industry, making it foundational in modern NLP applications.

3. Research Methodology

This study aimed to classify the feelings expressed by students in comments obtained from social media applications. As in the field of computer vision, where multiple classifiers have been introduced to improve the accuracy and convergence of concatenated networks, in natural language processing it is critical to optimize the underlying architectures to more effectively capture complex semantic patterns, such as sentiments expressed in informal language [63]. Our study followed this line of thinking by exploring the performance of different large language model architectures in emotional classification.
The bert-base-uncased and roberta-base models from the BERT family were selected for sentiment text classification because of their robust performance and versatility in understanding contextual nuances. Additionally, gpt-4o-mini-2024-07-18, gpt-3.5-turbo-0125, and gpt-neo-125m from the GPT family were selected for sentiment text classification because of their advanced language understanding and generation capabilities. The gpt-4o-mini-2024-07-18 model offers a compact, yet powerful version of GPT-4, making it efficient for tasks requiring nuanced sentiment analysis. Also, gpt-3.5-turbo-0125 provides a balance between performance and computational efficiency, leveraging its extensive training on diverse datasets to interpret and classify sentiments accurately. Finally, gpt-neo-125m, an open-source alternative, is highly customizable and has shown strong performance in various NLP tasks, including sentiment analysis. Therefore, five LLM models were applied in this work: bert-base-uncased, roberta-base, gpt-4o-mini-2024-07-18, gpt-3.5-turbo-0125, and gpt-neo-125m. The entire code base used for this research, covering data preprocessing, the fine-tuning of the models, and the metrics applied to determine their performance, is openly available in the GitHub repository “https://github.com/Anabel-Pilicita/fine-tuning-sentiment-student (accessed on 29 December 2024)”.
To evaluate the performance of the models in making label predictions and to make meaningful comparisons based on the results, it is imperative to adopt a specific research methodology. This approach was deliberately chosen to streamline the software development process as well as the implementation, fine-tuning, and optimization of the models. A comprehensive outline of these steps is illustrated in Figure 1 and elaborated in the following subsections.

3.1. Dataset

We based our research on the student-depression-text corpus, a dataset published in 2023 on the Kaggle platform, which is an online community for data scientists [64]. This public dataset contains 7498 comments from students aged 15 to 17 years who posted on Facebook. The available information did not detail the exact procedure used for labeling. Therefore, to assess the consistency of the labels, a manual review was performed to confirm that most of the short comments were consistent with their assigned labels. Likewise, the dataset was cleaned to exclude unreadable comments. The input dataset consisted of five columns: text, label, Age, Gender, and Age Category. The text column, of the object data type, contained student comments. The label column assigns a numeric value of 0 or 1, categorizing each comment as either neutral or related to anxiety or depression. The Age column stores students’ ages as numeric values, while the Gender column, which is also an object data type, indicates the gender of the students (male or female). Finally, the Age Category column classifies students into categories based on their age. The structure of the dataset is shown in Figure 2.
This study primarily focused on analyzing the data in the text and label columns, as they were critical for achieving the research objectives. The text column contains the students’ comments, and the label column is represented by a value of 0 if the comment is neutral and 1 if the opinion denotes a feeling of anxiety or depression. Most of the comments were short texts, such as those shown in Table 1. Therefore, supervised training was performed in this study because all records already had labels with assigned sentiments.
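As an illustration, a minimal pandas sketch of isolating the text and label columns is shown below; the CSV file name is an assumption, since only the column structure of the corpus is documented here.

```python
import pandas as pd

# Minimal sketch of isolating the two columns used in this study; the CSV file
# name is an assumption, as only the column structure is described here.
df = pd.read_csv("student_depression_text.csv")

# Keep only the comment text and its sentiment label
# (0 = neutral, 1 = anxiety/depression).
df = df[["text", "label"]]
print(df["label"].value_counts())
```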

3.2. Dataset Splitting and Preprocessing

Data preprocessing is essential to ensure the quality and effectiveness of predictive modeling and fine-tuning processes [65]. One of the main tasks of data preprocessing was to remove duplicate entries and leftover blank labels. After this phase, the corpus was reduced to 7476 entries with two labels (0: normal; 1: anxiety/depression).
Splitting the dataset into training (80%) and test (20%) sets is a common practice in NLP model training, especially with large language models (LLMs), owing to the trade-off between efficiency and generalization capability [66]. This partition produced an initial split of 5980 entries for the training set and 1496 entries for the test set. The training set plays a key role during the tuning phase, when the model acquires patterns, relationships, and representations of the input data, allowing it to make predictions or execute specific tasks [67]. In contrast, the test set allows predictions to be made and the performance of the model to be evaluated [67].
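A minimal sketch of this 80%/20% split is shown below, assuming the cleaned corpus is held in a pandas DataFrame df with text and label columns; the fixed random seed and the use of stratification are illustrative assumptions rather than documented settings.

```python
from sklearn.model_selection import train_test_split

# Sketch of the 80%/20% split described above; the random seed and the use of
# stratification are assumptions, not settings documented in the paper.
train_df, test_df = train_test_split(
    df,
    test_size=0.20,            # 20% held out for testing (about 1496 entries)
    stratify=df["label"],      # keep the 0/1 class proportions in both sets
    random_state=42,
)
print(len(train_df), "training entries,", len(test_df), "test entries")
```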
To validate the stability of the results, an additional experiment was performed using k-fold cross-validation. Given the high computational cost and training time limitations of some models, 5-fold cross validation was applied to only one model of each family evaluated in this study: roberta-base of the BERT family and gpt-neo-125m of the GPT family. These are two models that can be trained locally in an efficient way, allowing evaluation of the stability and generalization of the method without compromising the available resources. To ensure the representativeness of the results and minimize bias in data selection, a stratified 5-fold cross-validation was applied using StratifiedKFold from scikit-learn library (version 1.3.2). This method preserves the original distribution of classes in each subset of the dataset, thereby allowing for a more reliable evaluation [68]. The procedure consisted of dividing the dataset into five parts of similar size, maintaining the proportion of classes. In each iteration, four folds were used for training and one for testing, repeating the process five times. At the end of each iteration, the evaluation criteria were calculated and averaged to obtain model performance.
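A short sketch of this stratified 5-fold protocol, using StratifiedKFold from scikit-learn, is given below; the texts and labels arrays and the shuffling seed are assumptions for illustration.

```python
from sklearn.model_selection import StratifiedKFold

# Stratified 5-fold protocol as described above; `texts` and `labels` hold the
# comments and their 0/1 sentiment labels, and the shuffle seed is an assumption.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(skf.split(texts, labels), start=1):
    # Four folds are used for training and one for testing in each iteration;
    # the model would be fine-tuned and evaluated here, and the metrics averaged.
    print(f"Fold {fold}: {len(train_idx)} train / {len(test_idx)} test")
```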

3.3. Model Deployment and Fine-Tuning LLMs

Five LLM models were used in this study. The first three are GPT models: the gpt-4o-mini-2024-07-18, gpt-3.5-turbo-0125, and gpt-neo-125m models. The other two models are based on BERT: the bert-base-uncased and the roberta-base models. The selection of the models responded to different criteria: while bert-base-uncased and roberta-base represent established standards in text classification tasks and gpt-3.5-turbo-0125 and gpt-4o-mini-2024-07-18 stand out for their generative processing capabilities, gpt-neo-125m was included as a smaller, open-architecture alternative. The choice of gpt-neo-125m was based on its accessibility and computational efficiency, which allows exploration of the performance of lighter models against larger-scale and complex options, thus providing a valuable contrast for analysis.
For each of these models, a specific fine-tuning and prediction method was applied for student feedback sentiments. Fine-tuning adapts pretrained models to specific tasks or applications [69]. This refinement of LLMs involves improving pretrained models using smaller task-specific datasets to improve performance on a target task [70]. In this study, a supervised training process was applied, in which the models were trained on labeled student comments. This process allowed the model to become familiar with the terminologies, sentiment classification, and common structures found in the training set to perform sentiment preconditioning in the test set. Each model was trained using a dataset composed of student comments classified into two categories: anxiety/depression (1) and neutral (0). The process included tokenization of texts with architecture-specific tokenizers and adaptation of the output layers for binary classification tasks. Subsequently, the models were trained using 80% and 20% of the data for training and validation, respectively. In addition, in some cases, fivefold cross-validation was performed to assess the consistency of the results. The fitting was carried out for a maximum of three epochs, adjusting the learning rates according to the characteristics of each model and optimizing the hyperparameters through internal validations. This methodology allowed for a fair and controlled comparison of the different architectures evaluated.
The training and prediction processes of the bert-base-uncased, roberta-base, and gpt-neo-125m models were run on a computer with an 11th Gen Intel (R) Core (TM) i7-1165G7 2.80 GHz processor and 64 GB of RAM. The gpt-4o-mini-2024-07-18 and gpt-3.5-turbo-0125 models were used through the OpenAI API because these two models require training on GPUs due to their large scale, advanced architecture, and high demand for parallelism. Additionally, the gpt-4o-mini-2024-07-18 and gpt-3.5-turbo-0125 models do not allow local fine-tuning: their use is limited to default configurations via API calls. However, the present study provides a functional analysis of the performance of each architecture for the same emotional classification task.

3.3.1. gpt-4o-mini-2024-07-18

The gpt-4o-mini-2024-07-18 is OpenAI’s most advanced and cost-efficient small model. Its low cost and low latency make it suitable for tasks like real-time support, handling long contexts, and parallel processing [71]. Predictions and fine-tuning were carried out using the OpenAI API. The model was trained using the training and validation sets. During the fine-tuning process, 593,229 tokens were processed, resulting in a training loss of 0.0985 across three epochs. These results indicated effective learning during training. Upon completion of the fine-tuning phase, the task of sentiment label prediction on the test set was executed. Two JsonL files were created with prompts and completions for each record, as depicted in Figure 3.
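A hedged sketch of this workflow with the official openai Python client is shown below; the JSONL file name, the record schema (which follows Figure 3), and the epoch setting are illustrative assumptions rather than the authors’ exact script.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the JSONL training file (one prompt/completion record per comment,
# following the structure in Figure 3) and launch a fine-tuning job.
# The file name and hyperparameters are illustrative assumptions.
training_file = client.files.create(
    file=open("train_student_comments.jsonl", "rb"),
    purpose="fine-tune",
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={"n_epochs": 3},
)
print(job.id)  # the job can then be polled until the fine-tuned model is ready
```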

3.3.2. gpt-3.5-turbo-0125

OpenAI developed gpt-3.5-turbo-0125, an autoregressive model with 175 billion parameters that understands and generates natural language or code [72]. To make predictions and perform fine-tuning on the base model gpt-3.5-turbo-0125, we used the official OpenAI API. Initially, the base model used the training and validation set to adapt to the specific patterns. The fine-tuning process handled 593,229 tokens with a training loss of 0 over three epochs. These metrics indicate the adaptability and generalization of the models during the fit. After completing the fine-tuning phase, the task of predicting the sentiment labels for the students’ comments on the test set was assigned. To facilitate the fitting of the LLMs, two JsonL files were created, comprising the prompts and completions of each record for prediction, as for gpt-4o-mini-2024-07-18.

3.3.3. gpt-neo-125m

An open-source gpt-neo-125m autoregressive model developed by EleutherAI with a GPT-3 transformer architecture [73] was applied. The EleutherAI/gpt-neo-125m model has 125 million parameters, and its main functionality is to take a text string and predict the following text. For the training process, AutoModelForSequenceClassification from the transformers library was applied, and tokenization was performed considering special tokens. For the fitting process, a sentiment classification model was trained using the training set formed by students’ comments. The hyperparameters for the fine-tuning process included a learning rate of 2 × 10−5, a batch size of 16, and a total of three epochs. After the fitting process, the model began to generate predictions for the test samples.
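A minimal sketch of this setup with the Hugging Face transformers library is given below; the Dataset objects train_ds and test_ds are assumed to already contain tokenized comments with 0/1 labels, and the same pattern applies to bert-base-uncased and roberta-base by changing the checkpoint name.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Sketch of the fine-tuning setup described above (learning rate 2e-5,
# batch size 16, 3 epochs); `train_ds` and `test_ds` are assumed to be
# tokenized datasets of student comments with 0/1 labels.
checkpoint = "EleutherAI/gpt-neo-125m"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token  # GPT-Neo defines no padding token

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

args = TrainingArguments(
    output_dir="gpt-neo-sentiment",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=test_ds)
trainer.train()
```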

3.3.4. bert-base-uncased

A variant of the BERT model, known as bert-base-uncased [74], available on the Hugging Face platform, was used. This model was trained with 110 million parameters and is not case-sensitive: all text inputs were converted to lowercase during training [75]. The self-attention mechanisms built into bert-base-uncased facilitate the capture of contextual dependencies within the input sequences. Furthermore, the bert-base-uncased tokenizer was applied for the tokenization of the training set [76]. For the fine-tuning process, the following hyperparameters were included: a learning rate of 2 × 10−5, a batch size of 16, and a total of three training epochs. After the fine-tuning process, the model generated predictions for the test sample.

3.3.5. roberta-base

In this phase, the roberta-base model, available on the Hugging Face platform [77], was employed. It is a model based on the BERT architecture with a total of 125 million training parameters and incorporates numerous layers of self-attention mechanisms, which facilitates the effective capture of contextual relationships within input sequences [75]. In this process, the roberta-base tokenizer was applied to tokenize text data from the training set. The hyperparameters of the fine-tuning process included a learning rate of 2 × 10−5, a batch size of 16, and three training epochs. After fine-tuning, the model was used to generate predictions for the test set.
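The prediction step shared by the three locally fine-tuned models can be sketched as follows, reusing the trainer and test set assumed in the earlier fine-tuning sketch.

```python
import numpy as np

# Generate predictions on the test set with the fine-tuned classifier;
# `trainer` and `test_ds` follow the earlier fine-tuning sketch.
output = trainer.predict(test_ds)
predicted_labels = np.argmax(output.predictions, axis=-1)  # 0 = neutral, 1 = anxiety/depression
```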

3.4. Evaluation Criteria

In this study, we chose to use recognized metrics in classification tasks, such as accuracy, precision, recall, and F1 score, because of the need to evaluate model performance from different perspectives. Accuracy provides a general measure of correct predictions over the total number of predictions, but can be misleading on imbalanced datasets. Therefore, it was supplemented with precision, which measures the proportion of true positives over predicted positives, and recall, which assesses the model’s ability to correctly identify all positive instances. The F1 score, which is the harmonic mean between precision and recall, provides a balanced indicator that is especially useful in scenarios in which it is crucial to minimize both false positives and false negatives, such as in the detection of sentiments associated with critical emotional states. The combination of these metrics allows for a more complete and accurate assessment of the performance of the analyzed models.
Accuracy is a basic evaluation criterion in the classification task. It is the ratio of correct predictions to the total number of cases examined. Accuracy is calculated as indicated by Equation (1), where the true-positive (TP), true-negative (TN), false-positive (FP), and false-negative (FN) values are considered.
Accuracy = (TP + TN) / (TP + FP + FN + TN)  (1)
The precision metric refers to the proportion of true-positive predictions out of all the instances that were predicted as positive, representing the ratio of true positives to the sum of true positives and false positives, defined by Equation (2). Recall, as indicated by Equation (3), is the ratio of true positives to the sum of true positives and false negatives. Using Equations (2) and (3), F1 is calculated as outlined in Equation (4); it is a widely used metric in the assessment of supervised learning algorithms, defined as the harmonic mean of precision and recall. In this sense, precision and recall are combined into a single metric, the so-called F1, in which both receive the same weight by default, although different weights can be assigned to precision or recall in the calculation [78].
Precision = TP / (TP + FP)  (2)
Recall = TP / (TP + FN)  (3)
F1 = 2 × (Precision × Recall) / (Precision + Recall)  (4)
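For illustration, Equations (1)–(4) can be computed with scikit-learn as sketched below; true_labels and predicted_labels are the test-set labels and model predictions, and binary averaging over the positive class is an assumption.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Compute Equations (1)-(4); binary averaging over the positive
# (anxiety/depression) class is assumed here.
accuracy = accuracy_score(true_labels, predicted_labels)
precision = precision_score(true_labels, predicted_labels)
recall = recall_score(true_labels, predicted_labels)
f1 = f1_score(true_labels, predicted_labels)
print(f"Accuracy={accuracy:.4f}  Precision={precision:.4f}  Recall={recall:.4f}  F1={f1:.4f}")
```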

4. Results

In the field of NLP, the assessment of models plays a crucial role, affording us invaluable insight into the performance and effectiveness of our fine-tuned models. Model evaluation serves as a reference, allowing us to make informed decisions about their suitability and driving us towards continuous advancements in fine-tuning and optimization for specific applications [67]. Table 2 shows the results of applying the traditional 80%–20% split strategy and displays a comprehensive range of evaluations for each model, covering a set of essential evaluation metrics. Table 3 shows the results of the k-fold cross-validation applied to roberta-base and gpt-neo-125m.
In this research, LLMs were used for the task of predicting student sentiment. The main goal was to evaluate the performance of these models in prediction. After a rigorous examination of the responses generated by the gpt-4o-mini-2024-07-18, gpt-3.5-turbo-0125, and gpt-neo-125m models, several notable observations were made. The accuracy metric measures how often a machine learning model correctly predicts the outcome [72]. The gpt-4o-mini-2024-07-18 and gpt-3.5-turbo-0125 models showed a satisfactory degree of accuracy of 98.93% in sentiment prediction, which translates into a successful prediction. On the other hand, the gpt-neo-125m model, while still showing remarkable performance, achieved a slightly lower accuracy rate of 96.46%.
In addition, we ran the same process on the bert-base-uncased and roberta-base models using the same test dataset. The results of this phase were quite effective, as the bert-base-uncased model, after fine-tuning, managed to correctly predict sentiments in 97.13% of the cases. Meanwhile, roberta-base achieved a remarkable accuracy rate of 98.13%, showing a considerable improvement in predictive capability and obtaining the best result between the two BERT-family models for this metric.
On the other hand, the fitted models were also evaluated with the precision metric, which considers the proportion of comments classified as positive by the model that match the labels already established in the test set. It was observed that the gpt-4o-mini-2024-07-18 model obtained the highest precision of 98.96%, gpt-3.5-turbo-0125 98.93%, and roberta-base 93.12%. Finally, the gpt-neo-125m and bert-base-uncased models achieved similar results of 91.44% and 91.76%, respectively. In the subsequent results phase, the recall metric was taken into consideration, which measures how often the model correctly identifies positive instances (true positives) from all the actual positive samples in the dataset. The gpt-4o-mini-2024-07-18 and gpt-3.5-turbo-0125 models obtained the same percentage of 98.93%. From the BERT family, the roberta-base model exhibited a remarkable result of 96.62%. The bert-base-uncased model obtained 92.11%, whereas gpt-neo-125m obtained the lowest value of 88.35%.
Following a comparison of the trained models, the F1 metric was considered for performance, as it helps to minimize both false positives and false negatives [79] and allows balancing the precision and recall metrics. In this study, the F1 metric was necessary because the number of labels with a value of 1 (anxiety or depression) was not equal to the number of labels with a value of 0 (normal). Therefore, the dataset was imbalanced. Subsequently, it was determined that gpt-4o-mini-2024-07-18, with 98.94%, performed best, followed by gpt-3.5-turbo-0125 with 98.93%, roberta-base with 98.14%, bert-base-uncased with 97.13%, and gpt-neo-125m with 96.43%.
The OpenAI models yielded the best results. In fact, the results of the gpt-4o-mini-2024-07-18 model were similar to those of gpt-3.5-turbo-0125. However, gpt-4o-mini is smaller, faster, and less expensive to run, as indicated on the company’s website. The gpt-neo-125m model is an open-source model that predicts sentiment with an accuracy rate of 96.43%. The difference in the accuracy metric between these OpenAI models and gpt-neo-125m was 2.47 percentage points, indicating the superiority of the gpt-4o-mini-2024-07-18 and gpt-3.5-turbo-0125 models over the gpt-neo-125m model. This performance gap can be attributed to the distinct tuning processes applied via the OpenAI API, which may have enhanced their performance in different computational environments compared with gpt-neo-125m. Nonetheless, the results of the gpt-neo-125m model are impressive, demonstrating that open-source models can be competitive in sentiment classification tasks.
Similarly, roberta-base outperformed bert-base-uncased by 1.01 percentage points. This slight difference can be attributed to roberta-base being an enhanced variant, which means that it benefits from more extensive training and optimization. Consequently, its fine-tuning process demonstrates greater robustness and better generalization in sentiment classification tasks.
Overall, all models proved to be quite efficient, exceeding the 95% success rate for sentiment predictions on the dataset used in this study. These excellent results can be attributed to supervised learning, as the corpus was labeled and no blind prediction was performed. Comparing the results between the BERT and GPT families, the gpt-4o-mini-2024-07-18 and roberta-base models exhibited superior performance. However, if we compare these two models separately, a percentage difference of 0.8% favors the gpt-4o-mini-2024-07-18 model over the roberta-base model. This result indicates that when dealing with small-corpus sentiment classification tasks, both variants demonstrate high performance.
An important factor to consider is the cost of each model. In the case of gpt-4o-mini-2024-07-18 and gpt-3.5-turbo-0125, being OpenAI models, there is an economic cost that depends on the subscription plan and the number of times that fine-tuning is applied to the model within the API. There are free plans, but with a usage limit, which makes them restrictive. On the other hand, open-source models such as gpt-neo-125m, bert-base-uncased, and roberta-base have no associated usage fees, but the cost of the equipment on which the models are executed as well as the execution time must be considered. The choice between the two options depends on the specific requirements of the project and the available budget.
The cross-validation analysis with roberta-base showed average accuracy of 98.62%, precision of 94.46%, recall of 97.38%, and F1 of 98.63%. In this case, the values are similar to those obtained with the fixed 80%–20% partition. On the other hand, cross-validation results with gpt-neo-125m indicated average accuracy of 94.18%, precision of 90.15%, recall of 72.13% and F1 of 93.83%. In comparison, the 80%–20% partition produced a slight variation in accuracy and recall values, because each partition can capture different patterns and outliers in the dataset. Overall, these results demonstrate that the variability between the different folds is minimal, which reinforces the robustness of the models.
Although it is common for cross-validation to improve the stability and overall performance of the models, in the specific case of gpt-neo-125m, the results obtained did not follow this trend. This behavior may be related to the model’s own characteristics: since it is a lighter architecture with lower representation capacity than the other models evaluated, it is possible that its performance is more affected by the variation in data distribution between folds. In addition, because it was not originally designed for classification tasks, its adjustment through fine-tuning may be less efficient than that of models optimized for this type of task, such as roberta-base.

5. Discussion

This paper presents the performance of five LLMs: gpt-4o-mini-2024-07-18, gpt-3.5-turbo-0125, gpt-neo-125m, bert-base-uncased, and roberta-base. The main objective was to assess how well various LLMs classify student social media comments as depressive/anxious (1) or neutral (0). Two broad research questions were specified in the Introduction section and are now addressed.
RQ1 addressed the performance of the five large language models in classifying students’ social media comments into categories of anxiety, depression, or neutral sentiment. The results showed that gpt-4o-mini-2024-07-18 and gpt-3.5-turbo-0125 performed slightly better than gpt-neo-125m, bert-base-uncased, and roberta-base. Both OpenAI models use the transformer architecture with attention mechanisms for text processing. However, GPT-4o introduces notable improvements in capability and efficiency. The gpt-4o-mini-2024-07-18 model is positioned as an evolution of gpt-3.5-turbo-0125, incorporating notable improvements in contextual capacity, multimodal processing, and knowledge updates [19].
While both models handle tasks like classification and text generation, gpt-4o-mini-2024-07-18 demonstrates greater readiness to handle more complex tasks, such as analyzing large datasets and multimodal applications [71]. Nevertheless, in the specific context of this study, where large datasets were not used, gpt-3.5-turbo-0125 remained highly competitive, offering precision and efficiency in text-based tasks that placed it very close to gpt-4o-mini-2024-07-18, particularly in emotion classification tasks.
OpenAI models, such as gpt-3.5-turbo-0125 and gpt-4o-mini-2024-07-18, stand out by adapting to tasks like emotion classification without needing large datasets, making them effective in data-scarce contexts. However, they have significant limitations. On the one hand, their implementation can be costly owing to the high computational resource requirements and fees established by OpenAI for their use. On the other hand, reliance on OpenAI’s policies poses risks, particularly if terms change unexpectedly.
The slightly lower performance of gpt-neo-125m, bert-base-uncased, and roberta-base on the tests may reflect differences in the design, training capacity, or optimization techniques specific to each model. Nonetheless, the results achieved by these three models are still commendable, as they are open-source and do not require any usage fees.
In fact, the roberta-base architecture represents an optimization over BERT’s pretraining, with the use of a larger corpus and a dynamic masking strategy that allows it to capture contextual relationships in a more robust way.
bert-base-uncased, while remaining competitive, showed slightly lower performance than roberta-base, evidencing the limitations of its static pretraining against newer methods. This difference suggests that improvements in the pretraining stage have a tangible impact on sentiment classification tasks, especially in short, informal texts, such as those analyzed.
gpt-neo-125m, however, showed more modest performance. This difference can be attributed to several factors: a considerably smaller model, less refined training corpus, and architecture focused on generation tasks rather than specific classifications. However, despite these limitations, its performance was acceptable considering its lower computational demand, which makes it a viable option in scenarios where resources are limited.
Overall, the results confirm that pretraining, model size, and the nature of the architecture (classification vs. generative) critically influence performance on sentiment classification tasks. While models such as roberta-base and BERT offer consistency in supervised classification, generative LLMs such as gpt-3.5-turbo-0125 and gpt-4o-mini-2024-07-18 provide flexibility, but require careful instruction formulation to maximize their performance. These findings highlight the importance of selecting a model not only based on its size but also attending to the specific type of task. All models analyzed in this study (bert-base-uncased, roberta-base, gpt-3.5-turbo-0125, gpt-4o-mini-2024-07-18, and gpt-neo-125m) were subjected to a fine-tuning process using the student feedback dataset.
The selected corpus was already used in the research of Lopes et al. [80], in which the following machine learning techniques were applied, obtaining the following results for the precision, recall, and F1 score metrics: logistic regression achieved a precision, recall, and F1 score of 95%; decision tree scored 91% across all metrics; random forest obtained 92% in precision and recall and 91% in F1 score; naïve Bayes reached 92% in all metrics; support vector machine recorded 93% in precision and recall with an F1 score of 92%; k-nearest neighbors achieved 86% in precision and recall and an F1 score of 82%; and AdaBoost attained 93% in precision, 94% in recall, and 93% in F1 score. The results of the study by Lopes et al. [80] indicate acceptable performance with traditional methods, but also highlight limitations in attempting to capture the complexity of language in these data. In contrast, the present work employed LLMs such as gpt-4o-mini-2024-07-18, gpt-3.5-turbo-0125, gpt-neo-125m, bert-base-uncased, and roberta-base, which achieved notably higher accuracy compared to the traditional models. This difference underscores the superior ability of LLMs to detect emotional nuances, indicating the importance of exploring more advanced approaches to sentiment classification in educational contexts.
Regarding RQ2, Regulation (EU) 2024/1689, the European Union’s Artificial Intelligence Act, governs the adoption and use of artificial intelligence [81]. In this context, the aim is to strengthen control over the AI systems used for educational purposes, especially those affecting students’ privacy and academic development [81]. One of the key provisions is the prohibition of systems that manipulate or exploit vulnerabilities, which is especially important in education due to the sensitivity of student data. Safeguarding the privacy of this information is essential to ensuring an ethical and secure educational environment.
Therefore, in the development of AI-based educational applications, the open-source models used in this study, such as gpt-neo-125m, bert-base-uncased, and roberta-base, could be considered, particularly where data privacy and institutional control are priorities. These models have demonstrated good performance in natural language processing (NLP) tasks and avoid both the associated costs and the dependence on APIs from external providers. This alternative ensures institutional control over data, strengthening the trust and robustness of applications aimed at students.
The regulation also promotes secure, transparent, and ethical AI innovation. This approach is particularly relevant in tasks such as sentiment classification, where ethical and responsible AI use is key to supporting student learning.
The integration of sentiment classification using LLMs into an educational system can be implemented in various ways. For instance, online platforms can use these models to analyze student comments. A study conducted by Dyulicheva [82] used student comments from massive open online courses (MOOCs) to identify signs of anxiety toward mathematics subjects. These results could be utilized by MOOC instructors or psychologists to improve the course content and make useful recommendations. Additionally, there are currently available mobile applications that assist in the treatment of depression and anxiety [83], so a student-focused application could be developed and monitored by parents or educators to track their emotional well-being.
On the other hand, social media platforms have become common spaces where students interact more informally and spontaneously. Platforms like Discord [84] and Twitch [85], often used for group study, could integrate chatbots that use LLMs to detect emotions related to anxiety and depression, sending alerts to the users. This way, lack of motivation could be detected in spaces outside the classroom, helping students stay engaged and emotionally supported.
The results obtained in this study not only demonstrate the effectiveness of language models in the emotional classification task but also allow projection of their possible application in real educational contexts. Based on the observed performance, a practical framework for integrating large language models in digital educational environments can be implemented. This framework contemplates incorporating these models into learning management systems (LMSs), such as Moodle, with the purpose of automatically identifying emotional risk signals in students’ written comments and participation. The proposal is structured in three phases: (1) passive monitoring of written interactions in forums, assignments, or messages; (2) automatic classification of content using previously trained and adjusted models; and (3) generation of interpretable alerts to teachers, counselors, or support staff. This approach not only respects student privacy but also seeks to provide timely support tools without replacing human judgment.
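A hypothetical sketch of phases (2) and (3) is given below; the model path, label name, threshold, and alert handling are illustrative assumptions rather than an implemented LMS integration.

```python
from transformers import pipeline

# Hypothetical sketch of phases (2) and (3): classify forum comments with a
# fine-tuned model and flag those labeled as anxiety/depression for human review.
classifier = pipeline("text-classification", model="gpt-neo-sentiment")  # path is an assumption

def flag_at_risk_comments(comments, threshold=0.90):
    alerts = []
    for comment in comments:
        result = classifier(comment)[0]
        # "LABEL_1" is assumed to correspond to the anxiety/depression class
        if result["label"] == "LABEL_1" and result["score"] >= threshold:
            alerts.append({"comment": comment, "score": result["score"]})
    return alerts  # forwarded to teachers or counselors, never acted on automatically
```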
These models may face limitations when dealing with tasks that require deep reasoning, abstract thinking, or specialized knowledge that only mental health professionals can provide. In this study, the performance of LLMs in classifying students’ emotions was evaluated, but these models are not intended to replace teachers or psychologists, especially for tasks that demand professional intervention. Nevertheless, LLMs can be a useful tool to assist educators in creating individualized support plans, such as adjustments in deadlines or workload for students experiencing high levels of stress.
The effectiveness of LLMs in educational settings highlights how these pedagogical strategies can contribute to the early identification of students who suffer from anxiety or depression, conditions that may negatively impact their academic performance.

6. Conclusions and Future Work

Artificial intelligence models can help personalize education by adapting tasks, pace, and methods to each student’s needs and emotional state. This study addresses two research questions (Section 5). It contributes useful insights for sentiment classification, supported by the strong performance of the five evaluated LLMs.
In an educational setting, applications can integrate LLMs to identify students who may experience anxiety or depression. However, it is essential to consider key aspects such as data privacy and the technologies employed. Models developed by external companies, such as gpt-4o-mini-2024-07-18 and gpt-3.5-turbo-0125, require data to be sent to external servers, which could increase the risk of accidental leaks. In addition, these technologies are often subject to recurring costs. However, models such as gpt-neo-125m, bert-base-uncased, and roberta-base, which are open-source technologies, do not rely on APIs from external vendors or associated fees. This ensures sensitive student data remains within the institution.
Classifying student comments helps with early intervention to protect students’ well-being and academic performance. This study explored the use of LLMs to classify comments left by students on social media applications and posed two objectives that were developed in Section 3 and Section 4. The model that obtained the best results in this study was gpt-4o-mini-2024-07-18, with a performance of 98.93%. However, the rest of the models achieved predictive rates greater than 95% on the accuracy and F1 metrics. Therefore, the results were very encouraging, and LLMs show promise for automatic sentiment detection. A fivefold cross-validation experiment demonstrated the stability and robustness of two of the models in this research when classifying feelings. Future work should apply cross-validation to all models to validate their use in education.
The findings of this study confirm the ability of language models to accurately identify emotional cues in student comments. Beyond technical performance, the results open up the possibility of incorporating these technologies into real educational environments, such as virtual learning platforms. In particular, the consistent behavior of models supports their use in schemes that allow the automatic analysis of written interventions in forums, messages, or assignments. This information can be translated into early warnings for teachers or counselors, facilitating a timely response to possible manifestations of emotional distress. Far from replacing human labor, the idea is to integrate them to reinforce the academic and personal accompaniment of students. While ethical and privacy issues still need to be addressed, this study provides a solid basis for further exploration of the role of language models in students’ well-being in digital teaching contexts.
Supervised models depend on high-quality training data. However, in many real-world applications, available data are often unlabeled. Labeling data remains a major bottleneck in model development. Expert labeling is costly and slow. Crowdsourced labeling is faster, but often inconsistent due to skill variability. LLMs assist with simple annotation tasks, but their role in complex, open-label problems requiring deep understanding is still underexplored. Irony and idioms challenge sentiment models that lack contextual depth. Future research should explore irony detection techniques.
A key challenge is comparing LLMs with other models to assess their suitability for NLP tasks. Our study emphasizes the need to assess LLMs’ ability to classify comments accurately. Future work could integrate multimodal data (text, images, video) to better understand student emotions.

Author Contributions

Conceptualization, A.P. and E.B.; methodology, A.P.; software, A.P.; validation, A.P. and E.B.; formal analysis, A.P.; investigation, A.P. and E.B.; resources, A.P.; data curation, A.P.; writing—original draft preparation, A.P.; writing—review and editing, E.B.; visualization, A.P.; supervision, E.B.; project administration, E.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The dataset is publicly available online at https://github.com/Anabel-Pilicita/fine-tuning-sentiment-student (accessed on 29 December 2024); no additional ethical approval was required.

Informed Consent Statement

Not applicable.

Data Availability Statement

Details of our dataset can be found online at https://github.com/Anabel-Pilicita/fine-tuning-sentiment-student (accessed on 29 December 2024).

Acknowledgments

The authors would like to acknowledge the support of the FUN4DATE (PID2022-136684OB-C22) project funded by the Spanish Agencia Estatal de Investigacion (AEI) 10.13039/501100011033 and TUCAN6-CM (TEC-2024/COM-460), funded by CM (ORDEN 5696/2024).

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Vistorte, A.O.R.; Deroncele-Acosta, A.; Ayala, J.L.M.; Barrasa, A.; López-Granero, C.; Martí-González, M. Integrating artificial intelligence to assess emotions in learning environments: A systematic literature review. Front. Psychol. 2024, 15, 1387089. [Google Scholar] [CrossRef]
  2. Kim, K.-S.; Sin, S.-C.J.; Yoo-Lee, E.Y. Undergraduates’ Use of Social Media as Information Sources. Coll. Res. Libr. 2014, 75, 442–457. [Google Scholar] [CrossRef]
  3. Ali, K.; Dong, H.; Bouguettaya, A.; Erradi, A.; Hadjidj, R. Sentiment Analysis as a Service: A Social Media Based Sentiment Analysis Framework. In Proceedings of the 2017 IEEE 24th International Conference on Web Services (ICWS 2017), Honolulu, HI, USA, 25–30 June 2017; pp. 660–667. [Google Scholar] [CrossRef]
  4. Martín, H.R. Aprendiendo a Aprender: Mejora tu Capacidad de Aprender Descubriendo cómo Aprende tu Cerebro; Vergara: Barcelona, Spain, 2020. [Google Scholar]
  5. World Health Organization. Mental Health of Adolescents. Available online: https://www.who.int/es/news-room/fact-sheets/detail/adolescent-mental-health (accessed on 27 March 2024).
  6. Mahoney, J.L.; Durlak, J.A.; Weissberg, R.P. An update on social and emotional learning outcome research. Phi Delta Kappan 2018, 100, 18–23. [Google Scholar] [CrossRef]
  7. Shaheen, Z.; Wohlgenannt, G.; Filtz, E. Large Scale Legal Text Classification Using Transformer Models. arXiv 2020, arXiv:2010.12871. Available online: http://arxiv.org/abs/2010.12871 (accessed on 27 March 2024).
  8. Li, Q.; Zhao, S.; Zhao, S.; Wen, J. Logistic Regression Matching Pursuit algorithm for text classification. Knowledge-Based Syst. 2023, 277, 110761. [Google Scholar] [CrossRef]
  9. Wang, Y.; Li, X. Mining Product Reviews for Needs-Based Product Configurator Design: A Transfer Learning-Based Approach. IEEE Trans. Ind. Inform. 2021, 17, 6192–6199. [Google Scholar] [CrossRef]
  10. Kang, H.; Yoo, S.J.; Han, D. Senti-lexicon and improved Naïve Bayes algorithms for sentiment analysis of restaurant reviews. Expert Syst. Appl. 2012, 39, 6000–6010. [Google Scholar] [CrossRef]
  11. Paiva, E.; Paim, A.; Ebecken, N. Convolutional Neural Networks and Long Short-Term Memory Networks for Textual Classification of Information Access Requests. IEEE Lat. Am. Trans. 2021, 19, 826–833. [Google Scholar] [CrossRef]
  12. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 53. [Google Scholar] [CrossRef]
  13. Sun, K.; Luo, X.; Luo, M.Y. A Survey of Pretrained Language Models. In Knowledge Science, Engineering and Management; Memmi, G., Yang, B., Kong, L., Zhang, T., Qiu, M., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 442–456. [Google Scholar]
  14. Pawar, C.S.; Makwana, A. Comparison of BERT-Base and GPT-3 for Marathi Text Classification; Springer: Singapore, 2022; pp. 563–574. [Google Scholar] [CrossRef]
  15. Qu, Y.; Liu, P.; Song, W.; Liu, L.; Cheng, M. A Text Generation and Prediction System: Pre-training on New Corpora Using BERT and GPT-2. In Proceedings of the 2020 IEEE 10th International Conference on Electronics Information and Emergency Communication (ICEIEC), Beijing, China, 17–19 July 2020; pp. 323–326. [Google Scholar] [CrossRef]
  16. Yang, B.; Luo, X.; Sun, K.; Luo, M.Y. Recent Progress on Text Summarisation Based on BERT and GPT. In Proceedings of the 16th International Conference on Knowledge Science, Engineering and Management, Guangzhou, China, 16–18 August 2023; pp. 225–241. [Google Scholar] [CrossRef]
  17. Cambria, E.; Schuller, B.; Xia, Y.; Havasi, C. New Avenues in Opinion Mining and Sentiment Analysis. IEEE Intell. Syst. 2013, 28, 15–21. [Google Scholar] [CrossRef]
  18. Floridi, L.; Cowls, J. A Unified Framework of Five Principles for AI in Society. Harv. Data Sci. Rev. 2019, 1, 535–545. [Google Scholar] [CrossRef]
  19. OpenAI. GPT-4 Technical Report. March 2023. Available online: https://cdn.openai.com/papers/gpt-4.pdf (accessed on 27 March 2024).
  20. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; in NIPS ’20. Curran Associates Inc.: Red Hook, NY, USA, 2020. [Google Scholar]
  21. Black, S.; Biderman, S.; Hallahan, E.; Anthony, Q.; Gao, L.; Golding, L.; He, H.; Leahy, C.; McDonell, K.; Phang, J.; et al. GPT-NeoX-20B: An Open-Source Autoregressive Language Model. In Proceedings of the BigScience Episode #5–Workshop on Challenges & Perspectives in Creating Large Language Models, Dublin, Ireland, 27 May 2022; Fan, A., Ilic, S., Wolf, T., Gallé, M., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 95–136. [Google Scholar] [CrossRef]
  22. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies-Proceedings of the Conference; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2019; pp. 4171–4186. [Google Scholar]
  23. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. Available online: http://arxiv.org/abs/1907.11692 (accessed on 27 March 2024).
  24. Fields, J.; Chovanec, K.; Madiraju, P. A Survey of Text Classification With Transformers: How Wide? How Large? How Long? How Accurate? How Expensive? How Safe? IEEE Access 2024, 12, 6518–6531. [Google Scholar] [CrossRef]
  25. Bansal, M.; Verma, S.; Vig, K.; Kakran, K. Opinion Mining from Student Feedback Data Using Supervised Learning Algorithms. In Third International Conference on Image Processing and Capsule Networks; Chen, J.I.-Z., Tavares, J.M.R.S., Shi, F., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 411–418. [Google Scholar]
  26. Shaik, T.; Tao, X.; Dann, C.; Xie, H.; Li, Y.; Galligan, L. Sentiment analysis and opinion mining on educational data: A survey. Nat. Lang. Process. J. 2023, 2, 100003. [Google Scholar] [CrossRef]
  27. Han, Z.; Wu, J.; Huang, C.; Huang, Q.; Zhao, M. A review on sentiment discovery and analysis of educational big-data. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2019, 10, 1328. [Google Scholar] [CrossRef]
  28. Yan, C.; Liu, J.; Liu, W.; Liu, X. Sentiment Analysis and Topic Mining Using a Novel Deep Attention-Based Parallel Dual-Channel Model for Online Course Reviews. Cogn. Comput. 2023, 15, 304–322. [Google Scholar] [CrossRef]
  29. Du, B. Research on the factors influencing the learner satisfaction of MOOCs. Educ. Inf. Technol. 2023, 28, 1935–1955. [Google Scholar] [CrossRef]
  30. Ren, Y.; Tan, X. Research on the Method of Identifying Students’ Online Emotion Based on ALBERT. In Proceedings of the 2021 International Conference on Intelligent Computing, Automation and Applications (ICAA), Nanjing, China, 25–27 June 2021; pp. 646–650. [Google Scholar] [CrossRef]
  31. Fouad, S.; Alkooheji, E. Sentiment Analysis for Women in STEM using Twitter and Transfer Learning Models. In Proceedings of the 17th IEEE International Conference on Semantic Computing, ICSC 2023, Laguna Hills, CA, USA, 1–3 February 2023; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2023; pp. 227–234. [Google Scholar] [CrossRef]
  32. Hameed, M.Y.; Al-Hindi, L.; Ali, S.; Jensen, H.K.; Shoults, C.C. Broadening the Understanding of Medical Students’ Discussion of Radiology Online: A Social Listening Study of Reddit. Curr. Probl. Diagn. Radiol. 2023, 52, 377–382. [Google Scholar] [CrossRef]
  33. Zhou, M.; Mou, H. Tracking public opinion about online education over COVID-19 in China. Educ. Technol. Res. Dev. 2022, 70, 1083–1104. [Google Scholar] [CrossRef]
  34. Jyothsna, R.; Rohini, V.; Paulose, J. Sentiment Analysis of Stress Among the Students Amidst the Covid Pandemic Using Global Tweets. In Ambient Intelligence in Health Care; Swarnkar, T., Patnaik, S., Mitra, P., Misra, S., Mishra, M., Eds.; Springer Nature: Singapore, 2023; pp. 317–324. [Google Scholar]
  35. World Health Organization. Anxiety Disorders. Available online: https://www.who.int/news-room/fact-sheets/detail/anxiety-disorders (accessed on 5 April 2024).
  36. World Health Organization. Depressive Disorder. Available online: https://www.who.int/news-room/fact-sheets/detail/depression (accessed on 13 April 2024).
  37. Mirza, A.A.; Baig, M.; Beyari, G.M.; Halawani, M.A.; Mirza, A.A. Depression and Anxiety Among Medical Students: A Brief Overview. Adv. Med. Educ. Pract. 2021, 12, 393–398. [Google Scholar] [CrossRef]
  38. Hooda, M.; Saini, A. Academic Anxiety: An Overview. Int. J. Educ. Appl. Soc. Sci. 2017, 8, 807–810. [Google Scholar] [CrossRef]
  39. Şad, S.N.; Kış, A.; Demir, M.; Özer, N. Meta-Analysis of the Relationship between Mathematics Anxiety and Mathematics Achievement. Pegem J. Educ. Instr. 2016, 6, 371–392. [Google Scholar] [CrossRef]
  40. Namkung, J.M.; Peng, P.; Lin, X. The Relation Between Mathematics Anxiety and Mathematics Performance Among School-Aged Students: A Meta-Analysis. Rev. Educ. Res. 2019, 89, 459–496. [Google Scholar] [CrossRef]
  41. Barroso, C.; Ganley, C.M.; McGraw, A.L.; Geer, E.A.; Hart, S.A.; Daucourt, M.C. A meta-analysis of the relation between math anxiety and math achievement. Psychol. Bull. 2021, 147, 134–168. [Google Scholar] [CrossRef] [PubMed]
  42. Caviola, S.; Toffalini, E.; Giofrè, D.; Ruiz, J.M.; Szűcs, D.; Mammarella, I.C. Math Performance and Academic Anxiety Forms, from Sociodemographic to Cognitive Aspects: A Meta-analysis on 906,311 Participants. Educ. Psychol. Rev. 2022, 34, 363–399. [Google Scholar] [CrossRef]
  43. Latif, M.M.A. Sources of L2 writing apprehension: A study of Egyptian university students. J. Res. Read. 2015, 38, 194–212. [Google Scholar] [CrossRef]
  44. Nolan, K.; Bergin, S. The role of anxiety when learning to program: A systematic review of the literature. In Proceedings of the 16th Koli Calling International Conference on Computing Education Research, Koli, Finland, 24–27 November 2016; in Koli Calling ’16. Association for Computing Machinery: New York, NY, USA, 2016; pp. 61–70. [Google Scholar] [CrossRef]
  45. Rothman, D. Transformers for Natural Language Processing: Build Innovative Deep Neural; Packt Publishing Ltd.: Birmingham, UK, 2021; Available online: https://books.google.com.ec/books?hl=en&lr=&id=Cr0YEAAAQBAJ&oi=fnd&pg=PP1&ots=a9t6Rt3i21&sig=6AunRon2EtcpjTNULNgtdoA2ODI&redir_esc=y#v=onepage&q&f=false (accessed on 5 April 2024).
  46. Ravichandiran, S. Getting Started with Google BERT: Build and Train State-of-the-Art Natural Language Processing Models Using BERT; Packt Publishing Ltd.: Birmingham, UK, 2021. [Google Scholar]
  47. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017. [Google Scholar]
  48. Im, S.-K.; Chan, K.-H. Neural Machine Translation with CARU-Embedding Layer and CARU-Gated Attention Layer. Mathematics 2024, 12, 997. [Google Scholar] [CrossRef]
  49. Birhane, A.; Kasirzadeh, A.; Leslie, D.; Wachter, S. Science in the age of large language models. Nat. Rev. Phys. 2023, 5, 277–280. [Google Scholar] [CrossRef]
  50. Ataei, T.S.; Javdan, S.; Minaei-Bidgoli, B. Applying Transformers and Aspect-based Sentiment Analysis approaches on Sarcasm Detection. In Proceedings of the Second Workshop on Figurative Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 67–71. [Google Scholar] [CrossRef]
  51. Schmidt, L.; Weeds, J.; Higgins, J.P.T. Data Mining in Clinical Trial Text: Transformers for Classification and Question Answering Tasks. arXiv 2021, arXiv:2001.11268. [Google Scholar] [CrossRef]
  52. Razno, M. Machine Learning Text Classification Model with NLP Approach. In Proceedings of the 3D International Conference Computational Linguistics And Intelligent Systems, Kharkiv, Ukraine, 18–19 April 2019; pp. 71–77. [Google Scholar]
  53. Fanni, S.C.; Febi, M.; Aghakhanyan, G.; Neri, E. Natural Language Processing. In Introduction to Artificial Intelligence; Klontzas, M.E., Fanni, S.C., Neri, E., Eds.; Springer International Publishing: Cham, Switzerland, 2023; pp. 87–99. [Google Scholar] [CrossRef]
  54. Parker, M.J.; Anderson, C.; Stone, C.; Oh, Y. A Large Language Model Approach to Educational Survey Feedback Analysis. Int. J. Artif. Intell. Educ. 2024, 1–38. [Google Scholar] [CrossRef]
  55. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018. Available online: https://gluebenchmark.com/leaderboard (accessed on 10 May 2024).
  56. Roumeliotis, K.I.; Tselikas, N.D.; Nasiopoulos, D.K. Unveiling Sustainability in Ecommerce: GPT-Powered Software for Identifying Sustainable Product Features. Sustainability 2023, 15, 12015. [Google Scholar] [CrossRef]
  57. Liu, X.; Zheng, Y.; Du, Z. GPT understands, too. AI Open 2023, 5, 208–215. [Google Scholar] [CrossRef]
  58. Koroteev, M.V. BERT: A Review of Applications in Natural Language Processing and Understanding. arXiv 2021, arXiv:2103.11943. [Google Scholar]
  59. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations; International Conference on Learning Representations: New Orleans, LA, USA, 2019. [Google Scholar] [CrossRef]
  60. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108. Available online: http://arxiv.org/abs/1910.01108 (accessed on 20 January 2024).
  61. Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. TinyBERT: Distilling BERT for Natural Language Understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 4163–4174. [Google Scholar]
  62. Wang, W.; Wei, F.; Dong, L.; Bao, H.; Yang, N.; Zhou, M. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. arXiv 2020, arXiv:2002.10957. [Google Scholar]
  63. Chan, K.-H.; Im, S.-K.; Ke, W. Multiple classifier for concatenate-designed neural network. Neural Comput. Appl. 2022, 34, 1359–1372. [Google Scholar] [CrossRef]
  64. Kaggle. Kaggle: Your Machine Learning and Data Science Community. Available online: https://www.kaggle.com/ (accessed on 8 April 2024).
  65. Jain, A.; Patel, H.; Nagalapatti, L.; Gupta, N.; Mehta, S.; Guttula, S.; Mujumdar, S.; Afzal, S.; Mittal, R.S.; Munigala, V. Overview and Importance of Data Quality for Machine Learning Tasks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, 6–10 July 2020; pp. 3561–3562. [Google Scholar] [CrossRef]
  66. Meng, Z.; McCreadie, R.; Macdonald, C.; Ounis, I. Exploring Data Splitting Strategies for the Evaluation of Recommendation Models. In Proceedings of the Fourteenth ACM Conference on Recommender Systems, Virtual Event, 22–26 September 2020; ACM: New York, NY, USA, 2020; pp. 681–686. [Google Scholar] [CrossRef]
  67. Roumeliotis, K.I.; Tselikas, N.D.; Nasiopoulos, D.K. LLMs in e-commerce: A comparative analysis of GPT and LLaMA models in product review evaluation. Nat. Lang. Process. J. 2024, 6, 100056. [Google Scholar] [CrossRef]
  68. Gorriz, J.M.; Clemente, R.M.; Segovia, F.; Ramirez, J.; Ortiz, A.; Suckling, J. Is K-fold cross validation the best model selection method for Machine Learning? arXiv 2024, arXiv:2401.16407. [Google Scholar]
  69. Han, X.; Zhang, Z.; Ding, N.; Gu, Y.; Liu, X.; Huo, Y.; Qiu, J.; Yao, Y.; Zhang, A.; Zhang, L.; et al. Pre-trained models: Past, present and future. AI Open 2021, 2, 225–250. [Google Scholar] [CrossRef]
  70. Tinn, R.; Cheng, H.; Gu, Y.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; Poon, H. Fine-tuning large neural language models for biomedical natural language processing. Patterns 2023, 4, 100729. [Google Scholar] [CrossRef]
  71. OpenAI. GPT-4o Mini: Advancing Cost-Efficient Intelligence. OpenAI Blog. Available online: https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence (accessed on 27 December 2024).
  72. OpenAI. Models—OpenAI API. Available online: https://platform.openai.com/docs/models (accessed on 12 April 2024).
  73. EleutherAI. GPT-Neo—EleutherAI. Available online: https://www.eleuther.ai/artifacts/gpt-neo (accessed on 11 April 2024).
  74. Bert-base-uncased Hugging Face. Available online: https://huggingface.co/bert-base-uncased (accessed on 28 September 2022).
  75. Hugging Face. Pretrained Models—Transformers 3.3.0 Documentation. Available online: https://huggingface.co/transformers/v3.3.1/pretrained_models.html (accessed on 10 April 2024).
  76. Hugging Face. BERT—Transformers 3.0.2 Documentation. Available online: https://huggingface.co/transformers/v3.0.2/model_doc/bert.html#berttokenizer (accessed on 9 April 2024).
  77. Hugging Face. FacebookAI/roberta-base Hugging Face. Available online: https://huggingface.co/FacebookAI/roberta-base (accessed on 10 April 2024).
  78. Moré, J. Evaluación de la Calidad de los Sistemas de Reconocimiento de Sentimientos. Available online: https://openaccess.uoc.edu/bitstream/10609/148645/3/Modulo3_EvaluacionDeLaCalidadDeLosSistemasDeReconocimientoDeSentimientos.pdf (accessed on 25 September 2024).
  79. Li, X.; Zhang, H.; Ouyang, Y.; Zhang, X.; Rong, W. A Shallow BERT-CNN Model for Sentiment Analysis on MOOCs Comments. In Proceedings of the 2019 IEEE International Conference on Engineering, Technology and Education (TALE), Yogyakarta, Indonesia, 10–13 December 2019; pp. 1–6. [Google Scholar] [CrossRef]
  80. Lopes, C.; Sahani, S.; Dubey, S.; Tiwari, S.; Yadav, N. Hopeful horizon: Forecast student mental health using data analytics & ml techniques. Int. J. Emerg. Technol. Innov. Res. 2023, 10, 116–125. Available online: https://universalcollegeofengineering.edu.in/wp-content/uploads/2024/03/3.3.1-comps-merged-49-58.pdf (accessed on 16 April 2024).
  81. Regulation (EU) 2024/1689. Official Journal (OJ) of the European Union; European Commission: Brussels, Belgium; Luxembourg, July 2024; Available online: https://artificialintelligenceact.eu/es/ai-act-explorer/ (accessed on 22 February 2025).
  82. Dyulicheva, Y. Learning Analytics in MOOCs as an Instrument for Measuring Math Anxiety. Vopr. Obraz./Educ. Stud. Mosc. 2021, 4, 243–265. [Google Scholar] [CrossRef]
  83. Wasil, A.R.; Venturo-Conerly, K.E.; Shingleton, R.M.; Weisz, J.R. A review of popular smartphone apps for depression and anxiety: Assessing the inclusion of evidence-based content. Behav. Res. Ther. 2019, 123, 103498. [Google Scholar] [CrossRef] [PubMed]
  84. Discord Company. Sobre Discord. Discord. Available online: https://discord.com/company (accessed on 15 March 2021).
  85. Twitch. Keeping Our Community Safe: Twitch 2020 Transparency Report|Twitch Blog. 2020. Available online: https://blog.twitch.tv/es-mx/2021/03/02/keeping-our-community-safe-twitch-2020-transparency-report/ (accessed on 12 March 2021).
Figure 1. Architecture of the proposed methodology.
Figure 2. Dataset student-depression-text.
Figure 3. Fine-tuning JSONL samples.
Table 1. Comment samples.

Sentiment | Comment | Label
Normal | I haven’t showered yet, give me motivation to take a shower | 0
Anxiety/depression | Feeling worried, even though you have a God who is ready to help you in any case | 1
Anxiety/depression | I want to die. But I’m scared of dying painfully. I want to die peacefully. I hate waking up. I don’t want to feel this way. | 1
Table 2. Model performance metric comparison: 80%–20% split.

Model | Accuracy | Precision | Recall | F1
gpt-4o-mini-2024-07-18 | 0.9893 | 0.9896 | 0.9893 | 0.9894
gpt-3.5-turbo-0125 | 0.9893 | 0.9893 | 0.9893 | 0.9893
gpt-neo-125m | 0.9646 | 0.9144 | 0.8835 | 0.9643
bert-base-uncased | 0.9713 | 0.9176 | 0.9211 | 0.9713
roberta-base | 0.9813 | 0.9312 | 0.9662 | 0.9814
Table 3. Model performance metrics comparison: 5-fold cross-validation.

5-Fold | Model | Accuracy | Precision | Recall | F1
Average | roberta-base | 0.9862 | 0.9446 | 0.9738 | 0.9863
Average | gpt-neo-125m | 0.9418 | 0.9015 | 0.7213 | 0.9383
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
